Building a Lisp-Like Language (2/?): AST and parser

Abstract syntax tree (AST)

In the previous post, I built a lexer that breaks my eLisp language source code into tokens. The next step is to create a parser: its goal is to turn these tokens into a structured format that an interpreter can process (probably a future post). This involves constructing an Abstract Syntax Tree (AST). The AST represents the hierarchical structure of an eLisp program, capturing the relationships between elements to ensure correct execution.

Let’s go back to our previous example:

(define (fact n)
  (if (<= n 1)
      1
      (* n (fact (- n 1)))))

The AST of this program should look like the one below, ensuring that execution follows a structured and recursive process:

graph TD;
    A["Program"] --> B["Body"]
    
    B --> C["FunctionDefinition"]
    C --> D["Head"]
    D --> E["FunctionName"]
    E --> F["fact"]
    D --> G["Parameters"]
    G --> H["n"]
    
    C --> I["Body"]
    I --> J["IfStatement"]
    
    J --> K["Head"]
    K --> L["FunctionName"]
    L --> M["if"]
    
    J --> N["Arguments"]
    N --> O["IfCondition"]
    O --> P["ConditionExpression"]
    P --> Q["FunctionCall"]
    
    Q --> R["Head"]
    R --> S["FunctionName"]
    S --> T["≤"]
    Q --> U["Arguments"]
    U --> V["ParameterReference"]
    V --> W["n"]
    U --> X["Literal"]
    X --> Y["1"]
    
    N --> Z["ThenBranch"]
    Z --> AA["Literal"]
    AA --> AB["1"]
    
    N --> AC["ElseBranch"]
    AC --> AD["FunctionCall"]
    
    AD --> AE["Head"]
    AE --> AF["FunctionName"]
    AF --> AG["×"]
    
    AD --> AH["Arguments"]
    AH --> AI["ParameterReference"]
    AI --> AJ["n"]
    
    AH --> AK["FunctionCall"]
    AK --> AL["Head"]
    AL --> AM["FunctionName"]
    AM --> AN["fact"]
    
    AK --> AO["Arguments"]
    AO --> AP["FunctionCall"]
    
    AP --> AQ["Head"]
    AQ --> AR["FunctionName"]
    AR --> AS["−"]
    
    AP --> AT["Arguments"]
    AT --> AU["ParameterReference"]
    AU --> AV["n"]
    AT --> AW["Literal"]
    AW --> AX["1"]
    
    B --> AY["FunctionCall"]
    AY --> AZ["Head"]
    AZ --> BA["FunctionName"]
    BA --> BB["display"]
    
    AY --> BC["Arguments"]
    BC --> BD["FunctionCall"]
    BD --> BE["Head"]
    BE --> BF["FunctionName"]
    BF --> BG["fact"]
    
    BD --> BH["Arguments"]
    BH --> BI["Literal"]
    BI --> BJ["5"]

Keywords and unified representation of functions

In the previous article, I selected a minimal set of keywords (define, if, set, while, for, const, display, and read) that capture essential programming concepts without overwhelming learners. Building on that, I decided to structure my Abstract Syntax Tree (AST) with clear and intuitive nodes, each representing a distinct language construct:

  • Program: The root node, holding all the expressions in the language.
  • FunctionDefinition: Clearly separates function declarations (Head) from their executable content (Body).
  • FunctionCall: Represents invocation of built-in and user-defined functions uniformly.
  • Arguments: Groups the arguments passed to functions or expressions.
  • Literal: Represents constant values such as numbers and strings.
  • ParameterReference: Identifies variables or parameters used within expressions.
  • SetExpression: Clearly defines variable assignments and updates.
  • WhileExpression: Represents loop constructs that repeat as long as a condition is true.
  • ForExpression: Provides a clear structure for loop constructs with a predefined iteration range.
  • IfExpression: Represents conditional logic with clearly separated conditions, then-branch, and else-branch.
  • ConstExpression: Defines immutable variables that cannot be altered after initialization.

I explicitly chose to separate Head (declarations such as function names and parameters) and Body (executable code) in my AST. This design mirrors traditional compiler theory and lambda calculus-based languages, helping me to understand the difference between defining functions or loops and executing them.

Finally, an important decision I made was to treat built-in functions (like display and read) exactly the same as user-defined functions (fact, square, etc.) in the AST: both are represented simply as a FunctionName node. I wanted to preserve the elegance and simplicity of Lisp: every function call, whether built-in or user-defined, follows the same conceptual structure. Semantic distinctions between built-ins and user-defined functions are deferred until interpretation.
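
To make this concrete, here is a small illustrative sketch (not the real parser code; the dict-based node shape is my own shorthand, not the ASTNode class used later) showing that a built-in call and a user-defined call end up with exactly the same shape:

```python
# Illustrative sketch only: a dict-based stand-in for AST nodes,
# showing that built-in and user-defined calls share one structure.
def call_node(name, *args):
    # Head carries the function name; Arguments carries the operands.
    return {'type': 'FunctionCall',
            'head': {'FunctionName': name},
            'arguments': list(args)}

builtin = call_node('display', {'type': 'ParameterReference', 'value': 'x'})
user = call_node('fact', {'type': 'Literal', 'value': 5})

# Same keys, same nesting: only the FunctionName differs.
print(builtin.keys() == user.keys())  # True
```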

Recursive descent parsing

There are several ways to parse source code. I chose a recursive descent approach, a top-down parsing technique that processes the input according to a context-free grammar (CFG), such as eLisp’s.
I won’t cover the technique in depth, but here are the main ideas:

  • Each function corresponds to a grammar rule and is responsible for recognizing and processing that part of the input.
  • The parser reads tokens from left to right, attempting to match the current input token with expected patterns based on the grammar.
  • Recursion naturally handles nested structures, allowing the parser to process complex expressions in a structured way.
  • If a function successfully parses its corresponding structure, it returns an abstract representation (an AST).
  • If parsing fails, the function either backtracks or reports an error.

Unsurprisingly, recursive descent is particularly well-suited for languages with nested and hierarchical structures, such as Lisp-like languages, where expressions are recursively defined.

For example:

(+ (* 2 3) (- 5 1))

The tree-like structure would be:

      (+)
     /   \
   (*)   (-)
  /  \   /  \
 2   3  5    1
  • It first processes the outer + expression.
  • It then recursively processes the subexpressions (* 2 3) and (- 5 1).
  • This mirrors the natural way Lisp expressions are structured, making recursive descent a fitting choice.
    (Note that the recursive descent approach is not suitable for all languages; grammars with left recursion, for example, must be rewritten first.)
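
As an illustration, a bare-bones recursive descent parser for plain s-expressions (a toy sketch on pre-split tokens, independent of the eLisp parser presented below) fits in a few lines:

```python
# Toy recursive-descent sketch: an expression is either an atom or a
# parenthesised list of expressions.
def parse(tokens, i=0):
    if tokens[i] == '(':
        node, i = [], i + 1
        while tokens[i] != ')':
            child, i = parse(tokens, i)   # recursion handles nesting
            node.append(child)
        return node, i + 1                # skip the closing ')'
    return tokens[i], i + 1               # atom

tokens = '( + ( * 2 3 ) ( - 5 1 ) )'.split()
tree, _ = parse(tokens)
print(tree)  # ['+', ['*', '2', '3'], ['-', '5', '1']]
```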

Parser

So, based on my understanding of the recursive descent approach and the abstract syntax tree I wish to obtain, let’s revisit the lexer:

lexer.py

import re

TOKEN_SPECIFICATION = [
    ('NUMBER', r'-?\d+(\.\d+)?'),
    ('STRING', r'"[^"]*"'),
    ('IDENTIFIER', r'[A-Za-z_][A-Za-z0-9_-]*'),
    ('OPERATOR', r'[\+\-\*/<>=^]+'),
    ('LPAREN', r'\('),
    ('RPAREN', r'\)'),
    ('WHITESPACE', r'\s+'),
    ('COMMENT', r';.*'),
    ('UNKNOWN', r'.')
]

KEYWORDS = {'if', 'while', 'for', 'set', 'display', 'define', 'const', 'read'}


class Token:
    __slots__ = ('TokenType', 'TokenValue', 'TokenPosition')

    def __init__(self, TokenType, TokenValue, TokenPosition):
        self.TokenType = TokenType
        self.TokenValue = TokenValue
        self.TokenPosition = TokenPosition

    def __repr__(self):
        return f'Token({self.TokenType}, {repr(self.TokenValue)}, {self.TokenPosition})'


class Lexer:
    def __init__(self, source_code):
        self.source_code = source_code
        self.token_regex = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in TOKEN_SPECIFICATION))

    def tokenize(self):
        tokens = []
        for match in self.token_regex.finditer(self.source_code):
            token_type = match.lastgroup
            token_value = match.group()
            token_position = match.start()

            if token_type in ('WHITESPACE', 'COMMENT'):
                continue

            if token_type == 'IDENTIFIER' and token_value in KEYWORDS:
                token_type = 'KEYWORD'
            
            if token_type == 'NUMBER':
                token_value = float(token_value) if '.' in token_value else int(token_value)

            if token_type == 'STRING':
                token_value = token_value[1:-1]

            tokens.append(Token(token_type, token_value, token_position))

        return tokens
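
As a quick sanity check, here is a hypothetical usage sketch of the same regex-based tokenization on a tiny input (the specification is abbreviated and KEYWORDS reduced here; value conversion is skipped, so everything stays a string):

```python
# Abbreviated version of the lexer.py approach, applied to one input.
import re

TOKEN_SPECIFICATION = [
    ('NUMBER', r'-?\d+(\.\d+)?'),
    ('STRING', r'"[^"]*"'),
    ('IDENTIFIER', r'[A-Za-z_][A-Za-z0-9_-]*'),
    ('OPERATOR', r'[\+\-\*/<>=^]+'),
    ('LPAREN', r'\('),
    ('RPAREN', r'\)'),
    ('WHITESPACE', r'\s+'),
]
regex = re.compile('|'.join(f'(?P<{n}>{p})' for n, p in TOKEN_SPECIFICATION))
KEYWORDS = {'define', 'if'}

tokens = []
for m in regex.finditer('(define x 5)'):
    kind = m.lastgroup
    if kind == 'WHITESPACE':
        continue  # skipped, exactly as in the real lexer
    if kind == 'IDENTIFIER' and m.group() in KEYWORDS:
        kind = 'KEYWORD'  # identifiers from the keyword set are promoted
    tokens.append((kind, m.group()))

print(tokens)
# [('LPAREN', '('), ('KEYWORD', 'define'), ('IDENTIFIER', 'x'), ('NUMBER', '5'), ('RPAREN', ')')]
```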

Now, we need a parser that consumes tokens from the lexer and produces an AST:

parser.py

class ASTNode:
    def __init__(self, node_type, value=None, children=None):
        self.node_type = node_type
        self.value = value
        self.children = children or []

    def __repr__(self):
        return f'ASTNode({self.node_type}, {repr(self.value)}, {self.children})'


class Parser:
    SPECIAL_FORMS = {'define', 'if', 'set', 'while', 'for', 'const'}
    BUILTIN_FUNCTIONS = {'display', 'read'}

    def __init__(self, tokens):
        self.tokens = tokens
        self.current_token_index = 0
        self.defined_constants = set()

    def peek(self):
        return self.tokens[self.current_token_index] if self.current_token_index < len(self.tokens) else None

    def consume(self):
        token = self.peek()
        if token:
            self.current_token_index += 1
        return token

    def parse(self):
        body = []
        while self.peek():
            if self.peek().TokenType == 'RPAREN':
                self.consume()
                continue
            body.append(self.parse_expression())
        return ASTNode("Program", children=body)

    def parse_expression(self):
        token = self.peek()
        # Guard against running past the end of the token stream.
        if token is None or token.TokenType == 'RPAREN':
            self.consume()
            return None
        if token.TokenType == 'LPAREN':
            return self.parse_list()
        if token.TokenType in {'NUMBER', 'STRING'}:
            return ASTNode("Literal", self.consume().TokenValue)
        if token.TokenType == 'IDENTIFIER':
            return self.consume_identifier()
        if token.TokenType in {'OPERATOR', 'KEYWORD'}:
            return self.parse_list()
        raise SyntaxError(f"Unexpected token: {token}")

    def consume_identifier(self):
        token = self.consume()
        return ASTNode("ConstantReference" if token.TokenValue in self.defined_constants else "ParameterReference", token.TokenValue)

    def parse_list(self):
        self.consume()  # skip the opening '('
        head_token = self.peek()
        if head_token.TokenType == 'LPAREN':
            return self.parse_expression()
        head_token = self.consume()
        if head_token.TokenType in {'IDENTIFIER', 'OPERATOR'}:
            return self.parse_function_call(head_token.TokenValue)
        if head_token.TokenType == 'KEYWORD':
            return self.parse_keyword_expression(head_token.TokenValue)
        raise SyntaxError(f"Unexpected token in list head: {head_token}")

    def parse_keyword_expression(self, keyword):
        parse_methods = {
            'define': self.parse_define,
            'if': self.parse_if,
            'set': self.parse_set,
            'while': self.parse_while,
            'for': self.parse_for,
            'const': self.parse_const,
        }
        if keyword in parse_methods:
            return parse_methods[keyword]()
        if keyword in self.BUILTIN_FUNCTIONS:
            return self.parse_function_call(keyword)
        raise SyntaxError(f"Unknown keyword: {keyword}")

    def parse_define(self):
        self.consume()  # skip the '(' that opens the (name params...) head
        func_name_token = self.consume()
        func_name_token = self.consume()
        if func_name_token.TokenType != 'IDENTIFIER':
            raise SyntaxError("Expected function name in define")
        parameters = self.parse_parameters()
        body = self.parse_body()
        return ASTNode("FunctionDefinition", children=[
            ASTNode("Head", children=[
                ASTNode("FunctionName", func_name_token.TokenValue),
                ASTNode("Parameters", children=parameters)
            ]),
            ASTNode("Body", children=body)
        ])

    def parse_const(self):
        var = self.consume()
        if var.TokenType != 'IDENTIFIER':
            raise SyntaxError("Expected identifier after 'const'")
        value = self.parse_expression()
        self.consume()
        self.defined_constants.add(var.TokenValue)
        return ASTNode("ConstExpression", children=[ASTNode("ConstantIdentifier", var.TokenValue), value])

    def parse_if(self):
        return self.parse_control_structure("IfStatement", ["IfCondition", "ThenBranch", "ElseBranch"])

    def parse_set(self):
        var = self.consume()
        if var.TokenType != 'IDENTIFIER':
            raise SyntaxError("Expected identifier after 'set'")
        value = self.parse_expression()
        self.consume()
        return ASTNode("SetExpression", children=[ASTNode("ParameterReference", var.TokenValue), value])

    def parse_while(self):
        return self.parse_control_structure("WhileExpression", ["Condition", "Body"])

    def parse_for(self):
        loop_var = self.consume()
        if loop_var.TokenType != 'IDENTIFIER':
            raise SyntaxError("Expected loop variable identifier in 'for'")
        start_expr = self.parse_expression()
        end_expr = self.parse_expression()
        body = self.parse_body()
        return ASTNode("ForExpression", children=[
            ASTNode("Head", children=[ASTNode("FunctionName", "for")]),
            ASTNode("Arguments", children=[
                ASTNode("LoopVariable", children=[ASTNode("ParameterReference", loop_var.TokenValue)]),
                ASTNode("StartExpression", children=[start_expr]),
                ASTNode("EndExpression", children=[end_expr])
            ]),
            ASTNode("Body", children=body)
        ])

    def parse_control_structure(self, structure_name, branch_names):
        branches = [self.parse_expression() for _ in range(len(branch_names))]
        self.consume()
        return ASTNode(structure_name, children=[
            ASTNode("Head", children=[ASTNode("FunctionName", structure_name.lower())]),
            ASTNode("Arguments", children=[ASTNode(name, children=[expr]) for name, expr in zip(branch_names, branches)])
        ])

    def parse_function_call(self, func_name):
        args = self.parse_body()
        return ASTNode("FunctionCall", children=[
            ASTNode("Head", children=[ASTNode("FunctionName", func_name)]),
            ASTNode("Arguments", children=args)
        ])

    def parse_parameters(self):
        parameters = []
        while self.peek() and self.peek().TokenType != 'RPAREN':
            param = self.consume()
            if param.TokenType != 'IDENTIFIER':
                raise SyntaxError("Expected parameter identifier")
            parameters.append(ASTNode("ParameterReference", param.TokenValue))
        self.consume()
        return parameters

    def parse_body(self):
        body = []
        while self.peek() and self.peek().TokenType != 'RPAREN':
            body.append(self.parse_expression())
        self.consume()
        return body
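
The Parser above is driven entirely by the peek/consume cursor pattern. Shown standalone (a minimal sketch over a plain list, without the Token class):

```python
# Standalone sketch of the peek/consume cursor pattern used by Parser.
class Cursor:
    def __init__(self, items):
        self.items = items
        self.i = 0

    def peek(self):
        # Look at the current item without advancing.
        return self.items[self.i] if self.i < len(self.items) else None

    def consume(self):
        # Return the current item and advance the cursor.
        item = self.peek()
        if item is not None:
            self.i += 1
        return item

c = Cursor(['(', 'define', 'x', ')'])
print(c.peek())     # '('  (cursor unchanged)
print(c.consume())  # '('  (cursor advances)
print(c.peek())     # 'define'
```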

Let’s connect everything and try it with the factorial eLisp code (stored in example.elisp):

main.py

import argparse
from lexer import Lexer
from parser import Parser


def pretty_print_ast(node, indent=0):
    """Recursively prints the AST with indentation."""
    spacing = "  " * indent
    print(f"{spacing}{node.node_type}: {node.value}" if node.value is not None else f"{spacing}{node.node_type}")
    for child in node.children:
        pretty_print_ast(child, indent + 1)


def read_source_file(filename):
    """Reads the source code from a given file."""
    try:
        with open(filename, "r", encoding="utf-8") as file:
            return file.read()
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")
        exit(1)
    except Exception as e:
        print(f"Error reading file '{filename}': {e}")
        exit(1)


def main():
    """Main function to tokenize, parse, and print the AST for Lisp-like code."""
    arg_parser = argparse.ArgumentParser(description="Parse Lisp-like code into an AST.")
    arg_parser.add_argument("filename", help="Path to Lisp-like input file.")
    args = arg_parser.parse_args()

    source_code = read_source_file(args.filename)

    lexer = Lexer(source_code)
    tokens = lexer.tokenize()

    parser = Parser(tokens)
    ast = parser.parse()

    pretty_print_ast(ast)


if __name__ == "__main__":
    main()

With our factorial example eLisp code, the output is:

python3 main.py example.elisp   
Program
  FunctionDefinition
    Head
      FunctionName: fact
      Parameters
        ParameterReference: n
    Body
      IfStatement
        Head
          FunctionName: ifstatement
        Arguments
          IfCondition
            FunctionCall
              Head
                FunctionName: <=
              Arguments
                ParameterReference: n
                Literal: 1
          ThenBranch
            Literal: 1
          ElseBranch
            FunctionCall
              Head
                FunctionName: *
              Arguments
                ParameterReference: n
                FunctionCall
                  Head
                    FunctionName: fact
                  Arguments
                    FunctionCall
                      Head
                        FunctionName: -
                      Arguments
                        ParameterReference: n
                        Literal: 1
  FunctionCall
    Head
      FunctionName: display
    Arguments
      FunctionCall
        Head
          FunctionName: fact
        Arguments
          Literal: 5

It matches the designed AST. But this is far from finished. The next steps will be to perform a semantic analysis to check for errors not caught by parsing (undefined variables, type mismatches, incorrect numbers of arguments…) and to build an interpreter that traverses the AST and evaluates expressions. The goal is to execute the eLisp code directly and produce results.