March 6, 2025 (Updated on: March 9, 2025)
Building a Lisp-Like Language (1/?): Building a Lexer
Foreword
Creating a programming language has always fascinated me. This is not an expert’s guide but a learning journey, where I explore the fundamental steps of language design by building a simple Lisp-like language from scratch. Instead of just reading theory, I will experiment, make mistakes and refine my understanding along the way.
This article documents my approach to creating a language syntax that looks like Lisp (but simpler) and to writing a lexer/parser. By working through these concepts, I aim to develop both a deeper understanding of programming language theory and a working foundation for my own language.
This is more of a learning journey, figuring things out as I go.
First step, define a language syntax
This language is a Lisp-like programming language designed for educational purposes; let’s name it eLisp (for easy Lisp). It focuses on simplicity while retaining essential Lisp features.
It adopts Lisp’s:
- parenthesized syntax (or S-expression)
- functional paradigm
- prefix notation
With one big difference:
- Mathematical operations are restricted to exactly two operands (no chained, variadic forms), which makes parsing and comprehension easier
In Lisp, the following code multiplies n, m and p implicitly in a single call:
(* n m p)
While in eLisp we have to express the two multiplications explicitly:
(* n (* m p))
And finally, the language includes a minimal set of keywords and functionality: constructs for defining functions, conditionals, loops, variable assignment and input/output operations.
Here is a very quick documentation of the language:
Quick documentation of our eLisp syntax
As with traditional Lisp, all expressions are enclosed within parentheses and follow a prefix notation (Polish notation). This eliminates operator precedence concerns and maintains a consistent parsing structure.
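For instance, an expression that would need precedence rules in infix notation is unambiguous here (a small illustrative example of my own, not taken from the documentation below):
(+ 1 (* 2 3)) ; evaluates to 7, the nesting makes the order explicit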
Functions are defined using the define keyword, followed by the function name, a list of parameters and the function body. The body consists of a single expression, which represents the return value.
(define (fact n)
  (if (< n 2)
      1
      (* n (fact (- n 1)))))
(display (fact 5))
The if expression evaluates a condition and executes one of two possible branches based on the result. The first argument is the condition, the second is the expression executed if the condition is true and the third is the expression executed if the condition is false.
(if (<= n 1)
    1
    (* n (fact (- n 1))))
The display function prints a value or a string to the output.
(display "Hello, World!")
(display (fact 5))
while
The while loop repeats execution of a block as long as the condition is true. The loop body can consist of multiple expressions.
(set x 0)
(while (< x 10)
  (set x (+ x 1)))
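As noted above, the body may hold several expressions; here is a sketch of what that could look like (my own example, assuming the loop simply evaluates each expression of the body in order):
(set x 0)
(while (< x 10)
  (display x)
  (set x (+ x 1)))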
for
The for loop iterates over a range of values, with an initial value, an upper bound and an implicit increment of 1.
(for i 0 10
  (display i))
Constants are defined using const, which assigns an immutable value to a symbol.
(const PI 3.14159)
(define (circle-area r)
  (* (* PI r) r))
(display (circle-area 5))
The set expression assigns a new value to an existing variable.
(set x 5)
(set x (+ x 1))
The read function captures user input and assigns it to a variable.
(display "Enter a number:")
(set x (read))
(display (* x x))
Second step, create a structured representation
The next step in this process is to transform raw code into a structured representation. This involves two key stages:
- Lexical analysis (lexing, done by the lexer): breaking the source code into tokens, the fundamental building blocks of syntax.
- Syntax analysis (parsing, done by the parser): organizing these tokens into a parse tree, which represents the hierarchical structure of the code according to the language’s grammar.
So basically, for this very basic eLisp code:
(define (fact n)
  (if (<= n 1)
      1
      (* n (fact (- n 1)))))
(display (fact 5))
We would like to organize the raw code into a hierarchical structure such as a tree.
graph TD;
  A(("(")) --> B[define]
  B --> C(("("))
  C --> D[fact]
  C --> E[n]
  C --> F((")"))
  C --> G(("("))
  G --> H[if]
  H --> I(("("))
  I --> J["≤"]
  I --> K[n]
  I --> L[1]
  I --> M((")"))
  H --> N[1]
  H --> O(("("))
  O --> P["×"]
  O --> Q[n]
  O --> R(("("))
  R --> S[fact]
  R --> T(("("))
  T --> U["−"]
  T --> V[n]
  T --> W[1]
  T --> X((")"))
  R --> Y((")"))
  O --> Z((")"))
  G --> AA((")"))
  C --> AB((")"))
  A --> AC(("("))
  AC --> AD[display]
  AD --> AE(("("))
  AE --> AF[fact]
  AE --> AG[5]
  AE --> AH((")"))
  AC --> AI((")"))
And since we have the syntax rules of our language, we can tokenize a raw eLisp file according to the predefined set of rules we established above. That’s what a lexer is: it processes source code and converts it into a series of tokens that a parser can use. So the lexer:
- Recognizes identifiers, keywords, numbers, operators and punctuation in the raw code
- Ignores whitespace and comments (they just make reading easier and are useless for code recognition and execution)
- Reports errors for unrecognized characters (there is no need to go on and execute the program if the raw code is wrong in the first place)
So, let’s code a lexer (in Python, because its regular-expression library is simple and powerful) that takes eLisp code as input and produces a list of tokens as output, such as:
Token([type of token], [value of the token], [position of the token])
Token([type of token], [value of the token], [position of the token])
Token([type of token], [value of the token], [position of the token])
...
With the following types of tokens:
('NUMBER', r'-?\d+(\.\d+)?')
('STRING', r'"[^"]*"')
('IDENTIFIER', r'[A-Za-z_][A-Za-z0-9_-]*')
('OPERATOR', r'[\+\-\*/<>=^]+')
('LPAREN', r'\(')
('RPAREN', r'\)')
('WHITESPACE', r'\s+')
('COMMENT', r';.*')
('UNKNOWN', r'.')
And keywords:
{'if', 'while', 'for', 'set', 'display', 'define', 'const', 'read'}
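Before looking at the full lexer, here is a minimal sketch (not part of the final code, using a trimmed-down specification) of how such pairs can be combined into a single regular expression with named groups; the name of the group that matched tells us the token type:
import re

spec = [
    ('NUMBER', r'\d+(\.\d+)?'),
    ('LPAREN', r'\('),
    ('RPAREN', r'\)'),
    ('WHITESPACE', r'\s+'),
]

# Each pattern becomes a named group; alternatives are tried left to right
pattern = re.compile('|'.join(f'(?P<{name}>{regex})' for name, regex in spec))

for match in pattern.finditer('(3.14)'):
    # lastgroup is the name of the alternative that matched
    print(match.lastgroup, match.group(), match.start())
# LPAREN ( 0
# NUMBER 3.14 1
# RPAREN ) 5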
The lexer code is:
import re

# Token patterns, tried in order: the first alternative that matches wins
TOKEN_SPECIFICATION = [
    ('NUMBER', r'-?\d+(\.\d+)?'),
    ('STRING', r'"[^"]*"'),
    ('IDENTIFIER', r'[A-Za-z_][A-Za-z0-9_-]*'),
    ('OPERATOR', r'[\+\-\*/<>=^]+'),
    ('LPAREN', r'\('),
    ('RPAREN', r'\)'),
    ('WHITESPACE', r'\s+'),
    ('COMMENT', r';.*'),
    ('UNKNOWN', r'.')
]

KEYWORDS = {'if', 'while', 'for', 'set', 'display', 'define', 'const', 'read'}


class Token:
    __slots__ = ('TokenType', 'TokenValue', 'TokenPosition')

    def __init__(self, TokenType, TokenValue, TokenPosition):
        self.TokenType = TokenType
        self.TokenValue = TokenValue
        self.TokenPosition = TokenPosition

    def __repr__(self):
        return f'Token({self.TokenType}, {repr(self.TokenValue)}, {self.TokenPosition})'


class Lexer:
    def __init__(self, source_code):
        self.source_code = source_code
        # Combine all patterns into one regex, one named group per token type
        self.token_regex = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in TOKEN_SPECIFICATION))

    def tokenize(self):
        tokens = []
        for match in self.token_regex.finditer(self.source_code):
            token_type = match.lastgroup
            token_value = match.group()
            token_position = match.start()
            # Whitespace and comments make reading easier but carry no meaning
            if token_type in ('WHITESPACE', 'COMMENT'):
                continue
            # Reserved words are reclassified as keywords
            if token_type == 'IDENTIFIER' and token_value in KEYWORDS:
                token_type = 'KEYWORD'
            # Numeric literals are converted to Python int or float values
            if token_type == 'NUMBER':
                token_value = float(token_value) if '.' in token_value else int(token_value)
            # String literals lose their surrounding quotes
            if token_type == 'STRING':
                token_value = token_value[1:-1]
            tokens.append(Token(token_type, token_value, token_position))
        return tokens
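The entry point is not shown here, but a minimal, hypothetical main.py along these lines lets us run the lexer on a source file (the file handling and the error raised on UNKNOWN tokens are my own additions, assuming the Lexer class above lives in the same file):
import sys

if __name__ == '__main__':
    # Read the eLisp source file passed on the command line
    with open(sys.argv[1], 'r', encoding='utf-8') as source_file:
        source_code = source_file.read()

    tokens = Lexer(source_code).tokenize()

    # Report unrecognized characters instead of passing them on to the parser
    for token in tokens:
        if token.TokenType == 'UNKNOWN':
            raise SyntaxError(f'Unexpected character {token.TokenValue!r} at position {token.TokenPosition}')

    # Print every token, one per line
    for token in tokens:
        print(token)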
Let’s try it:
> python.exe main.py code.elisp
Token(LPAREN, '(', 0)
Token(KEYWORD, 'define', 1)
Token(LPAREN, '(', 8)
Token(IDENTIFIER, 'fact', 9)
Token(IDENTIFIER, 'n', 14)
Token(RPAREN, ')', 15)
Token(LPAREN, '(', 19)
Token(KEYWORD, 'if', 20)
Token(LPAREN, '(', 23)
Token(OPERATOR, '<=', 24)
Token(IDENTIFIER, 'n', 27)
Token(NUMBER, 1, 29)
Token(RPAREN, ')', 30)
Token(NUMBER, 1, 38)
Token(LPAREN, '(', 46)
Token(OPERATOR, '*', 47)
Token(IDENTIFIER, 'n', 49)
Token(LPAREN, '(', 51)
Token(IDENTIFIER, 'fact', 52)
Token(LPAREN, '(', 57)
Token(OPERATOR, '-', 58)
Token(IDENTIFIER, 'n', 60)
Token(NUMBER, 1, 62)
Token(RPAREN, ')', 63)
Token(RPAREN, ')', 64)
Token(RPAREN, ')', 65)
Token(RPAREN, ')', 66)
Token(RPAREN, ')', 67)
Token(LPAREN, '(', 70)
Token(KEYWORD, 'display', 71)
Token(LPAREN, '(', 79)
Token(IDENTIFIER, 'fact', 80)
Token(NUMBER, 5, 85)
Token(RPAREN, ')', 86)
Token(RPAREN, ')', 87)
Cool, the lexer successfully tokenized our raw eLisp code. The next step is to convert the tokens into an AST (Abstract Syntax Tree).