
Background

A while back, I became interested in the inner workings of programming languages: compilers, interpreters, and all that. I decided to follow the book Crafting Interpreters and write a small interpreter in Kotlin (instead of the Java used in the book) for a course project. A few years later, I used Java to rewrite essentially the same project, but with a newly designed syntax, to power the programmable part of a calculator built for my last programming course at Vanier College, a project-driven course. The programmable calculator language was later separated out and reimplemented as CalcLox, so it can be showcased without revealing the actual course-relevant details.

I took a break from interpreters during my first term at the University of Waterloo, then decided to pick them back up. This time, I want to finish a language by following a book, and then design and implement a language from scratch, with my own syntax and features, hopefully making it self-hosting one day.

The plan is simple: in the end, design a sort-of-C-style language, based on the syntax of C, Lox (from Crafting Interpreters), Monkey (from Writing An Interpreter In Go and Writing A Compiler In Go), and some other languages I appreciate (Kotlin, Rust, and maybe others, I don't know yet). An interpreter will first be implemented in Rust, then a bytecode compiler and VM (in a combination of Rust and Zig), and further work will build on the bytecode VM implementation until the language becomes self-hosting.

Before any of this happens, I want to refresh my knowledge, learn about writing a bytecode compiler and VM, and get familiar with the languages. Four projects spawn off this plan:

  1. An interpreter for Lox, following the book Crafting Interpreters, implemented in Kotlin (again, since I lost the first one I wrote). Named kotlox.
  2. A bytecode compiler and VM for Lox, following the book Crafting Interpreters, implemented in Zig. Named ziglox.
  3. An interpreter for Monkey, following the book Writing An Interpreter In Go, implemented in Rust. Named monkey-interpreter.
  4. A bytecode compiler and VM for Monkey, following the book Writing A Compiler In Go, implemented in Rust and C. Named monkey-compiler-and-vm.

Introduction to this Series

I will not be publishing much about the Lox-based projects, since I believe documenting the process of following Writing An Interpreter In Go and Writing A Compiler In Go is sufficient and more interesting (Rust and C differ from Go far more than Kotlin differs from Java). This series, Building an Interpreter, will focus on the journey of writing the Monkey interpreter in Rust following the Writing An Interpreter In Go book, and a subsequent series, Building a Compiler, will cover Writing A Compiler In Go and the Monkey bytecode compiler and VM project in Rust and C.

I will not be covering every detail of the process either, but I would love to share some of the thoughts and insights I had along the way, as well as some of the code snippets I find interesting. The goal is to share my learning process and hopefully inspire others to learn about programming languages and interpreters. Let's get started!

Building the Lexer

This article covers Chapters 1.1 to 1.3 of the book, and commits 7181291...f275182 of the monkey-interpreter code base.

The first step in building a language interpreter is to build a lexer, which is responsible for taking the source code as input and producing a stream of tokens as output. A token is a meaningful unit of the source code, such as a keyword, an identifier, a literal, or an operator. The lexer is also responsible for skipping over whitespace and comments. So:

let x = 5 + 10;

should be tokenized into:

[
  Let,
  Identifier("x"),
  Assign,
  Integer(5),
  Plus,
  Integer(10),
  Semicolon
]

and so should the same code with irregular whitespace:

let     x= 5   +  10 ;

The list of tokens above is NOT the actual output of the implemented lexer (more info, such as the literal value of the token, is also included in the actual output), but it gives a general idea of what the lexer is supposed to do. The lexer should also be easily extendable, so that we can add more tokens (and hence features) to the language in the future.

The Initial Token Set

The first lexer should be able to handle the following subset of the Monkey language:

let five = 5;
let ten = 10;

let add = fn(x, y) {
  x + y;
};

let result = add(five, ten);

The following definition of token and token types is sufficient for the lexer to handle the above code:

rust
#[derive(Debug, Eq, PartialEq)]
pub enum TokenType {
    Illegal,
    Eof,

    // Identifiers + Literals
    Identifier,         // foobar, x, y, fn, ...
    Integer,            // 65536

    // Operators
    Assign,             // =
    Plus,               // +

    // Delimiters
    Comma,              // ,
    Semicolon,          // ;
    LeftParenthesis,    // (
    RightParenthesis,   // )
    LeftBrace,          // {
    RightBrace,         // }

    // Keywords
    Function,           // fn
    Let,                // let
}

#[derive(Debug)]
pub struct Token {
    pub token_type: TokenType,
    pub literal: String,
}

Note that Illegal and Eof are special token types that are used to indicate an illegal token (a token that is not recognized by the lexer) and the end of the file (when the lexer has reached the end of the source code), respectively.

Handling the Initial Token Set

A basic test is written first to exercise the core functionality of the lexer. With the input

rust
let input = "=+,(){}.;";

the expected output is

rust
let expected_output = vec![
    Token { token_type: TokenType::Assign, literal: "=".to_string() },
    Token { token_type: TokenType::Plus, literal: "+".to_string() },
    Token { token_type: TokenType::Comma, literal: ",".to_string() },
    Token { token_type: TokenType::LeftParenthesis, literal: "(".to_string() },
    Token { token_type: TokenType::RightParenthesis, literal: ")".to_string() },
    Token { token_type: TokenType::LeftBrace, literal: "{".to_string() },
    Token { token_type: TokenType::RightBrace, literal: "}".to_string() },
    Token { token_type: TokenType::Illegal, literal: ".".to_string() },
    Token { token_type: TokenType::Semicolon, literal: ";".to_string() },
    Token { token_type: TokenType::Eof, literal: "".to_string() },
];

A tokenizer (lexer) is then needed to produce the above output from the input string. The implementation for this input is straightforward: the tokenizer only needs to store the input string, a current read position, and the current character, and every token to be processed is a single character. The next_token method simply matches the current character against the expected single-character tokens and returns the corresponding token. There are only two special cases to handle:

  • if the current character is not recognized as a valid token, the lexer should return an Illegal token with the literal value of the current character.
  • if the current character is None, which indicates the end of the file, the lexer should return an Eof token with an empty literal value.
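Putting those pieces together, a minimal single-character lexer might look like the sketch below. The struct layout and method names here are my own guesses at the shape of such a lexer, not necessarily what the monkey-interpreter code base uses.

```rust
// A minimal sketch of the single-character lexer described above.
// Only the token types needed for the "=+,(){}.;" test are included.

#[derive(Debug, Eq, PartialEq)]
pub enum TokenType {
    Illegal, Eof,
    Assign, Plus,
    Comma, Semicolon,
    LeftParenthesis, RightParenthesis,
    LeftBrace, RightBrace,
}

#[derive(Debug)]
pub struct Token {
    pub token_type: TokenType,
    pub literal: String,
}

pub struct Lexer {
    input: Vec<char>, // the source code, as characters
    position: usize,  // index of the next character to read
}

impl Lexer {
    pub fn new(input: &str) -> Self {
        Lexer { input: input.chars().collect(), position: 0 }
    }

    pub fn next_token(&mut self) -> Token {
        // `None` means we have run past the end of the input: emit Eof.
        let Some(&c) = self.input.get(self.position) else {
            return Token { token_type: TokenType::Eof, literal: String::new() };
        };
        self.position += 1;
        let token_type = match c {
            '=' => TokenType::Assign,
            '+' => TokenType::Plus,
            ',' => TokenType::Comma,
            ';' => TokenType::Semicolon,
            '(' => TokenType::LeftParenthesis,
            ')' => TokenType::RightParenthesis,
            '{' => TokenType::LeftBrace,
            '}' => TokenType::RightBrace,
            // Anything unrecognized becomes an Illegal token.
            _ => TokenType::Illegal,
        };
        Token { token_type, literal: c.to_string() }
    }
}
```

Feeding it "=+,(){}.;" and calling next_token repeatedly would produce the token sequence from the test above, with the "." mapped to Illegal and a final Eof once the input is exhausted.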

And to handle the intended input, which is

rust
let input = "let five = 5;
             let ten = 10;

             let add = fn(x, y) {
                 x + y;
             };

             let result = add(five, ten);";

that produces

rust
let expected_output = vec![
    Token { token_type: TokenType::Let, literal: "let".to_string() },
    Token { token_type: TokenType::Identifier, literal: "five".to_string() },
    Token { token_type: TokenType::Assign, literal: "=".to_string() },
    Token { token_type: TokenType::Integer, literal: "5".to_string() },
    Token { token_type: TokenType::Semicolon, literal: ";".to_string() },
    Token { token_type: TokenType::Let, literal: "let".to_string() },
    Token { token_type: TokenType::Identifier, literal: "ten".to_string() },
    Token { token_type: TokenType::Assign, literal: "=".to_string() },
    Token { token_type: TokenType::Integer, literal: "10".to_string() },
    Token { token_type: TokenType::Semicolon, literal: ";".to_string() },
    Token { token_type: TokenType::Let, literal: "let".to_string() },
    Token { token_type: TokenType::Identifier, literal: "add".to_string() },
    Token { token_type: TokenType::Assign, literal: "=".to_string() },
    Token { token_type: TokenType::Function, literal: "fn".to_string() },
    Token { token_type: TokenType::LeftParenthesis, literal: "(".to_string() },
    Token { token_type: TokenType::Identifier, literal: "x".to_string() },
    Token { token_type: TokenType::Comma, literal: ",".to_string() },
    Token { token_type: TokenType::Identifier, literal: "y".to_string() },
    Token { token_type: TokenType::RightParenthesis, literal: ")".to_string() },
    Token { token_type: TokenType::LeftBrace, literal: "{".to_string() },
    Token { token_type: TokenType::Identifier, literal: "x".to_string() },
    Token { token_type: TokenType::Plus, literal: "+".to_string() },
    Token { token_type: TokenType::Identifier, literal: "y".to_string() },
    Token { token_type: TokenType::Semicolon, literal: ";".to_string() },
    Token { token_type: TokenType::RightBrace, literal: "}".to_string() },
    Token { token_type: TokenType::Semicolon, literal: ";".to_string() },
    Token { token_type: TokenType::Let, literal: "let".to_string() },
    Token { token_type: TokenType::Identifier, literal: "result".to_string() },
    Token { token_type: TokenType::Assign, literal: "=".to_string() },
    Token { token_type: TokenType::Identifier, literal: "add".to_string() },
    Token { token_type: TokenType::LeftParenthesis, literal: "(".to_string() },
    Token { token_type: TokenType::Identifier, literal: "five".to_string() },
    Token { token_type: TokenType::Comma, literal: ",".to_string() },
    Token { token_type: TokenType::Identifier, literal: "ten".to_string() },
    Token { token_type: TokenType::RightParenthesis, literal: ")".to_string() },
    Token { token_type: TokenType::Semicolon, literal: ";".to_string() },
    Token { token_type: TokenType::Eof, literal: "".to_string() },
];

The implementation to handle this input is slightly more complicated, since multi-character tokens now need to be handled. Methods are therefore needed to identify whether the current character can start an identifier, to read an identifier, and to read a number. Whitespace also needs to be handled properly, without disturbing the tokenization process by, say, accidentally skipping over a character that should be part of a token.

All of this logic replaces the default case, which previously produced a simple Illegal token. Now, that branch first checks whether there is whitespace to skip and whether an identifier or integer is being read; only if both checks fail is the token classified as Illegal.

One thing to watch out for is that this introduces extra nesting in the next_token method, so proper return statements are needed to avoid accidentally falling through to the Illegal case or skipping over characters that belong to a token.
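To make the control flow concrete, here is a self-contained sketch of how the whitespace skipping, identifier/keyword reading, and number reading might slot in before the single-character match. As before, the names and structure are my own assumptions, not the actual monkey-interpreter code; note how the early returns keep the multi-character cases from falling through.

```rust
// Sketch of next_token with whitespace, identifier/keyword, and
// integer handling added in front of the single-character match.

#[derive(Debug, Eq, PartialEq)]
pub enum TokenType {
    Illegal, Eof,
    Identifier, Integer,
    Assign, Plus,
    Comma, Semicolon,
    LeftParenthesis, RightParenthesis,
    LeftBrace, RightBrace,
    Function, Let,
}

#[derive(Debug)]
pub struct Token {
    pub token_type: TokenType,
    pub literal: String,
}

pub struct Lexer {
    input: Vec<char>,
    position: usize,
}

impl Lexer {
    pub fn new(input: &str) -> Self {
        Lexer { input: input.chars().collect(), position: 0 }
    }

    fn peek(&self) -> Option<char> {
        self.input.get(self.position).copied()
    }

    fn skip_whitespace(&mut self) {
        while self.peek().is_some_and(|c| c.is_whitespace()) {
            self.position += 1;
        }
    }

    // Consume characters while `pred` holds and return them as a String.
    fn read_while(&mut self, pred: impl Fn(char) -> bool) -> String {
        let start = self.position;
        while self.peek().is_some_and(&pred) {
            self.position += 1;
        }
        self.input[start..self.position].iter().collect()
    }

    pub fn next_token(&mut self) -> Token {
        self.skip_whitespace();
        let Some(c) = self.peek() else {
            return Token { token_type: TokenType::Eof, literal: String::new() };
        };
        // Multi-character tokens return early, so the single-character
        // match below never misclassifies them as Illegal.
        if c.is_ascii_alphabetic() || c == '_' {
            let literal = self.read_while(|c| c.is_ascii_alphanumeric() || c == '_');
            let token_type = match literal.as_str() {
                "fn" => TokenType::Function,
                "let" => TokenType::Let,
                _ => TokenType::Identifier,
            };
            return Token { token_type, literal };
        }
        if c.is_ascii_digit() {
            let literal = self.read_while(|c| c.is_ascii_digit());
            return Token { token_type: TokenType::Integer, literal };
        }
        self.position += 1;
        let token_type = match c {
            '=' => TokenType::Assign,
            '+' => TokenType::Plus,
            ',' => TokenType::Comma,
            ';' => TokenType::Semicolon,
            '(' => TokenType::LeftParenthesis,
            ')' => TokenType::RightParenthesis,
            '{' => TokenType::LeftBrace,
            '}' => TokenType::RightBrace,
            _ => TokenType::Illegal,
        };
        Token { token_type, literal: c.to_string() }
    }
}
```

Because skip_whitespace only advances past whitespace and read_while stops at the first non-matching character, the position is never advanced past a character that still needs to be tokenized, which is exactly the pitfall described above.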

Next Steps

In the next part of this series, a complete lexer will be implemented to handle the entire Monkey language, and the initial steps of building the REPL (Read-Eval-Print Loop) will be covered. Stay tuned!
