Kyle Kloberdanz's Blog

Let's build a Lisp interpreter in Rust! (Part 1 - Lexing)

13 May 2020

Many interpreters rely on external tooling, such as parser generators. In this tutorial, we will learn things the hard way by coding everything by hand!

But before we get to that, we need to define the basics. What is an interpreter?

An interpreter is a program that reads text, then executes that text as a program.

Interpreters are closely related to compilers. A compiler is a program that goes a step further: it translates text from one programming language into another language. Commonly, compilers translate the text of a high-level language, such as Rust, into the machine code that a computer can execute. Compilers can have targets other than machine code. The Java compiler, for example, translates Java code into bytecode that a virtual machine, the JVM, can execute. TypeScript uses a compiler that translates TypeScript code into JavaScript, which is then translated into bytecode for your browser's JavaScript virtual machine (V8 for Chrome, or SpiderMonkey for Firefox).

Interpreters and compilers have a lot in common. If you build an interpreter, you are about halfway to having a compiler.

Compilers are implemented in two major parts, commonly referred to as the front end and the back end. The front end is responsible for dealing with text, while the back end is responsible for generating the output code. Each of these parts is composed of smaller parts, which commonly look something like this:

Front End
      • Lexer
      • Parser
Back End
      • Optimization
      • Code Generation

Lexing

The lexer, or tokenizer, is responsible for taking a string of text and converting it into tokens, which are the smallest elements of a program that carry some amount of meaning. Tokens in a formal language are the equivalent of words in a natural language. Take this sentence, for instance:

How now brown cow?

If you ran this sentence through a lexer, you would get the list of words:

["How", "now", "brown", "cow", "?"]
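
To make the idea concrete, here is a toy Rust sketch (the toy_lex function is hypothetical and just for illustration, not the lexer we will build for Stutter). It splits a sentence on whitespace and peels trailing punctuation off into its own token.

fn toy_lex(sentence: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    for word in sentence.split_whitespace() {
        // Everything up to any trailing punctuation is one token...
        let stripped = word.trim_end_matches(|c: char| c.is_ascii_punctuation());
        if !stripped.is_empty() {
            tokens.push(stripped.to_string());
        }
        // ...and each trailing punctuation character becomes its own token.
        for c in word[stripped.len()..].chars() {
            tokens.push(c.to_string());
        }
    }
    tokens
}

fn main() {
    // Prints ["How", "now", "brown", "cow", "?"]
    println!("{:?}", toy_lex("How now brown cow?"));
}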

Let's take a look at the C program below and see what would happen if we tokenized it.

    int gcd(int a, int b) {
        if (b == 0) {
            return a;
        } else {
            return gcd(b, a % b);
        }
    }

["int", "gcd", "(", "int", "a", ",", "int", "b", ")", "{",...]

Let's begin looking at the syntax that we will write a lexer (and, in later parts, a parser) for. In this tutorial we will be making a Lisp-like programming language. Since Lisp (LISt Processor) shares its name with a speech impediment... in that spirit, let's call this language Stutter!

Here we can see the GCD function from before, but now implemented in Stutter.

    (def gcd
      (lambda (a b)
        (if (= 0 b)
          a
          (gcd b (mod a b)))))

First things first, we need to create the project using the cargo program that comes bundled with Rust. If you do not have Rust installed already, you can get it from https://www.rust-lang.org/tools/install.

    $ cargo new stutter --bin
         Created binary (application) `stutter` package
    
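
The new stutter directory has the standard Cargo layout: a Cargo.toml manifest that holds the project's metadata and dependencies, and a src directory containing main.rs (cargo also initializes a git repository for the project).

    stutter/
    ├── Cargo.toml
    └── src/
        └── main.rs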

Since this program won't be very long, let's write all the code in main.rs for now.

We need to have a way to represent each token within Rust. This is a perfect place to use an enum. We'll create an enum called Token that can be used to represent every possible construct within our programming language.

    #[derive(Clone, Debug, PartialEq)]
    enum Token {
        Lparen,      // (
        Rparen,      // )
        Plus,        // +
        Minus,       // -
        Times,       // *
        Slash,       // /
        Percent,     // %
        Pow,         // pow
        Gt,          // >
        Lt,          // <
        Eq,          // =
        Gte,         // >=
        Lte,         // <=
        Let,         // let
        Def,         // def
        List,        // list
        Index,       // index
        Drop,        // drop
        Quote,       // quote
        Append,      // append
        Range,       // range
        Cat,         // cat
        Len,         // len
        Take,        // take
        If,          // if
        Int(BigInt), // Integer literal
        Dec(f64),    // Floating point literal
        Bool(bool),  // Boolean literal
        Id(String),  // identifier (variable name or function name)
    }

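A quick aside on the derive line (the snippet below is a throwaway scratch example, not part of the tutorial's code): Clone lets us copy tokens, Debug lets us print them with {:?}, and PartialEq lets us compare them for equality, which will come in handy when testing the lexer.

fn main() {
    let a = Token::Id("gcd".to_string());
    let b = a.clone();   // Clone: copy a token
    println!("{:?}", a); // Debug: prints Id("gcd")
    assert_eq!(a, b);    // PartialEq: compare tokens
}
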
We will want our tokenizer to recognize each symbol that can stand on its own. Let's write code that can recognize each of them.

As a bonus, because Rust makes it easy to plug in additional libraries, we can use the num-bigint crate for our integer type, so that our integers won't be limited to just 64 bits. This crate allows integers to be arbitrarily large... just as long as they still fit into our computer's memory. Our laptop Turing machines don't yet have infinite tape.

Add the following lines to your Cargo.toml file to download the num-bigint dependencies.

[dependencies]
num-bigint = "0.2"
num-traits = "0.2"
    

And now include them in main.rs

extern crate num_bigint;
extern crate num_traits;

use num_bigint::BigInt;

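As a quick scratch check (a throwaway snippet, not part of the Stutter source), we can confirm that BigInt happily handles numbers far beyond what 64 bits can hold.

use num_bigint::BigInt;

fn main() {
    // A 30-digit literal, well past u64::MAX (which is only 20 digits long).
    let big = BigInt::parse_bytes(b"123456789012345678901234567890", 10).unwrap();
    println!("{}", big);
    println!("{}", &big * &big); // arithmetic works on arbitrarily large values
}
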
Where were we? Oh yes! Back to the tokenizer! We'll make a function that reads in a string, and returns the appropriate kind of token for that string.

This code first tries to parse the string as a BigInt. If that fails, it then tries to parse it as a 64-bit float, then as a bool. If none of those succeed, it tries a list of reserved words and symbols, and finally falls back to an identifier when all else fails.

An identifier is something like a function name or a variable name in a programming language.

fn to_token(s: &String) -> Token {
    let bytes = s.as_bytes();
    if let Some(t) = BigInt::parse_bytes(bytes, 10) {
        Token::Int(t)
    } else if let Ok(t) = s.parse::<f64>() {
        Token::Dec(t)
    } else if let Ok(t) = s.parse::<bool>() {
        Token::Bool(t)
    } else {
        match s.as_ref() {
            "(" => Token::Lparen,
            ")" => Token::Rparen,
            "+" => Token::Plus,
            "-" => Token::Minus,
            "*" => Token::Times,
            "/" => Token::Slash,
            "%" => Token::Percent,
            "pow" => Token::Pow,
            "<" => Token::Lt,
            ">" => Token::Gt,
            "=" => Token::Eq,
            ">=" => Token::Gte,
            "<=" => Token::Lte,
            "let" => Token::Let,
            "def" => Token::Def,
            "list" => Token::List,
            "take" => Token::Take,
            "if" => Token::If,
            "index" => Token::Index,
            "drop" => Token::Drop,
            "quote" => Token::Quote,
            "append" => Token::Append,
            "range" => Token::Range,
            "cat" => Token::Cat,
            "len" => Token::Len,
            _ => Token::Id(s.to_string()),
        }
    }
}

Let's test out this code in main and give it some example strings.

fn main() {
    println!("{:?}", to_token(&"(".to_string()));
    println!("{:?}", to_token(&")".to_string()));
    println!("{:?}", to_token(&"def".to_string()));
    println!("{:?}", to_token(&"+".to_string()));
    println!("{:?}", to_token(&"myfunc".to_string()));
    println!("{:?}", to_token(&"myvar".to_string()));
    println!("{:?}", to_token(&"3.14".to_string()));
    println!("{:?}", to_token(&"42".to_string()));
}

When we run it:

$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.55s
     Running `target/debug/stutter`
Lparen
Rparen
Def
Plus
Id("myfunc")
Id("myvar")
Dec(3.14)
Int(BigInt { sign: Plus, data: BigUint { data: [42] } })

Looks good! Now we need a function that can take a string and split it into individual tokens. While it's at it, it can also skip comments, which in Stutter run from a semicolon to the end of the line.

fn lex(cmd: &String) -> Result<Vec<Token>, String> {
    let mut tokens = Vec::new();
    let mut tok = String::new();
    let mut in_comment = false;
    for c in cmd.chars() {
        // A comment runs from a ';' to the end of the line.
        if in_comment {
            if c == '\n' {
                in_comment = false;
            }
            continue;
        }
        match c {
            ';' => in_comment = true,

            ')' => {
                if tok.len() > 0 {
                    tokens.push(to_token(&tok));
                    tok = String::new();
                }
                tokens.push(to_token(&String::from(")")));
            }

            '(' => {
                if tok.len() > 0 {
                    tokens.push(to_token(&tok));
                    tok = String::new();
                }
                tokens.push(to_token(&String::from("(")));
            }

            // Whitespace ends the token we are currently building.
            ' ' | '\n' => {
                if tok.len() > 0 {
                    tokens.push(to_token(&tok));
                    tok = String::new();
                }
            }

            _ => {
                tok.push(c);
            }
        }
    }

    if tok.len() > 0 {
        Err(format!("Invalid syntax, token not matched: {:?}", tok))
    } else {
        Ok(tokens)
    }
}

Let's test out our tokenizer by changing main to call lex instead.

fn main() {
    println!("{:#?}", lex(&"(+ 1 2)".to_string()));
    println!("{:#?}", lex(&"(def add (lambda (x y) (+ x y)))".to_string()));
}

$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.04s
     Running `target/debug/stutter`
Ok(
    [
        Lparen,
        Plus,
        Int(
            BigInt {
                sign: Plus,
                data: BigUint {
                    data: [
                        1,
                    ],
                },
            },
        ),
        Int(
            BigInt {
                sign: Plus,
                data: BigUint {
                    data: [
                        2,
                    ],
                },
            },
        ),
        Rparen,
    ],
)
Ok(
    [
        Lparen,
        Def,
        Id(
            "add",
        ),
        Lparen,
        Id(
            "lambda",
        ),
        Lparen,
        Id(
            "x",
        ),
        Id(
            "y",
        ),
        Rparen,
        Lparen,
        Plus,
        Id(
            "x",
        ),
        Id(
            "y",
        ),
        Rparen,
        Rparen,
        Rparen,
    ],
)

That looks good. We took a string and turned it into a stream of tokens that will be easy to work with in Rust.

Here's our code so far:

extern crate num_bigint;
extern crate num_traits;

use num_bigint::BigInt;

#[derive(Clone, Debug, PartialEq)]
enum Token {
    Lparen,      // (
    Rparen,      // )
    Plus,        // +
    Minus,       // -
    Times,       // *
    Slash,       // /
    Percent,     // %
    Pow,         // pow
    Gt,          // >
    Lt,          // <
    Eq,          // =
    Gte,         // >=
    Lte,         // <=
    Let,         // let
    Def,         // def
    List,        // list
    Index,       // index
    Drop,        // drop
    Quote,       // quote
    Append,      // append
    Range,       // range
    Cat,         // cat
    Len,         // len
    Take,        // take
    If,          // if
    Int(BigInt), // Integer literal
    Dec(f64),    // Floating point literal
    Bool(bool),  // Boolean literal
    Id(String),  // identifier (variable name or function name)
}

fn to_token(s: &String) -> Token {
    let bytes = s.as_bytes();
    if let Some(t) = BigInt::parse_bytes(bytes, 10) {
        Token::Int(t)
    } else if let Ok(t) = s.parse::<f64>() {
        Token::Dec(t)
    } else if let Ok(t) = s.parse::<bool>() {
        Token::Bool(t)
    } else {
        match s.as_ref() {
            "(" => Token::Lparen,
            ")" => Token::Rparen,
            "+" => Token::Plus,
            "-" => Token::Minus,
            "*" => Token::Times,
            "/" => Token::Slash,
            "%" => Token::Percent,
            "pow" => Token::Pow,
            "<" => Token::Lt,
            ">" => Token::Gt,
            "=" => Token::Eq,
            ">=" => Token::Gte,
            "<=" => Token::Lte,
            "let" => Token::Let,
            "def" => Token::Def,
            "list" => Token::List,
            "take" => Token::Take,
            "if" => Token::If,
            "index" => Token::Index,
            "drop" => Token::Drop,
            "quote" => Token::Quote,
            "append" => Token::Append,
            "range" => Token::Range,
            "cat" => Token::Cat,
            "len" => Token::Len,
            _ => Token::Id(s.to_string()),
        }
    }
}

fn lex(cmd: &String) -> Result<Vec<Token>, String> {
    let mut tokens = Vec::new();
    let mut tok = String::new();
    let mut in_comment = false;
    for c in cmd.chars() {
        // A comment runs from a ';' to the end of the line.
        if in_comment {
            if c == '\n' {
                in_comment = false;
            }
            continue;
        }
        match c {
            ';' => in_comment = true,

            ')' => {
                if tok.len() > 0 {
                    tokens.push(to_token(&tok));
                    tok = String::new();
                }
                tokens.push(to_token(&String::from(")")));
            }

            '(' => {
                if tok.len() > 0 {
                    tokens.push(to_token(&tok));
                    tok = String::new();
                }
                tokens.push(to_token(&String::from("(")));
            }

            // Whitespace ends the token we are currently building.
            ' ' | '\n' => {
                if tok.len() > 0 {
                    tokens.push(to_token(&tok));
                    tok = String::new();
                }
            }

            _ => {
                tok.push(c);
            }
        }
    }

    if tok.len() > 0 {
        Err(format!("Invalid syntax, token not matched: {:?}", tok))
    } else {
        Ok(tokens)
    }
}

fn main() {
    println!("{:#?}", lex(&"(+ 1 2)".to_string()));
    println!("{:#?}", lex(&"(def add (lambda (x y) (+ x y)))".to_string()));
}

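As an optional extra (a hypothetical test module, not part of the original post), we can pin down the lexer's behavior with a unit test. Paste it at the bottom of main.rs and run it with cargo test.

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn lexes_a_simple_expression() {
        let tokens = lex(&"(+ x y)".to_string()).unwrap();
        assert_eq!(
            tokens,
            vec![
                Token::Lparen,
                Token::Plus,
                Token::Id("x".to_string()),
                Token::Id("y".to_string()),
                Token::Rparen,
            ]
        );
    }
}
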
Stay tuned for part 2, where I will go through parsing!
