Many interpreters rely on external tooling, such as parser generators. In this tutorial, we will learn things the hard way by coding everything by hand!
But before we get to that, we need to define the basics. What is an interpreter?
An interpreter is a program that reads text, then executes that text as a program.
Interpreters are closely related to compilers. A compiler is a program that goes a step further: it translates text written in one programming language into another language. Commonly, compilers translate a high-level language, such as Rust, into machine code that a computer can execute directly. Compilers can have targets other than machine code, though. The Java compiler, for example, translates Java source code into bytecode that a virtual machine, the JVM, can execute. TypeScript uses a compiler that translates TypeScript code into JavaScript, which is then compiled to bytecode by your browser's JavaScript virtual machine (V8 in Chrome, SpiderMonkey in Firefox).
Interpreters and compilers have a lot in common. If you build an interpreter, you are about halfway to having a compiler.
Compilers are implemented in two major parts, commonly referred to as the front end and the back end. The front end is responsible for dealing with text, while the back end is responsible for generating the output code. Each of these parts is composed of smaller pieces, commonly something like this: the front end runs a lexer and then a parser over the text, and the back end takes the result and generates code from it.
The lexer or tokenizer is responsible for taking a string of text and converting it into tokens, which are the smallest elements of a program that carry some amount of meaning. Tokens in a formal language are the equivalent of words in a natural language. Take this sentence, for instance.
How now brown cow?
If you ran this sentence through a lexer, you would get the list of words: ["How", "now", "brown", "cow", "?"]
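To make that concrete, here is a tiny, throwaway Rust sketch of a word-level lexer. The helper name lex_words and its "peel off trailing punctuation" rule are made up purely for illustration; this is not part of Stutter.

fn lex_words(sentence: &str) -> Vec<String> {
    let mut words = Vec::new();
    for chunk in sentence.split_whitespace() {
        // If a chunk ends in punctuation, peel that mark off into its own token
        if let Some(last) = chunk.chars().last() {
            if last.is_ascii_punctuation() && chunk.len() > 1 {
                words.push(chunk[..chunk.len() - 1].to_string());
                words.push(last.to_string());
                continue;
            }
        }
        words.push(chunk.to_string());
    }
    words
}

fn main() {
    // Prints ["How", "now", "brown", "cow", "?"]
    println!("{:?}", lex_words("How now brown cow?"));
}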
Let's take a look at the C program below and see what happens when we tokenize it.
int gcd(int a, int b) {
    if (b == 0) {
        return a;
    } else {
        return gcd(b, a % b);
    }
}
["int", "gcd", "(", "int", "a", ",", "int", "b", ")", "{",...]
Let's begin looking at the syntax that we will write a parser for. In this tutorial we will be making a Lisp-like programming language. Since Lisp (List Processor) also happens to be the name of a speech impediment... in that spirit, let's call this language Stutter!
Here we can see the GCD function from before, but now implemented in Stutter.
(def gcd
  (lambda (a b)
    (if (= 0 b)
      a
      (gcd b (mod a b)))))
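For example, (gcd 12 18) expands to (gcd 18 12), then (gcd 12 6), then (gcd 6 0), at which point b is 0 and the answer 6 pops back out of the recursion.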
First things first, we need to create the project using the cargo program that comes bundled with Rust. If you do not have Rust installed already, you can get it here.
$ cargo new stutter --bin
     Created binary (application) `stutter` package
Since this program won't be very long, for now let's write all the code in main.rs.
We need to have a way to represent each token within Rust. This is a perfect place to use an enum. We'll create an enum called Token that can be used to represent every possible construct within our programming language.
#[derive(Clone, Debug, PartialEq)]
enum Token {
    Lparen,      // (
    Rparen,      // )
    Plus,        // +
    Minus,       // -
    Times,       // *
    Slash,       // /
    Percent,     // %
    Pow,         // pow
    Gt,          // >
    Lt,          // <
    Eq,          // =
    Gte,         // >=
    Lte,         // <=
    Let,         // let
    Def,         // def
    List,        // list
    Index,       // index
    Drop,        // drop
    Quote,       // quote
    Append,      // append
    Range,       // range
    Cat,         // cat
    Len,         // len
    Take,        // take
    If,          // if
    Int(BigInt), // Integer literal
    Dec(f64),    // Floating point literal
    Bool(bool),  // Boolean literal
    Id(String),  // identifier (variable name or function name)
}
We will want our tokenizer to recognize each symbol that can stand on its own. Let's write the code that maps each symbol to its token.
As a bonus, because Rust makes it easy to plug in additional libraries, we can use the num-bigint crate to represent our integers, so that they won't be limited to just 64 bits. This crate allows integers to be arbitrarily large... just as long as they still fit into our computer's memory. Our laptop Turing machines don't yet have infinite tape.
Add the following lines to your Cargo.toml file to download the num-bigint dependencies.
[dependencies] num-bigint = "0.2" num-traits = "0.2"
And now include them in main.rs:
extern crate num_bigint;
extern crate num_traits;

use num_bigint::BigInt;
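As a quick sanity check that the crate is wired up, BigInt::parse_bytes can read an integer far larger than anything a u64 can hold. This throwaway main is only for illustration with the imports above in place; it won't stay in the program.

fn main() {
    // 2^128 does not fit in a u64, but BigInt parses it without complaint.
    let big = BigInt::parse_bytes(b"340282366920938463463374607431768211456", 10)
        .expect("not a valid base-10 integer");
    println!("{}", &big * &big); // prints 2^256, a 78-digit number
}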
Where were we? Oh yes! Back to the tokenizer! We'll make a function that reads in a string and returns the appropriate kind of token for that string.
This code first tries to parse the string as a BigInt. If that fails, it tries to parse it as a 64-bit float, then as a bool, then it checks the string against the list of reserved words, and finally falls back to an identifier if nothing else matches.
An identifier is something like a function name or a variable name in a programming language.
fn to_token(s: &String) -> Token { let bytes = s.as_bytes(); if let Some(t) = BigInt::parse_bytes(bytes, 10) { Token::Int(t) } else if let Ok(t) = s.parse::<f64>() { Token::Dec(t) } else if let Ok(t) = s.parse::<bool>() { Token::Bool(t) } else { match s.as_ref() { "(" => Token::Lparen, ")" => Token::Rparen, "+" => Token::Plus, "-" => Token::Minus, "*" => Token::Times, "/" => Token::Slash, "%" => Token::Percent, "pow" => Token::Pow, "<" => Token::Lt, ">" => Token::Gt, "=" => Token::Eq, ">=" => Token::Gte, "<=" => Token::Lte, "let" => Token::Let, "def" => Token::Def, "list" => Token::List, "take" => Token::Take, "if" => Token::If, "index" => Token::Index, "drop" => Token::Drop, "quote" => Token::Quote, "append" => Token::Append, "range" => Token::Range, "cat" => Token::Cat, "len" => Token::Len, _ => Token::Id(s.to_string()), } } }
Let's test out this code in main and give it some example strings.
fn main() { println!("{:?}", to_token(&"(".to_string())); println!("{:?}", to_token(&")".to_string())); println!("{:?}", to_token(&"def".to_string())); println!("{:?}", to_token(&"+".to_string())); println!("{:?}", to_token(&"myfunc".to_string())); println!("{:?}", to_token(&"myvar".to_string())); println!("{:?}", to_token(&"3.14".to_string())); println!("{:?}", to_token(&"42".to_string())); }
When we run it:
$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.55s
     Running `target/debug/stutter`
Lparen
Rparen
Def
Plus
Id("myfunc")
Id("myvar")
Dec(3.14)
Int(BigInt { sign: Plus, data: BigUint { data: [42] } })
Looks good! Now we need a function that can take a string and split it into individual tokens.
fn lex(cmd: &String) -> Result<Vec<Token>, String> { let mut tokens = Vec::new(); let mut tok = String::new(); let mut in_comment = false; for c in cmd.chars() { if in_comment { continue; } match c { ';' => in_comment = true, '\n' => in_comment = false, ')' => { if tok.len() > 0 { tokens.push(to_token(&tok)); tok = String::new(); } tokens.push(to_token(&String::from(")"))); } '(' => { if tok.len() > 0 { tokens.push(to_token(&tok)); tok = String::new(); } tokens.push(to_token(&String::from("("))); } ' ' => { if tok.len() > 0 { tokens.push(to_token(&tok)); tok = String::new(); } } _ => { tok.push(c); } } } if tok.len() > 0 { Err(format!("Invalid syntax, token not matched: {:?}", tok)) } else { Ok(tokens) } }
Let's test out our tokenizer by changing main to call lex instead.
fn main() { println!("{:#?}", lex(&"(+ 1 2)".to_string())); println!("{:#?}", lex(&"(def add (lambda (x y) (+ x y)))".to_string())); }
$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.04s
     Running `target/debug/stutter`
Ok(
    [
        Lparen,
        Plus,
        Int(
            BigInt {
                sign: Plus,
                data: BigUint {
                    data: [
                        1,
                    ],
                },
            },
        ),
        Int(
            BigInt {
                sign: Plus,
                data: BigUint {
                    data: [
                        2,
                    ],
                },
            },
        ),
        Rparen,
    ],
)
Ok(
    [
        Lparen,
        Def,
        Id(
            "add",
        ),
        Lparen,
        Id(
            "lambda",
        ),
        Lparen,
        Id(
            "x",
        ),
        Id(
            "y",
        ),
        Rparen,
        Lparen,
        Plus,
        Id(
            "x",
        ),
        Id(
            "y",
        ),
        Rparen,
        Rparen,
        Rparen,
    ],
)
That looks good. We took a string and turned it into a stream of tokens that will be easy to work with in Rust.
Here's our code so far:
extern crate num_bigint;
extern crate num_traits;

use num_bigint::BigInt;

#[derive(Clone, Debug, PartialEq)]
enum Token {
    Lparen,      // (
    Rparen,      // )
    Plus,        // +
    Minus,       // -
    Times,       // *
    Slash,       // /
    Percent,     // %
    Pow,         // pow
    Gt,          // >
    Lt,          // <
    Eq,          // =
    Gte,         // >=
    Lte,         // <=
    Let,         // let
    Def,         // def
    List,        // list
    Index,       // index
    Drop,        // drop
    Quote,       // quote
    Append,      // append
    Range,       // range
    Cat,         // cat
    Len,         // len
    Take,        // take
    If,          // if
    Int(BigInt), // Integer literal
    Dec(f64),    // Floating point literal
    Bool(bool),  // Boolean literal
    Id(String),  // identifier (variable name or function name)
}

fn to_token(s: &String) -> Token {
    let bytes = s.as_bytes();
    if let Some(t) = BigInt::parse_bytes(bytes, 10) {
        Token::Int(t)
    } else if let Ok(t) = s.parse::<f64>() {
        Token::Dec(t)
    } else if let Ok(t) = s.parse::<bool>() {
        Token::Bool(t)
    } else {
        match s.as_ref() {
            "(" => Token::Lparen,
            ")" => Token::Rparen,
            "+" => Token::Plus,
            "-" => Token::Minus,
            "*" => Token::Times,
            "/" => Token::Slash,
            "%" => Token::Percent,
            "pow" => Token::Pow,
            "<" => Token::Lt,
            ">" => Token::Gt,
            "=" => Token::Eq,
            ">=" => Token::Gte,
            "<=" => Token::Lte,
            "let" => Token::Let,
            "def" => Token::Def,
            "list" => Token::List,
            "take" => Token::Take,
            "if" => Token::If,
            "index" => Token::Index,
            "drop" => Token::Drop,
            "quote" => Token::Quote,
            "append" => Token::Append,
            "range" => Token::Range,
            "cat" => Token::Cat,
            "len" => Token::Len,
            // Anything that is not a literal or a reserved word is an identifier
            _ => Token::Id(s.to_string()),
        }
    }
}

fn lex(cmd: &String) -> Result<Vec<Token>, String> {
    let mut tokens = Vec::new();
    let mut tok = String::new();
    let mut in_comment = false;
    for c in cmd.chars() {
        // A ';' starts a comment that is skipped until the end of the line
        if in_comment {
            if c == '\n' {
                in_comment = false;
            }
            continue;
        }
        match c {
            ';' => in_comment = true,
            ')' => {
                if tok.len() > 0 {
                    tokens.push(to_token(&tok));
                    tok = String::new();
                }
                tokens.push(to_token(&String::from(")")));
            }
            '(' => {
                if tok.len() > 0 {
                    tokens.push(to_token(&tok));
                    tok = String::new();
                }
                tokens.push(to_token(&String::from("(")));
            }
            // Whitespace ends the token we have been building up
            ' ' | '\n' => {
                if tok.len() > 0 {
                    tokens.push(to_token(&tok));
                    tok = String::new();
                }
            }
            _ => {
                tok.push(c);
            }
        }
    }
    if tok.len() > 0 {
        Err(format!("Invalid syntax, token not matched: {:?}", tok))
    } else {
        Ok(tokens)
    }
}

fn main() {
    println!("{:#?}", lex(&"(+ 1 2)".to_string()));
    println!("{:#?}", lex(&"(def add (lambda (x y) (+ x y)))".to_string()));
}