Iterators

I am using Rust to write an interpreter for a small subset of Lisp to restart my Rust journey. One of the pieces required for the interpreter is a tokenizer, which converts the entire input string into tokens. Since Lisp is mostly parens, characters, whitespace and dots, it was an easy implementation – although made difficult because … Rust.

Anyhow, my approach was to create a tokenizer which gives back one token at a time, which I can then convert into an AST. This is a word-based tokenizer; I went with it to save myself some time and avoid writing a character-based one.

Although it turned out to be less helpful than I expected, and I will have to implement a character-based tokenizer eventually, this was a good exercise because I got to implement the iterator pattern.

So let’s try to understand how iterators work in Rust.

An iterator is a lazy sequence of values: it hands them out one at a time and stops when either the values are exhausted or some other condition is met. Since it's lazy, the next value is not computed until it is asked for; in the meantime the iterator just rests and does nothing, resuming when the next value is requested.
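
To see the laziness in action, here's a small sketch (independent of the tokenizer) using a standard iterator adapter: the closure passed to map only runs when a value is actually pulled out.

fn main() {
    let numbers = vec![1, 2, 3];

    // Nothing is computed here; `map` only builds a lazy adapter.
    let mut doubled = numbers.iter().map(|n| {
        println!("computing for {}", n);
        n * 2
    });

    // Only now does the closure run, and only for the first value.
    // Prints "computing for 1" followed by "first: Some(2)".
    println!("first: {:?}", doubled.next());
}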

Since it has to start and stop so many times, it needs a place to save its state. The implementation demands a struct which holds this state, and each iteration needs to return the next value and update the state, or stop once the values are exhausted.

We start with creating the struct. What goes in it depends on what you need in order to produce the next value on each iteration. This is the state of the iterator and is updated in each iteration.

We also write the impl block for this struct with a new function, which allows us to create an instance. This just seeds the state with the string that we pass in.

pub struct Tokenizer {
    // The part of the input that has not been tokenized yet.
    state: String,
}

impl Tokenizer {
    pub fn new(s: String) -> Tokenizer {
        Tokenizer {
            state: s
        }
    }
}

We now implement the actual functionality, i.e. the Iterator trait for Tokenizer. The heart of this implementation is a method called next which returns the next token from the string. It uses another function called parse_token which returns the next token and the remaining string; the next method is free to rely on any helper function like this for its work.

impl Iterator for Tokenizer {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        // Treat a whitespace-only remainder as exhausted, so trailing
        // whitespace does not produce a spurious empty token.
        if self.state.trim().is_empty() {
            None
        } else {
            let (token, remaining) = parse_token(self.state.as_str());
            self.state = remaining;
            Some(token)
        }
    }
}
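
Calling next by hand shows the contract in action: Some(token) while there is input left, then None once the state is exhausted. A quick sketch in test form:

#[test]
fn next_returns_tokens_then_none() {
    let mut tokenizer = Tokenizer::new(String::from("(a)"));
    assert_eq!(tokenizer.next(), Some(String::from("(")));
    assert_eq!(tokenizer.next(), Some(String::from("a")));
    assert_eq!(tokenizer.next(), Some(String::from(")")));
    assert_eq!(tokenizer.next(), None);
}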

And finally here’s the parse_token function. I am passing the state as a string slice &str instead of String because that’s how I had written it in my original implementation; it hardly matters one way or the other. I must point out, though, that storing a reference in a struct forces you to deal with lifetime annotations, so this example stays a bit light on the brain (there’s a sketch of what a borrowing version would look like right after the function).

pub fn parse_token(a: &str) -> (String, String) {
    if a.is_empty() {
        panic!("found empty string while parsing.")
    }

    // `pos` is a byte offset, because `split_off` below takes a byte
    // index, not a character index.
    let mut pos = 0;
    let mut a = String::from(a.trim_start());

    for i in a.chars() {
        // Whitespace cannot appear at the start because we've already
        // trimmed it, so this one comes after an atom.
        if i.is_whitespace() {
            break;
        }

        // Parens and dots are single-character tokens. If an atom was
        // already found (pos > 0), stop and return the atom; the
        // delimiter will be picked up on the next call. Otherwise the
        // delimiter itself is the token.
        if i == '(' || i == ')' || i == '.' {
            if pos == 0 {
                pos += 1;
            }
            break;
        }

        pos += i.len_utf8();
    }

    // `split_off` has absolutely confusing documentation: the receiver
    // is truncated to the bytes before `pos`, and the RETURNED string
    // is the remaining one.
    let remaining = a.split_off(pos);
    (a, remaining)
}
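
As promised above, here's roughly what the struct would look like if it borrowed the input instead of owning it. The name BorrowingTokenizer is just for illustration; the point is the lifetime parameter 'a that you'd now have to carry around everywhere:

// A borrowing variant of the struct: it now needs a lifetime
// parameter tying it to the string it borrows from.
pub struct BorrowingTokenizer<'a> {
    state: &'a str,
}

impl<'a> BorrowingTokenizer<'a> {
    pub fn new(s: &'a str) -> BorrowingTokenizer<'a> {
        BorrowingTokenizer { state: s }
    }
}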

Here’s how it works:

#[test]
fn tokenizer_finds_the_correct_tokens() {
    // Input is not a valid form, but used here to test
    let input = String::from("(defun abc (x . y) bar baz)");
    let tokenizer = Tokenizer::new(input);
    let count: usize = tokenizer.count();
    assert_eq!(count, 11);
}

fn main() {
    let s = String::from("(defun abc (x . y) bar baz)");
    let tokenizer = Tokenizer::new(s);
    let tokens: Vec<String> = tokenizer.collect();
    println!("{:?}", tokens);
    // Output: ["(", "defun", "abc", "(", "x", ".", "y", ")", "bar", "baz", ")"]
}
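
And because Tokenizer implements Iterator, it plugs straight into a for loop too; swapping the body of main above for something like this works just the same:

let tokenizer = Tokenizer::new(String::from("(defun abc (x . y) bar baz)"));
for token in tokenizer {
    println!("{}", token);
}
// Prints each of the 11 tokens on its own line.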