Parsing Text in JavaScript: Part 1

Parsing Text in JavaScript: Part 1

Patrick Burris

October 16th, 2018

Overview

In this post we will be building a program that takes in a string containing some math expression and outputs the number that corresponds to the answer. The type of parser we are going to build is called a Pratt parser, the same type of parser that many JS linters (like JSLint) use. The steps are as follows:

  • Parsing text one letter at a time
  • Lexing that text into tokens
  • Parsing the tokens and creating expressions
  • Evaluating those expressions and returning the answer

The main concepts that will be covered are:

  • Recursive descent Pratt parser
  • Operator precedence parsing
  • Lexing
  • Abstract Syntax Tree
  • Tokenization
  • Expression objects

We will start with defining the tokens we will use and creating some helper functions around those tokens. Once the tokens are set we can build our lexer. After the lexer we build our Abstract Syntax Tree (AST) full of expressions. Once we have our AST we can run through the tree one node at a time and recursively evaluate the tree into a single output.

You can find the finished calculator here

Note -- I am using ES modules using the import keyword

To do this yourself your Node version must be atleast 8.5 and you must run the code with the --experimental-modules flag. You can also skip that and setup a transpiler like babel to run the code or use something like webpack to build the files.

Tokens

We start by defining out tokens module. I created a tokens.js file for my project and start exporting the different tokens. I named them with capital letters e.g. PLUS for the plus sign. Below is my final list of tokens.

The actual format for the token will be an object with the type and a string that is the literal, or the actual value from the text. A number might look like the following:

const NumberToken = {
  type: NUMBER, // from tokens.js above
  literal: '1337', // the actual number held in this token
};

A PLUS token would look like:

const PlusToken = {
  type: PLUS,
  literal: '+', // the actual plus sign
};

The next step is implementing the lexer. The lexer is what iterates through the text and generates the tokens. This process would turn something like the following string 1 + 2 and turns into:

const tokens = [
  { type: NUMBER, literal: '1' },
  { type: PLUS, literal: '+' },
  { type: NUMBER, literal: '2' },
];

The tokens do not require any tests

Lexer

Now we have our tokens made we will move on to building the Lexer. Lexing is the process of converting the sequence of characters into a sequence of tokens. The tokens give meaning to characters and will serve as a simple way to parse and build the expressions. When we encounter a problem with the incoming source code we can emit an ILLEGAL token to let the parser know that there was a lexing error.

The calculator's lexer will be simple. We will parse through one character at a time via a function called nextToken, that function will check the current character and either create and return a token or check the next characters until a token can be made. The Lexer class will take in the source code and then increment the current position and the read position as well as set our current character.

The bulk of the work comes from the nextToken function. We start the nextToken function by skipping any white space. That is to say any spaces for now (but you can include tabs, newlines and returns). Inside the nextToken function is a giant switch-case for our current character. This will check if we are looking at a plus sign or a number or anything else. After the current token is built we progress to the next character and return the token.

We will start by writing readChar method. This will check if we are at the end of the input and return the EOF character otherwise the next character is set and the read position is moved.

  readChar() {
    // End of the input
    if (this.readPos >= this.input.length) {
      this.ch = ';';
    } else {
      this.ch = this.input[this.readPos];
    }
    this.pos = this.readPos;
    this.readPos += 1;
  }

The skipWhitespace function looks like the following:

  skipWhiteSpace() {
    while (this.ch === ' ') {
      this.readChar();
    }
  }

With those two functions we are ready to build the nextToken function. This function starts by calling skipWhiteSpace so we at a non-space character. The switch-case will take the currect character. The cases will include items like the various signs, parentheses and the semicolon. Our default case will check to see if we need to read a number or return an illegal value. I also added a helper function in the tokens file for creating a new token.

Now that we have finished our Lexer we can write a test to see if we are getting the correct tokens. Throughout this guide I will be using Node's assert module. I did not feel that setting up a testing framework was worth it for this project. It is not very complicated and doesn't need anything like mocking for us to test effectively.

We are testing if the tokens that the Lexer spit out with the nextToken function are the same that we expect. This is the only test needed for the Lexer.

Part 2

Watch out for part 2 where will build the syntax tree and parser

Patrick Burris

Software Developer

I am a web developer living and working in Tulsa, Oklahoma. Software is my hobby, web development is my job and web security is my passion. Send me a message using the form below or connect with me on Linkedin, I am always looking for new consulting opportunities.

Contact Me