Oct 09

What’s in a regular expression?

While working on my Nand2Tetris assembler parser (project 6), I devised with myself about whether or not I would use C++11’s regular expressions to parse assembly lines. I answered yes to that question, because:

  • Regardless of the performance discussion, I like the idea of a language that can match any text pattern. When I am in an environment when such a tool is available, I do not want to reinvent that wheel.
  • While manipulating some text in an editor, I often feel that a certain repetitive manual task would take a few seconds instead of many minutes if I just had a good command of regular expressions. Therefore, this is an opportunity to get better at it.

Spoiler warning: the rest of this post will discuss some details of the assembler parser implementation for Nand2Tetris. While I will mostly focus on the regular expression part, and do not post a complete solution, you should not read it if you want to take a completely fresh start at the project.
The power horse of the Hack computer instruction set (see chapter 6) is the C-instruction.
To quote the book, C-instruction:

Either or. That means not both. I.e. an instruction with neither dest nor jump (that would be a NOP) is invalid.
I am thinking about having one regular expression for the A-instruction, one for the C-instruction, and one for the symbol pseudo-command (that names labels).
My ambition for the parser is just to decode a correctly written line of assembly code:

  • An empty line or a line of comments shall just be discarded, i.e. my regular expressions shall not match them.
  • An A-instruction line shall be identified as such and the symbol string shall be extracted.
  • A C-instruction line shall be identified as such and the dest, comp and jump strings shall be extracted.
  • I do not ambition for my parser to reject an invalid symbol, dest, comp, or jump string. My plan is to let the invoker take care of that. That should make for relatively simple regular expressions and for now, it seems like an acceptable limitation (to be confirmed).

C++11 use a modified ECMAScript regular expression grammar. My reference for this is cppreference.com, but I am also using other sources to help me understand that information, which is quite dry.
A first attempt at a regular expression for the C-instruction could be (I separate the string in several strings in a C/C++ manner, in order to make the structure clearer – also note the C/C++ escaping of “\”, which is not part of the actual regular expression):

Let’s try it:

results in

The first match is the whole expression that matched. The other matches are the extracted strings. This did not work: the comp match also contains the jump part, and the jump match is empty. A reasonable way to prevent the comp part from eating the jump part is probably to just exclude “;” from the class of valid characters for comp. One more try:

results in

That worked. I replaced "\\S+" (1 or more non-space characters) by "[^\\s;]+" (1 or more characters that are neither spaces nor semicolon).
But since our rule for comp is so unspecific, and we require neither dest nor jump, I worry that a line with only one comment and without any spaces would match that rule too:

results in

That’s what I thought. Not good. An obvious way to fix this is to exclude “/” in the same way as we excluded “;”. Given the low level of ambition of our pattern matching for now, it is also a reasonable solution:

gives

It worked. Better get back to my parser now.