Oct 14

Manipulating directories in C++ under Linux

Currently working on Nand2Tetris project 7, I ran into the need to accept a directory name as an argument to a C++ application. I looked for a standard way to deal with directories in C++11, but that is unfortunately not part of the standard. Since I am running Linux, I do not have to suffer that much. I implemented a solution based on the Linux/POSIX API calls opendir()/readdir()/closedir() to list files in the directory, and realpath() to get the absolute path (since I needed to extract the directory name regardless of how it was pointed to).
I ended up with two relatively simple static member functions. I could probably have made them even more general, but this is good enough for me:

and:

You will find information about the necessary #includes in your favorite man page database.
If you spot a bug, please leave a reply.

Oct 10

Performance of C++11 regular expressions

Since I want to talk about C++11 program performance, I guess I should start saying that I am running g++ (GCC) 4.9.1 on Linux (Manjaro).
In What’s in a regular expression?, I presented some preliminary testing of the regular expression functionality in C++ that I could use in the scope of the assembler parser for Nand2Tetris.
In Linux application profiling can be spectacular, I explained how I divided the running time of my assembler by 100, by constructing regular expressions only once instead of once per row to parse.
After the regular expression improvement, I could not help rewriting the parser (only 130 lines of code) code with some std::string functions instead of regular expressions, to see what running time I would get.
The std::string functions I used were the following:

  • std::remove_if()
  • string::erase()
  • std::isdigit()
  • std::isspace()
  • string::substr()
  • string::find()

I guess you get the picture (I won’t show the code because it is contrary to Nand2Tetris’ policy).
The numbers are as follows to assemble a 20000 line-assembly file,
with regular expressions:

and with std::string:

So it is better by 30% with std::string.
My Parser.cpp is about the same size in both cases, but I guess that is because the pattern rules are simple. If they got more complicated, the regular expression version would be cheaper to maintain and not grow so much in size.

Oct 09

Linux application profiling can be spectacular

I now have an assembler that works for Hack assembly files without symbols. This was the first step proposed in project 6 of Nand2Tetris.
However, I really got scared by the assembler’s execution time:

This is to assemble a 20 thousand-line assembly file. To perform the same task, the assembler provided by Nand2Tetris, written in Java, takes a fraction of a second.
This felt like an interesting challenge. My guess was that the bottlenecks could lie in my use of regular expressions (see previous post), or in the file I/O.
I took this as an opportunity to test some profiling tools.
I installed OProfile and used it:

This shows the binaries where my application spent more than 10% of its time. More than 80% is spent in the C and C++ libraries.
Trying to take it one level down:

In theory, this should shows the names of the functions where it spends more than 2% of the time, but for some reason it does not work for the C++ library.
Still, it feels like the regular expressions are a very good candidate.
My first idea of optimization was to construct the regular expressions in the parser’s constructor, instead of doing it in the parsing methods (once per line of assembly code). That change took me 5 minutes, and here is the result:

That’s right, the times are divided by 100! :-)
OProfile now tells us:

Still a lot related to regular expressions, so there is certainly much more to be done. But I am happy for now (I am no longer ashamed). 😉
Edit: for the record, I have also made the regular expressions static const, which in general seem better, but that did not increase performance further.
Edit 2: I have also tried std::regex::optimize, but that does not either affect performance in my case.

Oct 09

What’s in a regular expression?

While working on my Nand2Tetris assembler parser (project 6), I devised with myself about whether or not I would use C++11’s regular expressions to parse assembly lines. I answered yes to that question, because:

  • Regardless of the performance discussion, I like the idea of a language that can match any text pattern. When I am in an environment when such a tool is available, I do not want to reinvent that wheel.
  • While manipulating some text in an editor, I often feel that a certain repetitive manual task would take a few seconds instead of many minutes if I just had a good command of regular expressions. Therefore, this is an opportunity to get better at it.

Spoiler warning: the rest of this post will discuss some details of the assembler parser implementation for Nand2Tetris. While I will mostly focus on the regular expression part, and do not post a complete solution, you should not read it if you want to take a completely fresh start at the project.
The power horse of the Hack computer instruction set (see chapter 6) is the C-instruction.
To quote the book, C-instruction:

Either or. That means not both. I.e. an instruction with neither dest nor jump (that would be a NOP) is invalid.
I am thinking about having one regular expression for the A-instruction, one for the C-instruction, and one for the symbol pseudo-command (that names labels).
My ambition for the parser is just to decode a correctly written line of assembly code:

  • An empty line or a line of comments shall just be discarded, i.e. my regular expressions shall not match them.
  • An A-instruction line shall be identified as such and the symbol string shall be extracted.
  • A C-instruction line shall be identified as such and the dest, comp and jump strings shall be extracted.
  • I do not ambition for my parser to reject an invalid symbol, dest, comp, or jump string. My plan is to let the invoker take care of that. That should make for relatively simple regular expressions and for now, it seems like an acceptable limitation (to be confirmed).

C++11 use a modified ECMAScript regular expression grammar. My reference for this is cppreference.com, but I am also using other sources to help me understand that information, which is quite dry.
A first attempt at a regular expression for the C-instruction could be (I separate the string in several strings in a C/C++ manner, in order to make the structure clearer – also note the C/C++ escaping of “\”, which is not part of the actual regular expression):

Let’s try it:

results in

The first match is the whole expression that matched. The other matches are the extracted strings. This did not work: the comp match also contains the jump part, and the jump match is empty. A reasonable way to prevent the comp part from eating the jump part is probably to just exclude “;” from the class of valid characters for comp. One more try:

results in

That worked. I replaced "\\S+" (1 or more non-space characters) by "[^\\s;]+" (1 or more characters that are neither spaces nor semicolon).
But since our rule for comp is so unspecific, and we require neither dest nor jump, I worry that a line with only one comment and without any spaces would match that rule too:

results in

That’s what I thought. Not good. An obvious way to fix this is to exclude “/” in the same way as we excluded “;”. Given the low level of ambition of our pattern matching for now, it is also a reasonable solution:

gives

It worked. Better get back to my parser now.