Jan 30

Less talk, more code: minimalist bare metal programming from scratch episode 0

I have rebooted my software development activities with the STM32F4-Discovery around a (for me) new concept: minimalist bare metal programming from scratch. The idea is to go through the development of an unlimited number of self-contained applications of increasing complexity, starting from scratch.
More on the concept can be found at bare.
See also my Git repo’s feed on this page.
I could post walk-throughs if there is some interest. Drop me a line in that case.

Nov 21

Studying disassembled while_one on the STM32F4-Discovery

In my previous article, I described how I compiled and ran/debugged an empty forever loop in C on the STM32F4-Discovery with the bare necessities (GNU Emacs, GNU make, OpenOCD, GDB). Since that program already exists in the GNU Tools for ARM Embedded Processors samples, I simply use the sample. It is a trivial program, but there is actually quite a lot to learn from it.
The purpose of this activity is to study in detail the disassembled code for that program, and the corresponding source code. Obviously, the interesting part of the source code is not the C code, limited to a main function that contains “for (;;);”, but the start-up code (in assembly in the sample) and the linker script.
The reference document used to interpret the disassembled code is [1] Cortex-M4 Devices Generic User Guide, which I discovered recently and which looks like the perfect reference for software developers, at least as long as one limits oneself to generic Cortex-M4 code. Also, [2] STM32F407VG data-sheet is used for the specific memory map.
As mentioned in a previous article, the disassembled code is the following:

[1] specifies that the vector table is located at address 0x0000 0000. However, [2] specifies that addresses 0x0000 0000-0x000F FFFF are aliased to flash (with our boot pin configuration) and that flash addresses are 0x0800 0000-0x080F FFFF.
Therefore, the beginning of the assembly code above makes sense.

According to [1], the first value in the vector table is the initial stack pointer (SP) value. In our case, this is 0x2002 0000, which according to the memory map in [2] is the address just above the highest position in regular SRAM. This is consistent with [1], which specifies: “The processor uses a full descending stack. This means the stack pointer holds the address of the last stacked item in memory. When the processor pushes a new item onto the stack, it decrements the stack pointer and then writes the item to the new memory location”. At reset, the stack is empty, so the first push will write to the highest word of SRAM.
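To convince myself that an initial SP just above SRAM is correct, here is a toy model of a full descending stack. It is only a sketch: `ToyStack`, `kSramBase` and `kInitialSp` are names I made up, the base address 0x2000 0000 is my reading of the memory map in [2], and nothing here touches real hardware.

```cpp
#include <cstdint>
#include <vector>

// Assumed values: SRAM base per the memory map in [2], initial SP from the vector table.
constexpr std::uint32_t kSramBase  = 0x20000000u;
constexpr std::uint32_t kInitialSp = 0x20020000u;

// Toy model of a full descending stack (hypothetical type, for illustration only).
struct ToyStack {
    std::vector<std::uint32_t> ram =
        std::vector<std::uint32_t>((kInitialSp - kSramBase) / 4, 0u);
    std::uint32_t sp = kInitialSp;  // empty stack: SP is just above the highest SRAM word

    void push(std::uint32_t value) {
        sp -= 4;                             // decrement the stack pointer first...
        ram[(sp - kSramBase) / 4] = value;   // ...then write to the new location
    }

    std::uint32_t pop() {
        std::uint32_t value = ram[(sp - kSramBase) / 4];  // SP points at the last stacked item
        sp += 4;                                          // then move back up
        return value;
    }
};
```

After one push, `sp` is 0x2001 FFFC, which is the last word of SRAM, so the address 0x2002 0000 itself is never written.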

According to [1], this is the reset entry. Also, “reset is invoked on power up or a warm reset. The exception model treats reset as a special form of exception. When reset is asserted, the operation of the processor stops, potentially at any point in an instruction. When reset is deasserted, execution restarts from the address provided by the reset entry in the vector table”. Also according to [1], “The least-significant bit of each vector must be 1, indicating that the exception handler is Thumb code”. In our case, the processor will jump to 0x0800 0048, which is:

We will walk through that code later on. Let’s carry on with the vector table.

According to [1], the entries from 0x0008 to 0x0018 correspond to NMI, hard fault, memory management fault, bus fault, and usage fault, respectively. They all point to 0x0800 0088, which is:

This is a forever empty loop. The form b.n, according to [1], forces a 16-bit instruction (e7fe as we see). I haven’t dived into the binary ISA, but since the same instruction is used for main(), it is obviously a branch to an address relative to the program counter.
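For the curious, the branch target can be worked out by hand. The sketch below assumes the 16-bit unconditional branch encoding as I understand it from the ARMv7-M architecture documentation (top five bits 11100, followed by an 11-bit signed offset counted in halfwords) and the rule that the PC reads as the instruction address plus 4. The function name `thumbBranchTarget` is mine.

```cpp
#include <cstdint>

// Computes the target of a 16-bit Thumb unconditional branch (assumed encoding
// 11100:imm11) located at `address`. Hypothetical helper, for illustration only.
std::uint32_t thumbBranchTarget(std::uint32_t address, std::uint16_t opcode) {
    std::int32_t offset = (opcode & 0x7FF) << 1;  // imm11 is in halfwords, convert to bytes
    if (offset & 0x800) {
        offset -= 0x1000;                         // sign-extend the 12-bit byte offset
    }
    // The PC value used by the branch is the instruction address + 4.
    return address + 4u + static_cast<std::uint32_t>(offset);
}
```

With `opcode = 0xE7FE`, the offset is -4, so `thumbBranchTarget(0x08000088, 0xE7FE)` gives 0x0800 0088 again: the instruction branches to itself, which is exactly a forever empty loop.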
Continuing in the vector table, the three dots symbolize an area that according to [1] is “Reserved”.
0x002c is the SVCall entry, also pointing to 0x0800 0088. 0x0030 is according to [1] “reserved for debug”, 0x0034 is just “reserved”. 0x0038 to 0x0040 are the PendSV, SysTick and IRQ0 entries, respectively, also pointing to 0x0800 0088.
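The layout walked through above can be summarized in a small lookup. This is only a sketch of the offsets listed in this article; `vectorName`, `handlerAddress` and `isThumb` are names I made up, and the value 0x0800 0049 used below is what I would expect the reset entry to hold if the Thumb bit is set and execution starts at 0x0800 0048.

```cpp
#include <cstdint>
#include <string>

// Maps a byte offset in the vector table to the entry name used in this article.
std::string vectorName(std::uint32_t offset) {
    switch (offset) {
        case 0x00: return "Initial SP";
        case 0x04: return "Reset";
        case 0x08: return "NMI";
        case 0x0C: return "Hard fault";
        case 0x10: return "Memory management fault";
        case 0x14: return "Bus fault";
        case 0x18: return "Usage fault";
        case 0x2C: return "SVCall";
        case 0x30: return "Reserved for debug";
        case 0x38: return "PendSV";
        case 0x3C: return "SysTick";
        default:   break;
    }
    if (offset >= 0x40 && offset % 4 == 0) {
        return "IRQ" + std::to_string((offset - 0x40) / 4);  // external interrupts
    }
    return "Reserved";  // 0x1C-0x28, 0x34, and anything not word-aligned
}

// Bit 0 of a vector only flags Thumb code; the handler starts at the even address.
std::uint32_t handlerAddress(std::uint32_t vectorValue) { return vectorValue & ~1u; }
bool isThumb(std::uint32_t vectorValue) { return (vectorValue & 1u) != 0; }
```

For example, `handlerAddress(0x08000049)` is 0x0800 0048 and `vectorName(0x40)` is "IRQ0".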

Let’s now have a look at the reset handler:

It is in fact difficult to interpret without studying the source code, more precisely the assembly file startup_ARMCM4.S, provided in the GNU Tools for ARM Embedded Processors samples for Cortex-M4. I will do that in my next article.
For now, I will conclude this article by saying that the reset handler copies some data from flash to RAM and clears one BSS section (a BSS section is a section of data that is initialized to zero when the program starts). However, the constants located at 0x0800 0078-0x0800 0084, which are the start and end addresses for these sections, are all the same. This implies that the sections have a size of zero words. That is not surprising, since the program does not have any static data.
Lastly, the reset handler executes SystemInit, which returns without doing anything, and branches to main, which is our main empty forever loop.
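In C++ terms, my understanding of what the reset handler does before calling SystemInit and main is roughly the following. It is a sketch and not a translation of startup_ARMCM4.S: the function and parameter names are mine, and plain pointers stand in for the addresses that the linker script would provide.

```cpp
#include <cstdint>

// Copies the initialized data from its load address (flash) to its run address (RAM),
// then zeroes the BSS range. All names are hypothetical.
void initSections(const std::uint32_t* loadAddress,
                  std::uint32_t* dataStart, std::uint32_t* dataEnd,
                  std::uint32_t* bssStart, std::uint32_t* bssEnd) {
    while (dataStart < dataEnd) {
        *dataStart++ = *loadAddress++;  // word-by-word copy, flash to RAM
    }
    while (bssStart < bssEnd) {
        *bssStart++ = 0;                // BSS is zero when main starts
    }
}
```

When the start and end addresses are equal, as in while_one, neither loop body executes, which matches the zero-sized sections observed above.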

Oct 10

Nand2Tetris: assembler implemented and verified (project 6)

Nand2Tetris' assembler/comparator thinks that the 20000-line binary file produced by my assembler for the pong game is correct to the bit, which means that my assembler, although I know it is not even close to being robust, is now good enough for my purpose.
As usual, the book contains a very detailed analysis of the problem to solve, and a clean design proposal. What is left is quite a straightforward implementation. Still, it is not entirely trivial, and one gets the satisfaction of having gone one step further towards the goal of a computer built from Nand gates that will be able to run graphics programs written in a high-level language.
From a software and hardware development process perspective, the course is also very pedagogic, providing the means to test the results of every project. Encouraged by that mindset, I implemented a test class for the assembler parser, which helped me to verify that I had not broken anything when I added more functionality. In fact, I wrote the test cases and ran them before even starting to write the corresponding parser code, so one could say that I applied the principles of test-driven development.
Given the small scope of the project, I implemented support for this little unit testing in my main() function:

In order for PARSERTESTER_HPP to be defined, I only have to add:

This way, I can keep the rest of my file and Makefile structure untouched. When the #include is there, my application will be a unit test application instead of being the full assembler. My test code is written to throw an exception any time a test does not pass. The exception won’t be caught and will lead to a crash of the application. If the test application writes “Test successful”, it means that it ran to completion without hitting a throw. Primitive, but simple.
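To illustrate the throw-on-failure idea, here is a minimal sketch. It is not my actual test class: `isAInstruction` is a made-up stand-in for a parser function, and `expect` and `runParserTests` are hypothetical names.

```cpp
#include <stdexcept>
#include <string>

// Made-up stand-in for a parser function under test.
bool isAInstruction(const std::string& line) {
    return !line.empty() && line[0] == '@';
}

// Throws as soon as an expectation does not hold.
void expect(bool condition, const std::string& what) {
    if (!condition) {
        throw std::runtime_error("Test failed: " + what);
    }
}

// Returns the success message only if no expectation threw on the way.
std::string runParserTests() {
    expect(isAInstruction("@R0"), "@R0 is an A-instruction");
    expect(!isAInstruction("D=M"), "D=M is not an A-instruction");
    expect(!isAInstruction(""), "an empty line is not an A-instruction");
    return "Test successful";
}
```

A test-mode main() would just print the result of `runParserTests()`. If any expectation fails, the uncaught exception terminates the program before the message is printed.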
Most of the time I spent in this project was researching a good solution for the parser in C++ (see my 3 previous articles).
The times I showed in Performance of C++11 regular expressions were for a one-pass implementation of the assembler that had no support for labels.
Interestingly, the times for the complete version, which has two passes, i.e. parses the whole source file twice, are not much longer.
One pass:

Two passes:

It would therefore seem that most of the time is spent in input/output from and to hard disk. A buggy version of the assembler that did not write the output file, and that I happened to time, seemed to show that most of the “sys” time in a working version is spent writing the file to disk. Maybe that could be optimized in some way (I haven’t done the math).
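For readers wondering why labels require two passes, here is a reduced sketch of the idea. It is not my assembler: it only handles A-instructions and (LABEL) declarations, it assumes that whitespace and comments have already been stripped, and it leaves out the predefined symbols (R0, SCREEN, and so on) as well as the translation of C-instructions. The function name `assembleAInstructions` is made up.

```cpp
#include <bitset>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Two-pass symbol resolution, reduced to Hack A-instructions (illustrative sketch).
std::vector<std::string> assembleAInstructions(const std::vector<std::string>& lines) {
    std::map<std::string, int> symbols;

    // Pass 1: a (LABEL) stands for the address of the next real instruction.
    int romAddress = 0;
    for (const std::string& line : lines) {
        if (line.empty()) continue;
        if (line.size() >= 2 && line.front() == '(' && line.back() == ')') {
            symbols[line.substr(1, line.size() - 2)] = romAddress;
        } else {
            ++romAddress;  // A- and C-instructions both occupy one ROM word
        }
    }

    // Pass 2: translate @number and @symbol; unknown symbols become variables from RAM 16.
    std::vector<std::string> output;
    int nextVariable = 16;
    for (const std::string& line : lines) {
        if (line.empty() || line.front() != '@') continue;  // labels and C-instructions skipped
        const std::string operand = line.substr(1);
        int value = 0;
        if (!operand.empty() && std::isdigit(static_cast<unsigned char>(operand[0]))) {
            value = std::stoi(operand);
        } else {
            auto it = symbols.find(operand);
            if (it == symbols.end()) {
                it = symbols.emplace(operand, nextVariable++).first;
            }
            value = it->second;
        }
        output.push_back(std::bitset<16>(value).to_string());  // 0 followed by 15 value bits
    }
    return output;
}
```

The first pass is needed because a label such as LOOP can be used before the line that declares it, so its address is only known once the whole file has been read.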
I will now move on to chapter 7, entitled VM I: Stack Arithmetic. :-)