Jan 30

Less talk, more code: minimalist bare metal programming from scratch episode 0

I have rebooted my software development activities with the STM32F4-Discovery around a (for me) new concept: minimalist bare metal programming from scratch. The idea is to go through the development of an unlimited number of self-contained applications of increasing complexity, starting from scratch.
More on the concept can be found at bare.
See also my Git repo’s feed on this page.
I could post walk-throughs if there is some interest. Drop me a line in that case.

Jan 17

Studying while_one linker script on the STM32F4-Discovery

In my previous article, I presented a quite detailed analysis of the binary produced by compiling the “minimal” code sample from GNU Tools for ARM Embedded Processors.
I concluded that in order to have a complete interpretation, one needed to analyze the source code, more precisely the linker script and the start-up code (in assembly). Note that the “payload”-code for the while_one program is, as its name implies, trivial. The linker script and the start-up code are, on the other hand, not trivial, and I will be analyzing the linker script in the rest of this article.
In order to do that, we need to consult [3] ld documentation (part of GNU Binutils documentation).
I might as well start by commenting on the linker script’s name: nokeep.ld. I could not find a clear comment about that in the code, but comparing the file with gcc.ld, which is used for most samples, shows that the LD command KEEP is used far more often in gcc.ld than in nokeep.ld. We will come back to that command later on.

The linker script starts by including another linker script. This is actually a change that I made, since nokeep.ld and gcc.ld define the same memory regions, which I needed to adapt to the STM32F4-Discovery board. The contents of mem.ld are:

This corresponds to the board’s physical memory map, as specified in [2] STM32F407VG data-sheet.
We will see later on how these memory areas are referred to in the rest of the script.
Next, the linker script includes some introductory comments worth reading:

The “other linker script that defines memory regions FLASH and RAM” is the one included above.
We can see that the rest of the code is supposed to define the symbol Reset_Handler. We will see in the next article that the start-up code (in assembly) does that.
The next row in the linker script is:

According to [3], “The first instruction to execute in a program is called the entry point. You can use the ENTRY linker script command to set the entry point. The argument is a symbol name”. As described in my previous article, Reset_Handler is effectively the start of the first instructions that get executed by the processor.
The rest of the linker script is a single high level block:

According to [3], “The SECTIONS command tells the linker how to map input sections into output sections, and how to place the output sections in memory”. The first output section is:

.text is the name of the output section. As a software engineer, I expect the text section to hold some executable code. We see that it is placed in the FLASH memory region, which seems logical.
Within the curly brackets come some output section commands according to [3].
The first of these is:

If we start by ignoring KEEP we see, according to [3], a fairly typical input section specification that tells the linker to output the .isr_vector sections from all object files (to the .text output section in flash). As we will see in my next article, there is only one such section, defined in the start-up code (in assembly). It is, as discussed in my previous article, the vector table.
As for KEEP, according to [3]: “When link-time garbage collection is in use (`--gc-sections'), it is often useful to mark sections that should not be eliminated. This is accomplished by surrounding an input section’s wildcard entry with KEEP(), as in KEEP(*(.init)) or KEEP(SORT_BY_NAME(*)(.ctors))”.
A quick look at makefile.conf and our Makefile will confirm that we do indeed make use of --gc-sections (to reduce code size). The presence of the vector table is required for Cortex-M4 (see [1] Cortex-M4 Devices Generic User Guide), but the linker does not know that. KEEP is how we force the linker to output that input section anyway.
To be continued (maybe)…

Nov 21

Studying disassembled while_one on the STM32F4-Discovery

In my previous article, I described how I compiled and ran/debugged an empty C forever loop on the STM32F4-Discovery with the bare necessities (GNU Emacs, GNU make, OpenOCD, GDB). Since that program already exists among the GNU Tools for ARM Embedded Processors samples, I simply use it. It is a trivial program, but there is actually quite a lot to learn from it.
The purpose of this activity is to study in detail the disassembled code for that program, and the corresponding source code. Obviously, the interesting part of the source code is not the C-code, limited to a main function that contains “for (;;);“, but the start-up code (in assembly in the sample) and the linker script.
The reference document used to interpret the disassembled code is [1] Cortex-M4 Devices Generic User Guide, that I discovered recently and that looks like the perfect reference for software developers, at least as long as one limits oneself to generic Cortex-M4 code. Also, [2] STM32F407VG data-sheet is used for the specific memory map.
As mentioned in a previous article, the disassembled code is the following:

[1] specifies that the vector table is located at address 0x0000 0000. However, [2] specifies that addresses 0x0000 0000-0x000F FFFF are aliased to flash (in our boot pin case) and that flash addresses are 0x0800 0000-0x080F FFFF.
Therefore, the beginning of the assembly code above makes sense.

According to [1], the first value in the vector table is the initial stack pointer (SP) value. In our case, this is 0x2002 0000, which according to the memory map in [2] is the address just above the highest position in regular SRAM. This is consistent with [1], which specifies: “The processor uses a full descending stack. This means the stack pointer holds the address of the last stacked item in memory. When the processor pushes a new item onto the stack, it decrements the stack pointer and then writes the item to the new memory location”. At reset, the stack is empty.
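The quoted push behaviour can be illustrated with a small host-side model. This is only a sketch: the SRAM base address (0x2000 0000) and size (128 KiB) are assumptions taken from the memory map in [2], and the `Stack` type is a hypothetical name that models memory as a map rather than touching real hardware.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Assumed memory map (from the data-sheet): 128 KiB of regular SRAM
// starting at 0x2000 0000.
constexpr std::uint32_t kSramBase = 0x20000000u;
constexpr std::uint32_t kSramSize = 128u * 1024u;
constexpr std::uint32_t kInitialSp = kSramBase + kSramSize;
static_assert(kInitialSp == 0x20020000u, "matches the first vector table entry");

// Toy model of a full descending stack: SP holds the address of the last
// stacked item, so an empty stack has SP just above the top of SRAM.
struct Stack {
    std::uint32_t sp = kInitialSp;
    std::map<std::uint32_t, std::uint32_t> memory;  // address -> 32-bit word

    void push(std::uint32_t value) {
        sp -= 4;                  // decrement the stack pointer first...
        assert(sp >= kSramBase);  // ...staying inside SRAM...
        memory[sp] = value;       // ...then write to the new location
    }

    std::uint32_t pop() {
        assert(sp < kInitialSp);  // the stack must not be empty
        const std::uint32_t value = memory.at(sp);
        sp += 4;
        return value;
    }
};
```

With this model, the first push writes to 0x2001 FFFC, the highest word of SRAM, which is why the initial SP can legitimately point just outside the RAM region.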

According to [1], this is the reset entry. Also, “reset is invoked on power up or a warm reset. The exception model treats reset as a special form of exception. When reset is asserted, the operation of the processor stops, potentially at any point in an instruction. When reset is deasserted, execution restarts from the address provided by the reset entry in the vector table”. Also according to [1], “The least-significant bit of each vector must be 1, indicating that the exception handler is Thumb code”. In our case, the processor will jump to 0x0800 0048, which is:
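The relation between a vector table entry and the address the processor actually jumps to can be sketched as follows. The entry value 0x0800 0049 is an assumption derived from the two statements above (handler at 0x0800 0048, least-significant bit set); the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// True if the vector marks its handler as Thumb code (bit 0 set).
constexpr bool is_thumb_vector(std::uint32_t vector) {
    return (vector & 1u) != 0u;
}

// Address of the handler's first instruction: the vector with bit 0 cleared.
constexpr std::uint32_t handler_address(std::uint32_t vector) {
    return vector & ~1u;
}

static_assert(is_thumb_vector(0x08000049u), "reset vector has the Thumb bit set");
static_assert(handler_address(0x08000049u) == 0x08000048u,
              "execution starts at the even address");
```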

We will walk through that code later on. Let’s carry on with the vector table.

According to [1], the entries from 0x0008 to 0x0018 correspond to NMI, hard fault, memory management fault, bus fault, and usage fault, respectively. They all point to 0x0800 0088, which is:

This is a forever empty loop. The form b.n, according to [1], forces a 16-bit instruction (e7fe as we see). I haven’t dived into the binary ISA, but since the same instruction is used for main(), it is obviously a branch to an address relative to the program counter.
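For the curious, the e7fe encoding can be decoded by hand. The sketch below assumes the 16-bit Thumb unconditional branch layout from the ARMv7-M architecture manual (top five bits 11100, followed by an 11-bit signed halfword offset) and the rule that the PC reads as the instruction address plus 4; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// True if the opcode is a 16-bit Thumb unconditional branch (11100:imm11).
inline bool is_b_n(std::uint16_t opcode) {
    return (opcode >> 11) == 0x1C;
}

// Signed byte offset encoded in the instruction.
inline std::int32_t branch_offset(std::uint16_t opcode) {
    assert(is_b_n(opcode));
    std::int32_t imm11 = opcode & 0x7FF;
    if (imm11 & 0x400) {
        imm11 -= 0x800;  // sign-extend the 11-bit field
    }
    return imm11 * 2;    // the field counts halfwords
}

// Branch target for an instruction located at `address`.
inline std::uint32_t branch_target(std::uint32_t address, std::uint16_t opcode) {
    // The PC value used by the branch is the instruction address + 4.
    return address + 4u + static_cast<std::uint32_t>(branch_offset(opcode));
}
```

For e7fe the offset field is 0x7fe, i.e. -2 halfwords or -4 bytes, so the target is the address of the instruction itself: a branch to self, whatever the address.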
Continuing in the vector table, the three dots symbolize an area that according to [1] is “Reserved”.
0x002c is the SVCall entry, also pointing to 0x0800 0088. 0x0030 is according to [1] “reserved for debug”, 0x0034 is just “reserved”. 0x0038 to 0x0040 are the PendSV, SysTick and IRQ0 entries, respectively, also pointing to 0x0800 0088.
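The offsets walked through above can be summarized as a struct whose layout the compiler checks. This is a sketch for illustration only: the member names are hypothetical, and the sample's start-up code defines the table in assembly, not like this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Layout of the beginning of the Cortex-M4 vector table, up to IRQ0.
struct VectorTable {
    std::uint32_t initial_sp;      // 0x0000: initial stack pointer value
    std::uint32_t reset;           // 0x0004
    std::uint32_t nmi;             // 0x0008
    std::uint32_t hard_fault;      // 0x000c
    std::uint32_t mem_manage;      // 0x0010
    std::uint32_t bus_fault;       // 0x0014
    std::uint32_t usage_fault;     // 0x0018
    std::uint32_t reserved0[4];    // 0x001c-0x0028: reserved
    std::uint32_t sv_call;         // 0x002c
    std::uint32_t debug_reserved;  // 0x0030: reserved for debug
    std::uint32_t reserved1;       // 0x0034: reserved
    std::uint32_t pend_sv;         // 0x0038
    std::uint32_t sys_tick;        // 0x003c
    std::uint32_t irq0;            // 0x0040
};

static_assert(offsetof(VectorTable, usage_fault) == 0x18, "usage fault entry");
static_assert(offsetof(VectorTable, sv_call) == 0x2c, "SVCall entry");
static_assert(offsetof(VectorTable, pend_sv) == 0x38, "PendSV entry");
static_assert(offsetof(VectorTable, irq0) == 0x40, "IRQ0 entry");
static_assert(sizeof(VectorTable) == 0x44, "17 words in total");
```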

Let’s now have a look at the reset handler:

It is in fact difficult to interpret without studying the source code, more precisely the assembly file startup_ARMCM4.S, provided in GNU Tools for ARM Embedded Processors samples for Cortex-M4. I will do that in my next article.
For now, I will conclude this article saying that the reset handler copies some data from flash to RAM and clears one BSS section (a BSS section is a section of data that is initialized to zero when the program starts). However, the constants located at 0x0800 0078-0x0800 0084, which are the start and end addresses for these sections, are all the same. This implies that the sections have a size of zero words. That is not surprising, since the program does not have any static data.
Lastly, the reset handler executes SystemInit, which returns without doing anything, and branches to main, which is our main empty forever loop.
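The two memory operations of the reset handler can be sketched in C++ rather than assembly. The function and parameter names below are hypothetical; in the real start-up code the boundaries come from symbols defined by the linker script, and here they are plain pointers so that the logic can be tried on a host machine.

```cpp
#include <cassert>
#include <cstdint>

// Copies initialized data from its load address (flash) to its run-time
// address (RAM), one word at a time.
inline void copy_data(const std::uint32_t* load,
                      std::uint32_t* start, std::uint32_t* end) {
    while (start < end) {
        *start++ = *load++;
    }
}

// Clears a BSS section, one word at a time.
inline void clear_bss(std::uint32_t* start, std::uint32_t* end) {
    while (start < end) {
        *start++ = 0u;
    }
}
```

When the start and end addresses are equal, as they are for while_one, both loops exit immediately and nothing is written.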

Nov 14

Back to while_one project on the STM32F4-Discovery

I have left my STM32F4-Discovery in its box for a long time while, among others, working on Nand2Tetris, but I have been missing it. I would now like to rebuild the while_one project from scratch and continue from there, with only the bare necessities:

the two latter simply being unpacked in my home directory, with the purpose of serving as code copy/paste sources, my idea being to include as little generic code as possible in my projects, in order to keep control over it. The tool chain from “GNU Tools” is of course also my tool chain.
I basically run the same procedure as described in Running ARM samples on the STM32F4-Discovery, except that I run GDB in Emacs (M-x gdb, with the command edited to arm-none-eabi-gdb -i=mi). I also change the original ARM Makefile to compile with debugging symbols (see Stm32F4DiscoveryTest).
I can then step through the source code, both the startup assembly code and the C-code in Emacs by using stepi in GDB.
Note: I finally keep the structure provided by the samples in GNU Tools for ARM Embedded Processors because it has a simple Makefile hierarchy, and seems to limit boilerplate code to a minimum. My intention is to build further from minimum.c, which basically is a “while one” program (it is actually a “for (;;);” program).

Oct 24

OpenGL SuperBible: first example under Xlib

I have started to read the OpenGL SuperBible 6th edition, which is apparently a good reference on the topic (I have never used OpenGL in my life, although I did some C++ programming on a Silicon Graphics Indigo in 1993…).
I tried to compile the example code, which failed, apparently because my version of GLFW is too new.
That did not discourage me for long, since I would rather avoid non-strictly necessary libraries like GLFW anyway.
Starting with the code from OpenGL’s own “Tutorial: OpenGL 3.0 Context Creation (GLX)”, I read the beginning of the book and tried to compile the first example, that is supposed to display a window full of red color.
The code from the book is:

That’s the kind of “hello world” I don’t really like, because it really is very far from a “hello world”. It assumes that one pulls in a whole header file supplied with the book, and a lot is going on that one does not control. It looks like this render() function is some kind of callback that gets called once in a while.
Instead, I tried to just copy/paste the contents of render() in my own startup code, mentioned above. That did not work at once, but I managed to figure it out.
To even start with OpenGL, one needs to understand the GLEW library concept (or some equivalent, but GLEW really seems to be the most common). The issue solved by GLEW is that quite a few of today’s common OpenGL functions, like glClearBufferfv(), are considered extensions that may or may not be implemented by the GPU drivers, and that need to be resolved at runtime.
This is what GLEW does. It exposes the whole OpenGL API through a single #include <GL/glew.h>, and takes care of the rest via the GLEW library (which one needs to link – gcc‘s -lGLEW option will do that).
But it will not do its job if it is not first initialized (glewInit()) after an OpenGL context has been made current (glXMakeCurrent(display, win, ctx) in my case).
Additionally, in the setup mentioned above from “Tutorial: OpenGL 3.0 Context Creation (GLX)”, one has two buffers: a front one and a back one. It looks like the parameter 0 in glClearBufferfv(GL_COLOR, 0, red) refers to the back buffer, which is really what one wants. After it has been updated, one needs to swap the buffers with glXSwapBuffers(display, win).
When all that is in place, the code works, and one gets to see this:
[Screenshot, 2014-10-24: a window filled with red]
You will find the code there. git clone it, run make and ./OpenGlTest under OpenGlTest.

Oct 22

Nand2Tetris: project 10 completed

I have now implemented and tested the compiler front end for the Jack compiler. The Jack language is object-based, without support for inheritance. Its grammar is mostly what the book calls LL(0) (a property more commonly called LL(1), for one token of lookahead), which means that in most cases, looking at the next token is enough to know which alternative to choose for so-called “non-terminals”.
My final implementation is a classical top-down recursive implementation, as proposed in chapter 10.
That is however after a re-factoring of a previous version, where I tried to apply the principle that an element should itself assess whether it is of a given type. All my compileXxx() functions would return a boolean indicating whether or not the current element is of type Xxx, with the side effect of generating the compiled code (XML for this chapter – real VM code in the next). The compileXxx() functions are then predicates, which I found kind of neat. It felt like programming Prolog in C++. I had a version of the compilation engine built on that principle that passed all the tests (which are, for the purpose of this chapter, comparisons of XML-output).
However, I later on realized that the underlying principle is just wrong from an LL(0)-perspective. LL(0) says that when there is an alternative in a grammar rule, e.g.:

which means that there may or may not be an expression in a return statement, the return statement level knows by a lookup of the next token whether or not there is an expression. This is the case here: there will be an expression if and only if the next token is not ‘;’.
With my predicate principle, the compileExpression() would in itself have to decide whether the current element is an expression or not. This in fact happens to be much harder than checking whether or not the next token is a semicolon (an expression may occur in other contexts than “return”, so it cannot check on semicolon).
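The lookahead approach for the return statement can be sketched as follows. This is a deliberately reduced illustration, not my actual compilation engine: the `Parser` class is hypothetical, tokens are plain strings, an expression is assumed to be a single token, and the XML output is simplified.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

class Parser {
public:
    explicit Parser(std::vector<std::string> tokens) : tokens_(std::move(tokens)) {}

    // returnStatement: 'return' expression? ';'
    // The statement level decides, by peeking at the next token, whether an
    // expression follows; compileExpression() is only called when it does.
    std::string compileReturn() {
        expect("return");
        std::string out = "<returnStatement>";
        if (peek() != ";") {
            out += compileExpression();
        }
        expect(";");
        return out + "</returnStatement>";
    }

private:
    // Simplification: an expression is a single token here.
    std::string compileExpression() {
        return "<expression>" + next() + "</expression>";
    }

    const std::string& peek() const {
        if (pos_ >= tokens_.size()) {
            throw std::runtime_error("unexpected end of input");
        }
        return tokens_[pos_];
    }

    std::string next() {
        std::string token = peek();
        ++pos_;
        return token;
    }

    void expect(const std::string& token) {
        if (next() != token) {
            throw std::runtime_error("expected " + token);
        }
    }

    std::vector<std::string> tokens_;
    std::size_t pos_ = 0;
};
```

Note that compileExpression() never has to answer the question “is this an expression?”; it is only asked to compile one.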
In other words, even if my code worked, I would not have been able to sleep at night if I had not done a re-factoring. It was actually quite easy, albeit time-consuming and boring.

Oct 17

Nand2Tetris: project 9 completed (I guess)

The purpose of Nand2Tetris’ project 9 was to get to know the Jack language, a simple object-based programming language that we will write a compiler for in project 10 and project 11. Writing and testing a program in Jack was the way to get acquainted with the language.
I did write and test a short and silly Jack program, for the sake of it, but I am more interested in the compiler part, that I will now move on to.

Oct 17

Nand2Tetris: project 8 completed

With project 8 completed, I now have a Virtual Machine translator that takes any VM program as an input and outputs a corresponding Hack assembly file (see project 4) that can be run on the Hack CPU simulator. Since I had a few bugs, I ended up stepping through some code at the assembly level for the recursive Fibonacci example, which was an interesting exercise of concentration and patience.
The virtual machine in question is a single-stack machine that provides support for function calling, including recursion.
After having implemented it, one feels quite at home reading section 2.1.1 Single vs. multiple stacks from the book Stack computers: the new wave by Philip Koopman (it is from 1989, so new is relative, but it is available online and it is one of the very few publications available about stack machine hardware).
Quoting the section:

An advantage of having a single stack is that it is easier for an operating system to manage only one block of variable sized memory per process. Machines built for structured programming languages often employ a single stack that combines subroutine parameters and the subroutine return address, often using some sort of frame pointer mechanism.

This “sort of frame pointer mechanism” is precisely what I have implemented in project 8. In our case, the stack machine is not built in hardware, it is implemented in the form of a translator to the machine language of a simple 16-bit register based CPU. It could however be directly built in hardware, as the many examples given in Stack computers: the new wave show. I suppose a very interesting project following this course would be to implement the VM specification of chapter 7 and chapter 8 in the HDL language in the same way as the Hack CPU was built in project 5. I am not sure how much the ALU would have to be modified to do that.
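To make the mechanism concrete, here is a toy host-side model of a single stack that holds the arguments, the return address and the caller’s frame pointers. It is a simplified sketch, not my translator: the `Machine` type is hypothetical, the THIS and THAT segments of the real VM are left out, and pointers are stored as indices into a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Machine {
    std::vector<int> stack;
    std::size_t arg = 0;  // index of the current function's first argument
    std::size_t lcl = 0;  // index of the current function's first local

    void push(int value) { stack.push_back(value); }

    int pop() {
        assert(!stack.empty());
        const int value = stack.back();
        stack.pop_back();
        return value;
    }

    // Call: the caller has already pushed n_args arguments.
    void call(std::size_t n_args, int return_address, std::size_t n_locals) {
        assert(stack.size() >= n_args);
        const std::size_t new_arg = stack.size() - n_args;
        push(return_address);
        push(static_cast<int>(lcl));  // save the caller's frame pointers
        push(static_cast<int>(arg));
        arg = new_arg;
        lcl = stack.size();
        for (std::size_t i = 0; i < n_locals; ++i) {
            push(0);                  // locals start at zero
        }
    }

    // Return: the callee's return value is on top of the stack.
    // Restores the caller's frame and returns the saved return address.
    int ret() {
        const int value = pop();
        const std::size_t frame = lcl;
        assert(frame >= 3);
        const int return_address = stack[frame - 3];
        const std::size_t result_slot = arg;
        lcl = static_cast<std::size_t>(stack[frame - 2]);
        arg = static_cast<std::size_t>(stack[frame - 1]);
        stack.resize(result_slot);    // drop arguments, saved state and locals
        push(value);                  // the result replaces the arguments
        return return_address;
    }
};
```

Because every call saves the previous frame on the same stack, nested and recursive calls work without any extra bookkeeping.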
I will keep this project idea in the back of my mind for now and move on to chapter 9, where we study “Jack”, a simple object oriented high level language, that we will in later chapters write a compiler for. The compiler will use the VM translator implemented in chapter 7 and chapter 8 as a back end.

Oct 15

Nand2Tetris: project 7 completed

I have now implemented a translator for a part of the virtual machine that is used in Nand2Tetris.
A point of the virtual machine language in the course is to be used as an intermediate between high level language and assembly, in the compiler to be designed in later chapters. The virtual machine translator translates VM instructions to assembler. Its implementation is split between project 7, which I have now completed, and project 8.
The virtual machine is stack-based, which suits my personal taste (as mentioned in a previous post, I have inherited that taste from my use of RPN on HP calculators since the 80s).
The design of the virtual machine specification feels, as all concepts I have so far gone through in this course, elegant and as simple as it can be.
It features:
1. the basic arithmetic and logic operations (the same as the ALU and CPU previously designed),
2. push and pop to transfer data between RAM and the stack,
3. program flow commands,
4. function call commands
Project 7 implements 1 and 2.
Since the basic syntax of the VM language is always the same, the parsing is in fact simpler than the assembly parsing of project 6. The interesting part is the assembly code output, where there is potential for optimizing the number of assembler commands generated for a given VM command. I have myself worked very little on optimization because I would rather carry on, but I might come back to it later on.
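To give an idea of the output, here is a sketch of how two VM commands could be translated to Hack assembly. The function names are hypothetical and the instruction sequences are one possible choice among several, not necessarily what my translator emits.

```cpp
#include <cassert>
#include <string>

// "push constant N": *SP = N; SP++
inline std::string translate_push_constant(int value) {
    assert(value >= 0 && value <= 32767);  // range of an A-instruction constant
    return "@" + std::to_string(value) + "\n"
           "D=A\n"     // D = N
           "@SP\n"
           "A=M\n"     // A = address of the top of the stack
           "M=D\n"     // *SP = N
           "@SP\n"
           "M=M+1\n";  // SP++
}

// "add": pops y, then replaces x with x + y in place.
inline std::string translate_add() {
    return "@SP\n"
           "AM=M-1\n"  // SP--, A = address of y
           "D=M\n"     // D = y
           "A=A-1\n"   // A = address of x
           "M=D+M\n";  // x = x + y
}
```

The add sequence shows the kind of saving that is possible: by updating x in place, it avoids a separate pop followed by a push.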

Oct 14

Manipulating directories in C++ under Linux

Currently working on Nand2Tetris project 7, I ran into the need to accept a directory name as an argument to a C++ application. I looked for a standard way to deal with directories in C++11, but that is unfortunately not part of the standard. Since I am running Linux, I do not have to suffer that much. I implemented a solution based on the Linux/POSIX API calls opendir()/readdir()/closedir() to list files in the directory, and realpath() to get the absolute path (since I needed to extract the directory name regardless of how it was pointed to).
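The approach can be sketched as follows. This is a minimal illustration with hypothetical free functions (`list_entries`, `directory_name`), not the original static member functions; it assumes a POSIX.1-2008 system, where realpath() allocates the result when passed a null buffer.

```cpp
#include <cassert>
#include <cstdlib>
#include <dirent.h>
#include <stdexcept>
#include <string>
#include <vector>

// Names of the entries in `dir` whose name ends with `suffix` (e.g. ".vm").
inline std::vector<std::string> list_entries(const std::string& dir,
                                             const std::string& suffix) {
    DIR* handle = opendir(dir.c_str());
    if (handle == nullptr) {
        throw std::runtime_error("cannot open directory " + dir);
    }
    std::vector<std::string> names;
    while (const dirent* entry = readdir(handle)) {
        const std::string name = entry->d_name;
        if (name.size() >= suffix.size() &&
            name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0) {
            names.push_back(name);
        }
    }
    closedir(handle);
    return names;
}

// Last component of the absolute path of `path`, so that ".", "foo/" and
// "../foo" all yield the directory's real name.
inline std::string directory_name(const std::string& path) {
    char* resolved = realpath(path.c_str(), nullptr);  // malloc'ed by realpath
    if (resolved == nullptr) {
        throw std::runtime_error("cannot resolve " + path);
    }
    const std::string absolute(resolved);
    std::free(resolved);
    return absolute.substr(absolute.rfind('/') + 1);
}
```

For example, list_entries(".", ".vm") would return the VM files of the current directory, and directory_name(".") its name (an empty string for the root directory).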
I ended up with two relatively simple static member functions. I could probably have made them even more general, but this is good enough for me:


You will find information about the necessary #includes in your favorite man page database.
If you spot a bug, please leave a reply.