Microcontroller Debugging Stories – Software Vendor Perspective

01 February 2019

Most of us here at AVSystem write code that is executed either on powerful servers or inside a web browser. However, we have a small but dedicated team of developers who use a very different set of programming languages and target devices with many orders of magnitude fewer resources – the embedded development team. As part of that team, I am going to write about some of the unique challenges that we solve on an almost daily basis.

The Alignment Misalignment

Most of our team is composed of people who grew up as software developers, mostly writing command-line applications targeting Unix-like operating systems. OSes such as Linux, macOS or even Windows accustom you to a set of guarantees about the environment in which your code will run. And you rarely need to step outside the comfort zone of the Intel and AMD x86 processor lines.

Things are vastly different in the embedded world, though. There is a multitude of processor architectures in use (we’ve worked with different variants of ARM, MIPS and PowerPC), and the code you write runs directly on bare metal. The most you can get is a so-called “Real-Time Operating System” – the most popular being the open-source FreeRTOS and the commercial ThreadX – but calling them “operating systems” is a bit of a stretch. They essentially contain just a thread scheduler and an API that abstracts away the specifics of threading primitives for a given processor; in practice, they offer the programmer little more than e.g. the POSIX Threads API does.

And sometimes they use some very aggressive optimizations to conserve the very little processing power that embedded platforms have.

Consider the following, really simple function:

#include <stdio.h>

void test(void) {
    double value = 1.234567;
    printf("%g\n", value);
}
You probably can’t imagine the above code printing anything other than 1.234567, can you? Well, if you have worked a bit with floating-point numbers, you might suspect some inaccuracy in the FP representation here, but no, this is not the case. The code is indeed supposed to print out 1.234567. Yet, when debugging a problem with one of our projects, we found this function printing really bizarre numbers such as 5.74324e+94.

Then we discovered that it worked properly, but only if called from the main() function. If called from any of the threads created using the RTOS we were using for that project, the problem resurfaced. Oddly enough, we had trouble reproducing this bug using anything other than printf()-style functions.

In the end, we were able to define the problem very strictly: when passing double-precision FP numbers or 64-bit integers to variable-argument functions, their values were getting corrupted.
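The mechanism is easy to simulate on a desktop machine. For 8-byte types such as double, va_arg rounds its internal pointer up to an 8-byte boundary before reading, so if the arguments were actually laid out starting 4 bytes earlier than the callee assumes, the value gets reassembled from the wrong pair of 32-bit words. A host-side sketch of that effect – the stack is modelled by a plain array, and all the names are ours:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Read a double from two consecutive 32-bit "stack slots". */
double read_double_at(const uint32_t *slots) {
    double d;
    memcpy(&d, slots, sizeof d);
    return d;
}

void simulate(void) {
    uint32_t stack[4] = {0, 0, 0, 0};
    double value = 1.234567;

    /* The caller places the double in slots 0-1... */
    memcpy(&stack[0], &value, sizeof value);

    /* ...but a callee that believes the stack is 8-byte aligned, when it
     * is actually off by 4, skips a "padding" slot and reads slots 1-2. */
    printf("correct read:    %g\n", read_double_at(&stack[0]));
    printf("misaligned read: %g\n", read_double_at(&stack[1]));
}
```

The misaligned read stitches together half of the original bit pattern with whatever happens to occupy the next stack slot – producing exactly the kind of nonsense values we were seeing.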

But why? Why only 64-bit values, and why only as varargs?

Well, after several long hours of diving through documentation, we discovered a little bit of information in the Procedure Call Standard for the ARM Architecture, on which we were running our project:

The stack must also conform to the following constraint at a public interface:

  • SP mod 8 = 0. The stack must be double-word aligned.

It turned out that the RTOS was calling the thread entry function with the stack aligned only to 4 bytes instead of 8. This is considered a “public interface” call, because the caller and callee are completely unrelated pieces of code. The compilers, on the other hand, make use of the calling convention invariants and generate vararg handling code that assumes 8-byte alignment. It’s hard to say whether it’s an outright bug in that RTOS, or an overlooked optimization. We were limited to a precompiled, binary build of the RTOS at the time, so in the end we added a few lines of assembly to manually move the stack pointer and called it a day.
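The workaround can be as small as a trampoline that forces the required alignment before the thread body runs. A rough sketch of the idea in ARM assembly – the labels and the exact entry mechanism are ours, and the real code depends on how the RTOS starts its threads:

```
; hypothetical thread entry trampoline
thread_trampoline:
    mov  r4, sp
    bic  r4, r4, #7        ; round down to a multiple of 8
    mov  sp, r4            ; now SP mod 8 = 0, as the AAPCS requires
    b    real_thread_entry ; tail-call the actual entry function
```

Rounding the stack pointer down is safe because the stack grows towards lower addresses, and clobbering r4 is acceptable here since the trampoline never returns.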

Can we trust the compiler? – or a story of corrupted stack

Unlike most programming languages out there, there’s no single “canonical” compiler that almost everyone would use for C and C++. Sure, when writing for full-blown OSes, you usually stick to the system default, which is GCC on Linux, Clang on macOS and Visual Studio on Windows. Or pick GCC or Clang and use it everywhere. And while those can be used for embedded development as well, much of the industry prefers using commercial tools such as the ARM Compiler or IAR Embedded Workbench. We won’t dive into comparing them to free tools or to each other; moreover, their license agreements tend to explicitly prohibit doing such comparisons. The reality, though, is that we – authors of library and middleware code – don’t have the luxury to choose our compilers anyway. We need to support whatever our customers use. With every unique set of quirks that each compiler exhibits.

One of our customers’ requirements was to use one such commercial compiler. Needless to say, it was not even based on any open source solution – everything other than the base ISO-standardized language, from assembler syntax to command-line options, was completely custom and incompatible with any other toolchain.

Fortunately, the custom features were not too insane, and we were able to compile our code (which originally targeted GCC) with little effort. However, after some initial success we discovered that our test application crashed at seemingly random, unpredictable times.

The good news is that using a debugger with embedded systems is usually not especially difficult – most chips support JTAG, and development boards often expose debugging interfaces even over USB. So you can use typical tools such as breakpoints and single-stepping. Sometimes you run into limits (e.g. the maximum number of breakpoints or watchpoints) much sooner than you would on a PC; sometimes the debugger becomes unstable for no apparent reason; and sometimes it behaves in a way that is somewhat logical, but unexpected (e.g. attempting to single-step through a function may put you in the middle of the RTOS’ context-switching code). All that makes the experience a bit more irritating than on a normal computer, but essentially normal debugging is possible.

The microcontroller was entering a state known as a “hard fault”. This usually happens after attempting to access an invalid memory address or executing an invalid instruction. ARM Cortex-M microcontroller cores let you install some simple “hard fault handler” code, which can examine fault conditions such as the last known value of the instruction pointer. In many cases this even allows you to reconstruct the full stack trace inside the debugger.
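On exception entry, Cortex-M hardware pushes eight words – r0-r3, r12, lr, pc and xPSR – onto the active stack, so a fault handler only needs to locate that frame to recover the faulting program counter. A simplified sketch of the frame-decoding part (on the real target, a couple of assembly instructions in the handler would first capture MSP or PSP and pass it here; all names are ours):

```c
#include <stdint.h>
#include <stdio.h>

/* The 8-word frame pushed by Cortex-M hardware on exception entry. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} hw_stack_frame_t;

/* Last known instruction pointer at the time of the fault. */
uint32_t fault_pc(const hw_stack_frame_t *frame) {
    return frame->pc;
}

/* Hypothetical dump routine, called with the stack pointer value
 * captured by the assembly part of the fault handler. */
void hard_fault_dump(const hw_stack_frame_t *frame) {
    printf("hard fault: pc=0x%08lx lr=0x%08lx\n",
           (unsigned long)fault_pc(frame), (unsigned long)frame->lr);
}
```

In our case, it was precisely the pc value recovered this way that made no sense.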

And here comes the bad news – the crashes seemed to come literally out of nowhere. The register values stored by the hard fault handler didn’t make any sense, and what was supposed to be the last instruction pointer was an address completely outside of the known code space. The last log message, printed out to the serial port, was different every time, and there didn’t seem to be any pattern to it. Single-stepping through the code either didn’t reproduce the problem at all, or caused the instruction pointer to land in an invalid range from completely random places. Just what was going on there?

Then we realized that the invalid instruction pointer addresses often looked suspiciously similar to heap data addresses. Could it be that the stack was getting corrupted and the execution was jumping into data memory instead of the code memory? But why would the stack get corrupted?

After countless hours spent single-stepping through code and looking at memory dumps, we started to see a pattern. Sometimes, during execution of random code, part of the stack was getting overwritten with unknown data. There still didn’t seem to be any pattern to the code location where that corruption was happening, but it was a start.

As the crash location seemed to be completely non-deterministic, there was no other choice than to single-step through the entire code. We ended up with a GDB script that performed single-stepping until a suspicious address appeared on the stack. And after leaving it overnight, we finally got a hit that made some sense.
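For the curious, that overnight hunt can be expressed in a few lines of GDB commands. A rough sketch of the approach – this simplified variant watches the link register rather than scanning the whole stack, and the RAM range 0x20000000–0x2001ffff standing in for this particular chip is our assumption:

```
# Single-step until the return address starts to look like a RAM address
define hunt
  while 1
    stepi
    # $lr should point into flash; if it looks like heap data, stop
    if $lr >= 0x20000000 && $lr < 0x20020000
      printf "suspicious lr=0x%08x at pc=0x%08x\n", $lr, $pc
      loop_break
    end
  end
end
```

Such a script is excruciatingly slow over a JTAG link, which is why it had to run overnight.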

The compiler we were using was compiling each function to code that looked like this:

; function prologue
stmdb sp!, {r3, r4, r5, r6, r7, r8, r9, r10, r11, lr}
add.w r11, sp, #36

; ... actual function code ...

; function epilogue
mov sp, r11
sub sp, #36
ldmia.w sp!, {r3, r4, r5, r6, r7, r8, r9, r10, r11, pc}

You may not be familiar with ARM assembly, so here’s a little summary of what is going on:

  1. The stmdb sp! instruction is basically a fancy name for push – it pushes all the registers listed in the curly brackets onto the stack.
  2. The add.w instruction adds 36 to the value of the stack pointer and stores it in the r11 register. The r11 register is canonically used in the ARM ABI as a stack frame pointer (also called fp).

In the end, the situation after the function prologue looks like this:

[Figure: CPU register values pushed onto stack memory after the function prologue]

  3. In the epilogue, mov sp, r11 assigns the value stored in r11 back to the stack pointer.
  4. Then the sub instruction moves it back by 36 bytes.
  5. Finally, ldmia.w sp! pops all the values back from the stack into the registers. Note the pc register where lr was in the stmdb instruction above – ARM processors use lr to store the return address, so the return address is restored into the program counter, which makes the pop instruction double as a function return.
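To make the offsets concrete: the stmdb pushes 10 registers, i.e. 40 bytes, and r11 then ends up 36 bytes above the new stack pointer – pointing at the saved lr slot, 4 bytes below where the stack pointer was on entry. A toy model of that arithmetic (the address is made up):

```c
#include <assert.h>
#include <stdint.h>

enum { SAVED_REGS = 10, WORD = 4 };

/* Effect of "stmdb sp!, {10 regs}" on the stack pointer. */
uint32_t after_push(uint32_t entry_sp) {
    return entry_sp - SAVED_REGS * WORD;
}

/* Effect of "add.w r11, sp, #36". */
uint32_t frame_pointer(uint32_t sp_after_push) {
    return sp_after_push + 36;
}
```

So after the prologue, sp = entry_sp − 40 and r11 = entry_sp − 4, which is exactly the address of the saved lr.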

This looks like a fairly standard stack frame, and indeed it works perfectly on many platforms, including everything that uses Cortex-A and full-fledged operating systems, such as the Raspberry Pi and Android devices. However, the Cortex-M family, being microcontrollers without memory protection features, does something interesting during interrupt handling, as we can read in the Exception entry and return chapter of the Cortex-M4 manual:

When the processor takes an exception, unless the exception is a tail-chained or a late-arriving exception, the processor pushes information onto the current stack. This operation is referred to as stacking and the structure of eight data words is referred to as the stack frame.

Now, look at the assembly above once again. If an exception (i.e., an interrupt) happens between the mov sp, r11 and sub sp, #36 instructions… well, the exception handler’s stack frame will overwrite the data that the ldmia.w instruction is supposed to restore to the registers!

The question remained – why was the compiler generating code that essentially invoked undefined behaviour on the target processor’s architecture? And a compiler sold by the processor core’s developer, at that?

Well, it turned out that this is standard code generated by at least some compilers when an option to “use frame pointer” is enabled. You might consider it a compiler bug, but this code is not wrong when used on Cortex-A processors. You just can’t use it when targeting the Cortex-M family.
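With the frame pointer disabled, the compiler instead emits an epilogue in which the stack pointer only ever moves upwards past data that is already dead. Roughly (our reconstruction, not the literal compiler output, with LOCALS_SIZE standing in for the size of the function’s locals):

```
; function epilogue without a frame pointer
add sp, sp, #LOCALS_SIZE  ; discard local variables only
ldmia.w sp!, {r3, r4, r5, r6, r7, r8, r9, r10, r11, pc}
```

At every point, everything still needed sits at or above SP, so the hardware’s exception stacking – which writes below the current SP – can never clobber live values.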

It turned out that in the toolchain configuration we were using, this option was, for some reason, enabled by default. We needed to explicitly disable it by passing an appropriate command-line option to the compiler. So we did, and the code has worked perfectly ever since.

Conclusion – is embedded development for you?

The challenges associated with embedded development may not be as big as those that modern operating system or virtual machine developers need to face. Microcontrollers don’t have caches that might cause Meltdown-style vulnerabilities, nor big.LITTLE architectures with subtle yet deadly differences between processor cores. Still, we always need to be aware of all the possible consequences of every line of our code, and look out for bizarre bugs such as those described above.

You can look at some of our code (not really tied to embedded development, though) – the majority of our LwM2M client library known as Anjay is open source.

by Mateusz Kwiatkowski
