Why I spent my Christmas Day building native Arm compilers

My day job involves, among a fairly wide variety of other things, rebuilding compilers. So why did I find myself doing the same thing on my day off?

What I did in my holidays

By Daniel Thompson, aged 40

I had been looking forward to Christmas more than usual this year. In November we bought some drawers with frosted glass fronts and I had been designing a custom lighting system for them in my mind during long hours washing up through most of the pre-Christmas period.

On Christmas Day I finally got the chance to hook up my strings of WS2812 LEDs and start hacking. With many LEDs carefully wrapped around, and paper clipped to, a bog roll and with probes dangling off everywhere, I decided it was far too fragile to hook up my laptop, so I grabbed a spare Dragonboard 410C before carefully balancing everything and connecting it up to my longstanding Cortex-M3 based microcontroller board of choice.

It was only when I tried to do the initial flashing that I realised there was no arm-none-eabi-gdb in Debian Buster/arm64. There’s a cross-compiler but no debugger. Without a second thought I rocked over to download the Arm embedded toolchain from the Arm website but found the cupboard was bare, or at least there were no AArch64 compiler binaries I could use.

After a brief detour downloading the source code on the DB410C, I decided that building a complex multilib toolchain on a four-core A53 system (with a tiny 8G eMMC) was insane. Shortly afterwards I went upstairs and woke my Developerbox from its Christmas afternoon nap (actually it had been napping since I stopped work a few days earlier).

After wrangling LXC to create the Ubuntu 14.04 container needed for the build I set about trying to recompile the toolchain. Arm’s instructions were awesome and even mentioned how to prevent any attempt to build using mingw32 (which would have been doomed to fail when building on AArch64).

After that there was just enough time for a nice game of Good News, Bad News to complete the holiday period.

Good news: Arm’s instructions worked perfectly and a brand new toolchain emerged from the other end of the sausage grinder.

Bad news: It took my Developerbox over 8 hours to grind through all the builds and as a result I didn’t get to play with my new toys until Boxing Day.

Good news: (Some of) my software worked first time.

Bad news: Except for some apparent toolchain bugs that I haven’t yet had the chance to check (during working hours) to find out if they are AArch64 specific ;-).

Just another day in the life of an Arm-on-Arm developer!


Experimenting with 64k pages for AArch32 code

Someone asked me about 64K pages and the AArch32 ABI again recently. It’s a question that has passed across my desk multiple times and even followed me through multiple companies. Given that long history, and the changes made to the Arm toolchains to ensure freshly built ELF binaries can be loaded, I was interested to see whether Debian 9 (Stretch) armhf userspace would run on a machine with 64K pages. I also had access to a Developerbox to help me indulge my curiosity. The short answer is that it is *not* possible to boot a Debian Stretch armhf container on a machine with 64K pages because the kernel cannot map the init process… but that is only half the story; it really was very close to working!



Running an ISO installer image for arm64 (AArch64) using QEMU and KVM

Scattered across myriad blogs around the internet you will find many different ways to boot GNU/Linux for arm64 (a.k.a. AArch64) using QEMU with or without KVM. However, when I recently wanted to quickly spin up a KVM VM on my Developerbox using the Debian Installer ISO images, I couldn’t find any end-to-end instructions. There is lots of great information out there but I had to assemble the fragments myself. Having done that I thought I would share the results…



Getting started with GStreamer 1.0 and Python 3.x

Way back in the mists of time (or a little over nine years ago if you prefer), Jono Bacon wrote a very detailed blog post describing how to use GStreamer with Python.

Getting started with GStreamer with Python

Mr. Bacon went into a lot of detail, so much so that now, almost ten years later it is still widely credited in other blog posts and remains highly ranked by search engines.

However both Python and GStreamer have moved on a bit over the last decade. The bindings too have moved on a lot as they now use the almost unspeakably awesome PyGObject to automatically generate most of the bindings by introspection.

In short, Jono’s code doesn’t work any more. However it doesn’t take much work to massage the first example until it does.

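#!/usr/bin/env python3

import gi
gi.require_version('Gst', '1.0')
gi.require_version('GstBase', '1.0')
gi.require_version('Gtk', '3.0')
from gi.repository import GObject, Gst, GstBase, Gtk

class Main:
    def __init__(self):
        # Initialise GStreamer before creating any elements
        Gst.init(None)

        self.pipeline = Gst.Pipeline.new("mypipeline")

        self.audiotestsrc = Gst.ElementFactory.make("audiotestsrc", "audio")
        self.pipeline.add(self.audiotestsrc)

        self.sink = Gst.ElementFactory.make("autoaudiosink", "sink")
        self.pipeline.add(self.sink)

        # Connect the test tone to the sink and start playing
        self.audiotestsrc.link(self.sink)
        self.pipeline.set_state(Gst.State.PLAYING)

start = Main()
Gtk.main()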
Roughly speaking, “all” we have had to do to update this example is:

  1. Update the imports to gather everything we need from the gi module.
  2. Add: Gst.init(None) (this should probably be Gst.init(sys.argv) but that’s not how the original code behaves so it’s not in this port either)
  3. Replace the lowercase g with an uppercase G in both Gst and Gtk.
  4. Tweak the Gst.ElementFactory and Gst.State calls; these were in a flatter namespace in the older PyGst bindings.
  5. Replace alsasink with autoaudiosink. Strictly speaking this is not required; alsasink will still work just fine. However autoaudiosink can adopt pulseaudio when available, which is something else that has changed since this code was originally written.

… and that’s it. Not much to it really. Hopefully it’s enough to set you on your way if you want to pull ideas from old tutorials and blog posts into your own shiny new GStreamer application.

Happy hacking!


How to make a Dragonboard 410c run “headless”

The Dragonboard 410c is a great platform but currently there is a problem with the wireless driver that makes it very hard to run it headless and connect to services running on the board via wifi.

At present the board does not respond correctly to broadcast packets. This results in a number of issues but by far the most significant is that the board cannot reply to ARP requests and this prevents other devices from opening connections to it. From the client machine’s point of view it knows the board is there, and it can look it up using DNS, but it can’t take the final steps needed to communicate with the board.

In the example below we have a client machine, birch.lan, unable to access dragonboard.lan, because it cannot find the right MAC address.

birch# arp -e
Address         HWtype HWaddress
router.lan      ether  40:ba:fa:xx:xx:xx
dragonboard.lan        (incomplete)
wychelm.lan     ether  fc:aa:14:xx:xx:xx

If only we could get the MAC address into birch’s ARP cache then there would be no problem doing basic network activity like connecting to an SSH server. Thankfully there is a fairly easy way to automate this…

Get the Dragonboard to do a nmap ping-sweep of the subnet.

To be clear this is a hack of such huge proportions that it deserves more than just those four letters! It is a bodge, a sticking plaster, a nasty solution stuck together with sticky tape and glue, it does not make me feel warm and snuggly inside… however… it does work.

Not only does it work but it can also be trivially hooked up to systemd so that the device can periodically shout out its presence.

Firstly we need to create a simple service to launch the ping sweep (actually it will be much more like an arping sweep):

# /etc/systemd/system/share-mac.service
# Running an nmap ARP ping sweep will encourage other
# devices to cache this board's MAC address. This
# works around a problem where the Dragonboard
# 410c does not correctly respond to ARP
# requests.
# Note that although nmap's port scanning is
# disabled it remains possible that the host
# discovery protocol (which is not *actually*
# as simple as a ping sweep) may still trigger
# an IDS if run on a corporate network.
# Take care!

[Unit]
Description=Send our MAC address to other devices

[Service]
# Change the IP address range as needed...
ExecStart=/usr/bin/nmap -PR -sn -n -e wlan0

The above service runs just once and it would be enough to get the board noticed after it boots but eventually the other devices would evict the ’410c from their ARP caches and it would fall off the network again. To solve this we can use a systemd timer to ensure we always keep poking the other devices:

# /etc/systemd/system/share-mac.timer

[Unit]
Description=Repeatedly send out our MAC address

Finally we need to run a few commands on the ’410c to install nmap and ensure systemd runs the above files:

apt-get update
apt-get install nmap
systemctl enable share-mac.timer
systemctl start share-mac.timer

With these changes in place the ’410c should announce its existence to all hosts on the network shortly (~30 seconds) after booting and it will repeat the sweep every 5 minutes to ensure the ARP caches stay nice and hot going forward.

The results of the above scripts showed up straight away in birch.lan’s ARP cache:

birch# arp -e
Address         HWtype HWaddress
router.lan      ether  40:b0:fa:xx:xx:xx
dragonboard.lan ether  02:00:7d:xx:xx:xx
wychelm.lan     ether  fc:aa:14:xx:xx:xx

Share and enjoy!

Update #1 (2016-01-15): Updated the nmap command with -PR (ensure pure ARP ping sweep), -n (no reverse DNS lookup) and -e wlan0 (only deploy workaround on a specific network interface). Reduced the interval between sweeps from 5 mins to 3 mins to better handle congested networks (10 minute ARP cache timeouts are common). Thanks to snowbird and ldts-jro for their feedback with this.


Debugging ARM kernels using fast interrupts [LWN.net]

I suspect that there are relatively few regular readers of this blog. However if you are one of them and are fed up with hastily written articles with inadequate proof reading then may I recommend you take a look at my recent article for lwn.net describing some of my recent Linux kernel work:

Debugging ARM kernels using fast interrupts

Not only did I proof read it, proof read it and proof read it again but the terrific folks over at lwn.net did the same resulting in an article I’m really proud of.

PS if you are not an lwn.net subscriber then you’ll have to wait until next week to read it…


How loop optimization in GCC uses undefined behaviour to make inferences

There are many ways to try and understand how an optimizing C compiler might transform your code.

Let’s start by thinking about the goal of an optimizing compiler. Its goal is to manipulate the code that it is presented with, making modifications to cause it to execute more quickly and without altering the observable behaviour of a correctly formed program.

It is very important when describing the effects of an optimizer to remember that its only obligation is to preserve the behaviour of a correctly formed program. If your program is not correctly formed (normally expressed by compiler writers as relying upon undefined behaviour) then the optimizer is allowed to make modifications. That’s it! These modifications need not preserve observable behaviour. They don’t have to execute more quickly. They don’t have to try to guess the programmer’s intent. The compiler is allowed to make modifications of more or less any form.

Over the years optimizing compilers have come to make inferences based on undefined behaviour. One of the simplest ones occurs when a pointer is dereferenced. In this case the compiler knows that the pointer cannot be NULL (because it has already been dereferenced and dereferencing a NULL pointer results in undefined behaviour). As a result it can then treat any later checks for pointer validity as unreachable code and remove them.
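To make that inference concrete, here is a minimal sketch written as C++ (the function names are hypothetical and are not taken from any real code base). In the first function the dereference comes before the check, so an optimizer is entitled to assume the pointer is non-null and may delete the check. The second function tests first and so keeps well defined behaviour for a null argument.

```cpp
#include <cassert>

// Risky ordering: the pointer is dereferenced before it is tested.
// After `*p` the compiler may assume p is not null, so the later
// check can be treated as unreachable code and removed.
int first_or_default_unchecked(const int* p)
{
    int value = *p;        // undefined behaviour if p is null
    if (p == nullptr)      // may be optimized away
        return -1;
    return value;
}

// Safe ordering: test first, dereference only when the pointer is valid.
int first_or_default_checked(const int* p)
{
    if (p == nullptr)
        return -1;
    return *p;
}
```

Only the checked variant is safe to call with a null pointer; whether a given compiler really drops the check in the unchecked variant depends on its version and optimization level.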

Recently GCC gained a much more powerful mechanism to track undefined behaviour that occurs due to out of range loop indices. Having read about this and looked at a few spurious compiler bug reports I settled down and wrote the following plausible but buggy code.

Note: If you like puzzles then don’t scroll down past the return 0; statement and closing brace. That way you can try and spot the bug yourself. As far as I know there is only one in there!

int is_whitespace(char c)
{
    const char lookup_table[] = { ' ', '\t', '\n' };

    for (int i=0; i<=sizeof(lookup_table); i++)
        if (c == lookup_table[i])
            return 1;

    return 0;
}

Spotting the error is hopefully the easy part.

If you need a clue then be reassured that the lookup table contains three elements (it is initialized from character literals rather than a string literal so it does not get NUL-terminated).

If you just want to know what the bug is so you can keep reading this article then have a look at the loop exit condition. It uses <= rather than < meaning the loop exit condition is reached after four cycles round the loop. This results in reading after the end of the array leading to undefined behaviour.

Now for the hard bit. What do you expect to happen when you run this code?

One answer, and one that I might have given had I been asked this question instead of writing it is: “strictly speaking this is undefined so it needs fixing, however the padding put in by the linker means that on most systems lookup_table[3] will evaluate to ‘\0’ and so the function will work more or less as intended (assuming the caller never cares whether the ‘\0’ character is whitespace or not)”.

That answer might even have been adequate two years ago. However with modern compilers it is better to rely on the simpler answer regarding what can happen: “anything”.

Nevertheless, even knowing all of the above the transformation the compiler actually makes may still yet surprise you. Unconditionally the code comes out as:

int is_whitespace(char c)
{
    return 1;
}

If you don’t believe me look at the assembler (generated using gcc-4.8.2 with -O2 -std=c99 -S on x86-64).

        .file   "g.c"
        .p2align 4,,15
        .globl  is_whitespace
        .type   is_whitespace, @function
is_whitespace:
        movl    $1, %eax
        ret
        .size   is_whitespace, .-is_whitespace
        .ident  "GCC: (GNU) 4.8.2 20131212 (Red Hat 4.8.2-7)"
        .section        .note.GNU-stack,"",@progbits

What has happened is that the compiler has realized that when i == 3 the code makes an undefined memory read. From this the compiler infers that i must always remain strictly less than three during execution of the function and that, for this reason, the end condition is unreachable. It can be optimized away and, because this change means the loop can never run to completion, we also know that the function can only return 1, so the compiler can get rid of the loop itself as well.

As a further exercise imagine what happens if the loop contains a function call that is opaque to the optimizer (meaning that the compiler doesn’t know what side effects the function has and therefore cannot remove it). Now we still know that the end condition is unreachable but we need to call the function a data dependent number of times. Of course the compiler still believes the end condition is unreachable and can optimize it away. This results in an infinite loop!

If you’re currently busy moaning about arrogant out-of-touch compiler writers who don’t understand the real world then please stop. Compiler writers tend to dog food (by compiling their compilers using their own compilers) and are just as in touch with the real world as every other working programmer. Perhaps instead you could thank them for transforming a rare, hard to tickle bug that could easily slip into production, into a deterministic one that should be caught by even the most simple of tests. That’s much easier to debug.

Of course the other thing you get out of the dedication of your compiler writer is faster code. In a system where there is heavy inlining of defensively written code (for example the C++ standard library) then, in principle, throwing away unreachable loop end conditions could pay significant dividends.
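For completeness, one possible repair of the function is sketched below. It is written as C++ and is only one of several reasonable fixes. The substantive change is the loop bound: `<` instead of `<=`, with the element count held in a local constant (`count`, a name introduced here purely for illustration) so that the intent is explicit and the comparison is no longer between a signed and an unsigned value.

```cpp
#include <cassert>
#include <cstddef>

int is_whitespace(char c)
{
    const char lookup_table[] = { ' ', '\t', '\n' };
    // Number of elements in the table; there is no terminator to rely on.
    const std::size_t count = sizeof(lookup_table) / sizeof(lookup_table[0]);

    // Strictly less than: the last valid index is count - 1.
    for (std::size_t i = 0; i < count; i++)
        if (c == lookup_table[i])
            return 1;

    return 0;
}
```

With the bound corrected every read stays inside the array, so the optimizer has no undefined behaviour to reason from and the function returns 0 for characters that are not in the table.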


Use “#!/usr/bin/env hbcxx” to make C++ source code executable

I normally write some kind of personal toy during the holiday season. For example last year I wrote a toy fibre scheduler to go with a microcontroller project I was working on. This year however I’ve cooked up something and can’t quite decide if it’s a great idea, a pointless idea or a stupid idea. One thing is clear however: to find out which of the three possibilities it is, this bit of code needed to be packaged up properly as a product and shared with the wider world.

Basically hbcxx uses the Unix #!/path/to/interpreter technique to make C++ source code directly executable.

I’ve been taking a new look at C++. There is a palpable sense of “buzz” in the C++ community as they realize that, with C++11, they are sitting on something pretty special. The advocacy from the presenters at Going Native this year was remarkably effective (although if you take my advice you won’t watch Scott Meyers’ brilliant Effective C++14 Sampler until you know what std::move is for).

Quoting Bjarne Stroustrup: “Surprisingly, C++11 feels like a new language.”

Considering its source it is not at all surprising that this quote is absolutely on the money: modern C++, meaning C++11 or later, does feel like another language. This is not because the language has been changed massively but because the new features encourage a different, and slightly higher level, way to think about writing C++. It’s faster and more fun, supports lambdas, has tools to simplify memory management and includes regular expressions out-of-the-box.

I was actually pretty amazed to see regular expressions in the standard C++ libraries, so that coupled with humane memory management (albeit humanity where you have to explicitly opt-in) and the auto keyword really got me thinking differently about writing C++. auto even encouraged me to write a template (generic programming is so much easier when you don’t have to explicitly declare the type of every expression). All this and without losing type safety…

So my great/pointless/stupid idea (delete whichever is inappropriate) is a tool to keep things fast and fun by putting off the moment you have to write a build system and install script. For simple programs, especially for quick and dirty personal toys and scripts, the day you have to write a proper build system may never come. You no longer need the distraction of making a separate directory and a Makefile, and you’ll find that pkg-config tends to just work.

Instead, just copy your C++ source code into $HOME/bin. Try it. It works.

Features include:

  • Automatically uses ccache to reduce program startup times (for build avoidance).
  • Enables -std=c++11 by default.
  • Parses #include directives to automatically discover and compile other source code files.
  • Recognises the inclusion of boost header files and, where needed, automatically links the relevant boost library.
  • pkg-config integration.
  • Direct access to underlying compiler flags (-O3, -fsanitize=address, -g).
  • Honours the CXX environment variable to ensure clean integration with tools such as clang-analyzer’s scan-build.

To learn more about hbcxx take a look at:

Then have fun.



How C++11, threads and lambda capture come together.

Somehow when I first read about C++ lambda functions I overlooked the way in which they capture variables. What can I say, I’m pretty familiar with Python and just somehow expected variables to be automatically captured…

Be that as it may, I would like to present you with some broken code of the type I wrote when I first set out working on some of this stuff:

#include <atomic>
#include <iostream>
#include <thread>

using namespace std;

class threads_and_lambda_capture {
    thread worker;
    atomic<bool> running;

    void process()
    {
        while (running) {
            cout << "Process is running" << endl;
        }
    }

public:
    threads_and_lambda_capture()
        : worker{}
        , running{true}
    {
    }

    virtual ~threads_and_lambda_capture()
    {
        // With thanks to Scott Meyers for making this topic the
        // subject of a presentation... if you forget to stop a
        // joinable thread then the thread's destructor will
        // terminate the program.
        stop();
    }

    void start()
    {
        running = true;
        worker = thread{[]() { this->process(); }};  //  <== BUG!!!
    }

    void stop()
    {
        if (worker.joinable()) {
            running = false;
            worker.join();
        }
    }
};

Most of the class above is merely scaffolding to show you a little context. The focus of the rest of this post is exclusively the start() method:

void start()
{
    running = true;
    worker = thread{[]() { this->process(); }};  //  <== BUG!!!
}

The intent of the above code was to use a method belonging to the current class as the thread entry point. The bug in the above code is that the this pointer has not been captured by the lambda and is therefore undefined in the lambda’s body.

After a little time in the company of Google I replaced my code with the following working code and moved on:

void start()
{
    running = true;
    worker = thread{&threads_and_lambda_capture::process, this};
}

Somehow this irked me slightly. For me, the real value of modern C++ (including C++11 and the soon to be released C++14) is that it takes down the level of expertise required to utilize its power. For that reason the above just seemed too knowledge intensive to be modern C++.

Thankfully I was right! When I discovered, in a completely different context, more about lambda capture I revisited the code above (and decided to write a blog post about it).

void start()
{
    running = true;
    worker = std::thread{[this]() { this->process(); }};
    //                   ^^^^^^       ...and BUG is gone
}

To explain what’s happening here, lambdas in C++11 consist of the [] variable capture section, the optional () parameter list section and the {} body. With this added to the list of captured variables then the symbol in the lambda body resolves correctly. Note that I have captured this by value (making a copy of it) rather than by reference for the same reasons I would apply to method calls. As usual, objects that cannot (or should not) be copied are better captured by reference.
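The capture forms described above can be tried out in isolation. The following is a minimal sketch, with all function and class names invented for illustration, that contrasts capture by value, capture by reference and capture of this. Note that capturing this copies the pointer rather than the object, so the lambda still operates on the original object's members.

```cpp
#include <cassert>

// [x] copies x when the lambda is created; later changes to x are not seen.
int capture_by_value()
{
    int x = 1;
    auto read = [x]() { return x; };
    x = 2;
    return read();
}

// [&x] stores a reference; the lambda sees the current value of x.
int capture_by_reference()
{
    int x = 1;
    auto read = [&x]() { return x; };
    x = 2;
    return read();
}

class counter {
    int count = 0;

public:
    int bump_twice()
    {
        // [this] copies the pointer, not the object, so the lambda can
        // reach the members. The optional () parameter list is omitted.
        auto bump = [this] { ++count; };
        bump();
        bump();
        return count;
    }
};
```

Here capture_by_value() returns 1 and capture_by_reference() returns 2, while calling bump_twice() on a fresh counter returns 2 because both calls modify the same object.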

A short but information dense introduction to C++ lambdas can be found at: cppreference.com .

It is true that the above code is probably still not particularly pleasing to a novice (and I suspect it may also be slightly slower although I have not read the disassembly to check this). However the knowledge demonstrated is likely to be far more transferable to other problem domains than knowing the arcane behaviour that results from taking the address of member functions. So much so that a novice tutored in the modern way (meaning parts of STL are introduced on the first day) will probably already have come across the concept of variable capture long before multi-threading raises its head!

So… with my new found transferable knowledge I’m going to finish the lambda function I need to filter a list.


Measuring a ’12 Nexus 7’s microphone to speaker latency

The recent release of patchfield for Android made me wonder whether Android’s audio system has developed sufficiently over the last three years to be suitable for real time signal processing. There have been a couple of developers at Google working hard to reduce the output latency but their work does not yet encompass input latency.

However since patchfield allows the speaker to be digitally connected to the microphone it is easy to use patchfield itself to measure input to output latency of an Android system.

The following picture shows the test set up I used to measure it.


On the left is my 2012 Nexus 7. This is connected to the earphones sitting on top of the (red) sound card. Using earphones prevents feedback between the mic and speaker. Finally a microphone rests on top of the earphones. Critically the microphone picks up both the environment and the output of the earphones.

If you are particularly eagle eyed you might have noticed a few missing cables in the above picture. Specifically the earphones are not actually plugged into the Nexus and the microphone is not plugged into the soundcard! I haven’t faked the photo; however, I only decided to blog about it after I had started taking things apart again. In my haste to get a picture of it I forgot to plug it all back together.

Latency is measured simply by knocking on the table surface. This creates a short burst of noise that is captured both by the microphone in the picture and by the Nexus’ internal microphone. Patchfield captures the sound and replays it to the earphone which the microphone also picks up. The latency can then be calculated by looking at the waveform of the signals picked up by the microphone.
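I read the latency off the waveform by eye, but the same calculation could be automated. The sketch below is not the method used for these measurements; it is a hypothetical helper that assumes the recording is available as a vector of samples, that a simple amplitude threshold is enough to detect the knock and its replay, and that a hold-off (in samples) is long enough to skip the tail of the first burst. The names find_onset, latency_ms, threshold and holdoff are all invented for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Index of the first sample at or after `from` whose magnitude reaches
// `threshold`, or samples.size() if there is no such sample.
std::size_t find_onset(const std::vector<float>& samples,
                       std::size_t from, float threshold)
{
    for (std::size_t i = from; i < samples.size(); ++i)
        if (std::fabs(samples[i]) >= threshold)
            return i;
    return samples.size();
}

// Estimated delay in milliseconds between the knock and its replay.
// Returns a negative value if two separate bursts cannot be found.
double latency_ms(const std::vector<float>& samples, double sample_rate_hz,
                  float threshold, std::size_t holdoff)
{
    assert(sample_rate_hz > 0.0);

    const std::size_t first = find_onset(samples, 0, threshold);
    if (first == samples.size())
        return -1.0;

    // Skip `holdoff` samples so the tail of the knock is not mistaken
    // for the replayed copy.
    const std::size_t second = find_onset(samples, first + holdoff, threshold);
    if (second >= samples.size())
        return -1.0;

    return 1000.0 * static_cast<double>(second - first) / sample_rate_hz;
}
```

For example, with a recording sampled at 1000 Hz in which the knock starts at sample 100 and the replay at sample 232, latency_ms reports 132 ms. Real recordings are noisier than this, so the threshold and hold-off would need tuning by hand.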

And the results are… 132ms from input to output. Just to put this number into perspective a piano, which is one of the highest latency musical instruments, takes about 25ms from the start of a keypress to the hammer striking a string. Another way to express it is that it is way too much for any musical application. The brain doesn’t just hear it as an echo… it hears it as a LOOONNNGG echo!

I repeated the same test on my Nexus 4 and that was better but not by enough to make much difference. That “only” took 124ms.

I guess there remain lots of very good reasons why iRig is not available for Android. There is one remaining ray of hope: the latest Samsung Mobile SDK includes a professional audio API based on jack and dedicated to handling low latency. It looks like Samsung really do want to take the fight to Apple.