Exploring Startup Implementations: Newlib (ARM)

For most programmers, a C or C++ program's life begins at the main function. They are blissfully unaware of the hidden steps that happen between invoking a program and executing main. Depending on the program and the compiler, there are all kinds of interesting functions that get run before main, automatically inserted by the compiler and linker and invisible to casual observers.

Unfortunately for programmers who are curious about the program startup process, the literature on what happens before main is quite sparse.

Embedded Artistry has been hard at work creating a C++ embedded framework. The final piece of the puzzle was implementing program startup code. To aid in the design of our framework's boot process, I performed an exploratory survey of existing program startup implementations. My goal is to identify a general program startup model. I also want to provide a more comprehensive look into how our programs get to main.

In this six-part series, we will be investigating what it takes to get to main:

  1. A General Overview of What Happens Before main()
  2. Exploring Startup Implementations: Newlib (ARM)
  3. Exploring Startup Implementations: OS X
  4. Exploring Startup Implementations: Custom Embedded System with ThreadX
  5. Abstracting a Generic Flow for Getting to main
  6. Implementing our Generic Startup Flow

Now that we have a high-level understanding of how our programs get to main, we can explore real-world implementations of program startup code.

Today's analysis focuses on Newlib. If you build embedded applications for ARM using the GNU arm-none-eabi toolchain, your program is linked with Newlib startup code by default. Newlib supports multiple architectures, but we will focus exclusively on the ARM startup path.

If you are interested in exploring Newlib startup routines on your own, you can download the Newlib source code or browse the source code online.

The boot flow is quite complicated, and it's easy to get mentally lost. You can refer to the Visual Summary throughout the article for a visual representation of the startup procedure and call stack.

Table of Contents:

  1. ARM Procedure Call Standard
  2. System Configuration
  3. Initial Exploration
    1. Boot Path
    2. _start Disassembly
  4. nRF52 Initial Boot
    1. Load from Flash to RAM
    2. Optional: Clear .bss
    3. SystemInit
    4. Call _start
    5. IRQ Handlers
  5. nRF52 System Initialization
  6. Newlib ARM Startup
    1. crt0.s
      1. Stack Setup
      2. Initialize .bss
      3. Target-Specific Initialization
      4. argc and argv Initialization
      5. Call Global Constructors
    2. __libc_init_array
    3. __libc_fini_array
    4. Heap Limit and malloc
    5. atexit Family
      1. atexit
      2. __cxa_atexit
      3. __register_exitproc
      4. Automatic Registration of Destructors
    6. exit Family
      1. exit
      2. _exit
      3. __call_exitprocs
      4. _kill
  7. Visual Summary
  8. Startup Activity Checklist
  9. Further Reading

ARM Procedure Call Standard

Since we are going to look at ARM assembly, we will need to familiarize ourselves with the basics of the Procedure Call Standard for ARM Applications.

There are sixteen 32-bit registers and a status register (CPSR) in the ARM and Thumb instruction sets:

  • r0 (aka a1) is Argument register 1 and a result register
  • r1 (aka a2) is Argument register 2 and a result register
  • r2 (aka a3) is Argument register 3
  • r3 (aka a4) is Argument register 4
  • r4 (aka v1) is Variable register 1
  • r5 (aka v2) is Variable register 2
  • r6 (aka v3) is Variable register 3
  • r7 (aka v4) is Variable register 4
  • r8 (aka v5) is Variable register 5
  • r9 usage changes depending on the platform
  • r10 (aka v7) is Variable register 7
  • r11 (aka v8) is Variable register 8
  • r12 is the IP special purpose register (intra-procedure-call scratch register)
  • r13 is the SP special register (stack pointer)
  • r14 is the LR special register (link register)
  • r15 is the PC special register (program counter)

The standard says the following for the argument registers (r0-r3):

The first four registers r0-r3 (a1-a4) are used to pass argument values into a subroutine and to return a result value from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls).

We have multiple registers to hold the value of local variables:

Typically, the registers r4-r8, r10 and r11 (v1-v5, v7 and v8) are used to hold the values of a routine’s local variables. Of these, only v1-v4 can be used uniformly by the whole Thumb instruction set, but the AAPCS does not require that Thumb code only use those registers.

We must preserve specific registers when calling functions:

A subroutine must preserve the contents of the registers r4-r8, r10, r11 and SP (and r9 in PCS variants that designate r9 as v6)

ARM specifies that the stack pointer (SP) must always be aligned to a word boundary (i.e., sp % 4 == 0). For public interfaces, the stack must be aligned to a double-word boundary (i.e., sp % 8 == 0).
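Both alignment rules reduce to simple mask operations. Here is a minimal C sketch; the helper names is_aligned and align_down are hypothetical and are not part of the AAPCS or any library:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; `alignment` must be a power of two. */
static bool is_aligned(uint32_t sp, uint32_t alignment)
{
    return (sp & (alignment - 1u)) == 0u;
}

/* The stack grows toward lower addresses, so we round down
 * to reach the next valid stack pointer value. */
static uint32_t align_down(uint32_t sp, uint32_t alignment)
{
    return sp & ~(alignment - 1u);
}
```

For example, align_down(0x2003FFFD, 8) yields 0x2003FFF8, which satisfies both the word and the double-word requirement.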

The least significant bit of a function address is an ARM/Thumb flag (0 == ARM, 1 == Thumb). This bit is set by the linker.

When we want to call a subroutine, we need to preserve the current function's persistent registers on the stack, store the return address in the LR register (so we know how to get back from our function), and change the PC to the subroutine address. ARM provides branching instructions which handle this process for us (e.g., bl, blx, bx), although the process may still be performed manually.

Now, there are many details that we did not cover, but this basic overview provides enough details to understand some of the assembly that we will be analyzing. Particularly important to keep in mind: values put into r0-r3 represent arguments to functions, and values put into r4-r11 represent variables used in our current subroutine.

System Configuration

For this exploration, I used a Nordic nRF52840 Development Kit. The development kit has several examples provided by Nordic; I used the blinky program. I compiled and linked the program with the GNU ARM toolchain (version 8-2018-q4-major). The Nordic blinky program links against the Newlib libraries provided by the GNU ARM toolchain.

Because this is a Cortex-M processor, the program is compiled entirely in Thumb mode. We will discuss some aspects of the boot process which apply to Cortex-A processors that use ARM instructions.

Initial Exploration

Before we start blindly looking through the Newlib code base, we should do some initial exploration with our debugger as described in the last article.

To begin the investigation, I compiled the blinky example for the nRF52840 Development Kit (PCA10056 in the SDK parlance) in the "blank" configuration using the armgcc Makefile. I flashed the binary to the board with the nRF Connect Programmer.

First, let's start with a backtrace from main in an example program so we can see what code is run. Then we will look at the disassembly for the _start function that is provided by Newlib.

Boot Path

To investigate the path our program takes to get to main, we'll use gdb. The nRF52 DK has a USB connection with an on-board debugging chip. I fired up a J-Link gdb server and connected to my board using arm-none-eabi-gdb.

Once the board is connected, we load the symbols for our application:

(gdb) file _build/nrf52840_xxaa.out
A program is being debugged already.
Are you sure you want to change the file? (y or n) y
Reading symbols from _build/nrf52840_xxaa.out...

Set the breakpoint for main:

(gdb) b main
Breakpoint 1 at 0x380: file ../../../main.c, line 62.

Enable backtraces to extend past main:

(gdb) set backtrace past-main on

Then restart and run the program:

(gdb) mon reset
Resetting target
(gdb) c

Breakpoint 1, main () at ../../../main.c:62
62      bsp_board_init(BSP_INIT_LEDS);

Our initial backtrace shows a corrupt frame prior to _start:

(gdb) bt
#0  main () at ../../../main.c:62
#1  0x0000028e in _start ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

This can happen when the _start routine is messing with stacks or frame pointers to set up the program according to the library and ABI requirements. We can confirm this by setting a breakpoint at _start and re-starting the program. This will allow us to look at the state of the program before stack modifications.

(gdb) b _start
Breakpoint 2 at 0x258
(gdb) mon reset
Resetting target
(gdb) c

Breakpoint 2, 0x00000258 in _start ()
(gdb) bt
#0  0x00000258 in _start ()
#1  0x000002ce in Reset_Handler () at ../../../../../../modules/nrfx/mdk/gcc_startup_nrf52840.S:280

Our program receives control at the Reset_Handler function in our processor's startup code. This is expected for an embedded platform, since the processor loads our program from memory and begins execution at the reset vector address.

Now we know that there are two areas to investigate for startup, and gdb helpfully provided the path to the gcc_startup_nrf52840.S file, which is where our investigation of the source code will begin.

_start Disassembly

Before we dive into the source code, let's look at the disassembly for the _start function with gdb.

(gdb) disass /m _start
Dump of assembler code for function _start:
0x00001240 <+0>: ldr r3, [pc, #84] ; (0x1298 <_start+88>) 
0x00001242 <+2>: cmp r3, #0 
0x00001244 <+4>: it eq 
0x00001246 <+6>: ldreq r3, [pc, #76] ; (0x1294 <_start+84>) 
0x00001248 <+8>: mov sp, r3 
0x0000124a <+10>: sub.w r10, r3, #65536 ; 0x10000 
0x0000124e <+14>: movs r1, #0 
0x00001250 <+16>: mov r11, r1 
0x00001252 <+18>: mov r7, r1 
0x00001254 <+20>: ldr r0, [pc, #76] ; (0x12a4 <_start+100>) 
0x00001256 <+22>: ldr r2, [pc, #80] ; (0x12a8 <_start+104>) 
0x00001258 <+24>: subs r2, r2, r0 
0x0000125a <+26>: bl 0x330c <memset> 
0x0000125e <+30>: ldr r3, [pc, #60] ; (0x129c <_start+92>) 
0x00001260 <+32>: cmp r3, #0 
0x00001262 <+34>: beq.n 0x1266 <_start+38> 
0x00001264 <+36>: blx r3 
0x00001266 <+38>: ldr r3, [pc, #56] ; (0x12a0 <_start+96>) 
0x00001268 <+40>: cmp r3, #0 
0x0000126a <+42>: beq.n 0x126e <_start+46> 
0x0000126c <+44>: blx r3 
0x0000126e <+46>: movs r0, #0 
0x00001270 <+48>: movs r1, #0 
0x00001272 <+50>: movs r4, r0 
0x00001274 <+52>: movs r5, r1 
0x00001276 <+54>: ldr r0, [pc, #52] ; (0x12ac <_start+108>) 
0x00001278 <+56>: cmp r0, #0 
0x0000127a <+58>: beq.n 0x1282 <_start+66> 
0x0000127c <+60>: ldr r0, [pc, #48] ; (0x12b0 <_start+112>) 
0x0000127e <+62>: nop.w 
0x00001282 <+66>: bl 0x32b4 <__libc_init_array> 
0x00001286 <+70>: movs r0, r4 
0x00001288 <+72>: movs r1, r5 
0x0000128a <+74>: bl 0x1554 <main()> 
0x0000128e <+78>: bl 0x3268 <exit> 
0x00001292 <+82>: nop 
0x00001294 <+84>: movs r0, r0 
0x00001296 <+86>: movs r0, r1 
0x00001298 <+88>: movs r0, r0 
0x0000129a <+90>: movs r0, #4 
0x0000129c <+92>: movs r0, r0 
0x0000129e <+94>: movs r0, r0 
0x000012a0 <+96>: movs r0, r0 
0x000012a2 <+98>: movs r0, r0 
0x000012a4 <+100>: lsls r0, r4, #3 
0x000012a6 <+102>: movs r0, #0 
0x000012a8 <+104>: lsls r4, r4, #10 
0x000012aa <+106>: movs r0, #0 
0x000012ac <+108>: movs r0, r0 
0x000012ae <+110>: movs r0, r0 
0x000012b0 <+112>: movs r0, r0 
0x000012b2 <+114>: movs r0, r0

Disassembly Highlights

We won't reconstruct the entire process from disassembly, but we can quickly note some highlights.

First, the routine sets up the stack pointer using the r3 register:

0x00001248 <+8>: mov sp, r3

The Newlib _start function handles initializing the .bss section contents (which holds uninitialized global and static data) to 0. Note the call to memset: r1 holds the fill value (0); r0 holds the start address of the .bss section; r2 is loaded with the end address of the .bss section, and then the start address is subtracted from it, giving us the size of the section.

0x0000124e <+14>: movs r1, #0 
0x00001254 <+20>: ldr r0, [pc, #76] ; (0x12a4 <_start+100>) 
0x00001256 <+22>: ldr r2, [pc, #80] ; (0x12a8 <_start+104>) 
0x00001258 <+24>: subs r2, r2, r0 
0x0000125a <+26>: bl 0x330c <memset>

From the disassembly, I don't immediately understand what's happening after memset, but I do notice some function calls (blx instructions). I'm also guessing that _start initializes argc and argv to 0, then preserves those in r4-r5. Looking at the commented and non-optimized source will clarify this part of the process.

I do recognize the next function call, which is conveniently named. This call will initialize the global constructors:

0x00001282 <+66>: bl 0x32b4 <__libc_init_array>

After we've called the global constructors, we put the (presumed) argc and argv values into our argument registers, and then call main:

0x00001286 <+70>: movs r0, r4 
0x00001288 <+72>: movs r1, r5 
0x0000128a <+74>: bl 0x1554 <main()>

Since the r0 register holds the value that main returns, we can invoke exit without needing to modify the argument registers:

0x0000128e <+78>: bl 0x3268 <exit>

The assembly instructions following exit are a mystery to me from this view. Let's see what the source investigation reveals.
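One plausible reading, offered as an assumption rather than something gdb confirms: the ldr instructions earlier in the listing load pc-relative values from 0x1294 through 0x12b0, so these bytes may be 32-bit data words that the disassembler decoded as 16-bit Thumb instructions. The sketch below pairs halfwords back into little-endian words. The halfword encodings are hand-assembled from the mnemonics in the listing, and the helper name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: join two consecutive 16-bit Thumb halfwords into the
 * 32-bit little-endian word that `ldr rX, [pc, #imm]` would load. */
static uint32_t literal_word(uint16_t low_half, uint16_t high_half)
{
    return ((uint32_t)high_half << 16) | low_half;
}

/* Encodings hand-assembled from the listing (assumed, not read from the target):
 *   movs r0, r0     -> 0x0000      movs r0, r1      -> 0x0008
 *   movs r0, #0     -> 0x2000      movs r0, #4      -> 0x2004
 *   lsls r0, r4, #3 -> 0x00E0      lsls r4, r4, #10 -> 0x02A4
 */
static uint32_t word_at_1294(void) { return literal_word(0x0000u, 0x0008u); }
static uint32_t word_at_1298(void) { return literal_word(0x0000u, 0x2004u); }
static uint32_t word_at_12a4(void) { return literal_word(0x00E0u, 0x2000u); }
static uint32_t word_at_12a8(void) { return literal_word(0x02A4u, 0x2000u); }
```

Under that assumption, the word at 0x1298 is 0x20040000 and the word at 0x1294 is 0x00080000, the two candidates loaded into r3 before mov sp, r3. The words at 0x12a4 and 0x12a8 come out as 0x200000E0 and 0x200002A4, which would make the memset length 0x1C4 bytes.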

nRF52840 Boot

Our backtrace showed us that our journey begins in the Reset_Handler function in gcc_startup_nrf52840.S (found in the nRF SDK).

The file begins by providing for stack storage:

    .section .stack
#if defined(__STARTUP_CONFIG)
    .equ    Stack_Size, __STARTUP_CONFIG_STACK_SIZE
#elif defined(__STACK_SIZE)
    .align 3
    .equ    Stack_Size, __STACK_SIZE
#else
    .align 3
    .equ    Stack_Size, 8192
#endif
    .globl __StackTop
    .globl __StackLimit
__StackLimit:
    .space Stack_Size
    .size __StackLimit, . - __StackLimit
__StackTop:
    .size __StackTop, . - __StackTop

There are also provisions for heap storage:

    .section .heap
    .align 3
#if defined(__STARTUP_CONFIG)
    .equ Heap_Size, __STARTUP_CONFIG_HEAP_SIZE
#elif defined(__HEAP_SIZE)
    .equ Heap_Size, __HEAP_SIZE
#else
    .equ Heap_Size, 8192
#endif
    .globl __HeapBase
    .globl __HeapLimit
__HeapBase:
    .if Heap_Size
    .space Heap_Size
    .endif
    .size __HeapBase, . - __HeapBase
__HeapLimit:
    .size __HeapLimit, . - __HeapLimit

This file also contains a declaration of all interrupt vectors and their associated handlers. A small sample is shown:

    .section .isr_vector
    .align 2
    .globl __isr_vector
__isr_vector:
    .long   __StackTop                  /* Top of Stack */
    .long   Reset_Handler
    .long   NMI_Handler
    .long   HardFault_Handler
    .long   MemoryManagement_Handler
    .long   BusFault_Handler
    .long   UsageFault_Handler

    /// ...

    .size __isr_vector, . - __isr_vector

We then find the declaration of Reset_Handler:

    .align 1
    .globl Reset_Handler
    .type Reset_Handler, %function
Reset_Handler:

Load from Flash to RAM

First, the reset handler copies data from flash to RAM.

The data is copied from the address of the __etext symbol, which represents the end of the .text section in flash storage. The data is copied to the address indicated by the __data_start__ symbol, and the number of bytes copied is calculated by subtracting the __data_start__ address from __bss_start__, which indicates the beginning of the next section. As the nRF startup code explains, __bss_start__ is used so users can insert their own initialized data section before the .bss section. Using this logic, it will be copied to RAM without any changes from the user.

    ldr r1, =__etext
    ldr r2, =__data_start__
    ldr r3, =__bss_start__

    subs r3, r3, r2
    ble .L_loop1_done

.L_loop1:
    subs r3, r3, #4
    ldr r0, [r1,r3]
    str r0, [r2,r3]
    bgt .L_loop1

.L_loop1_done:
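The copy loop can be read as the following C sketch. The function and parameter names are illustrative only: src plays the role of __etext, dst plays __data_start__, and size_bytes stands for __bss_start__ minus __data_start__. Like the assembly, it assumes the size is a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative C rendering of the flash-to-RAM copy loop.
 * Words are copied from the highest offset down to offset 0. */
static void copy_data_words(const uint32_t *src, uint32_t *dst, ptrdiff_t size_bytes)
{
    ptrdiff_t offset = size_bytes;          /* subs r3, r3, r2 */
    if (offset <= 0) {                      /* ble .L_loop1_done */
        return;
    }
    do {
        offset -= 4;                        /* subs r3, r3, #4 */
        dst[offset / 4] = src[offset / 4];  /* ldr r0, [r1,r3] / str r0, [r2,r3] */
    } while (offset > 0);                   /* bgt .L_loop1 */
}
```

The ldr and str instructions do not modify the condition flags, so the bgt at the bottom of the loop still tests the result of the subs that decremented the offset.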

Optional: Clear .bss

Once the .data section contents are copied to RAM, there is an optional step for initializing the .bss section contents to 0. In our case, this code is not compiled. Newlib handles .bss initialization.

#ifdef __STARTUP_CLEAR_BSS
/* This part of work usually is done in C library startup code. Otherwise,
 * define __STARTUP_CLEAR_BSS to enable it in this startup. This section
 * clears the RAM where BSS data is located.
 * The BSS section is specified by following symbols
 *    __bss_start__: start of the BSS section.
 *    __bss_end__: end of the BSS section.
 * All addresses must be aligned to 4 bytes boundary.
 */
    ldr r1, =__bss_start__
    ldr r2, =__bss_end__

    movs r0, 0

    subs r2, r2, r1
    ble .L_loop3_done

.L_loop3:
    subs r2, r2, #4
    str r0, [r1, r2]
    bgt .L_loop3

.L_loop3_done:
#endif /* __STARTUP_CLEAR_BSS */


SystemInit

Before invoking the C runtime startup routine, a SystemInit function is called. This function, which we will look at next, is responsible for initializing the processor and applying behavioral fixes for relevant errata.

bl SystemInit

Call _start

Once the processor is initialized, we call the _start function to initialize the C runtime. Note that the nRF startup code allows you to define a custom entry point with a compiler definition.

/* Call _start function provided by libraries.  If those libraries 
 * are not accessible, define __START as your entry point. */
#ifndef __START
#define __START _start
#endif
    bl __START

IRQ Handlers

The gcc_startup_nrf52840.S also contains dummy exception handler function definitions. For example:

    .weak   NMI_Handler
    .type   NMI_Handler, %function
NMI_Handler:
    b       .
    .size   NMI_Handler, . - NMI_Handler

    .weak   HardFault_Handler
    .type   HardFault_Handler, %function
HardFault_Handler:
    b       .
    .size   HardFault_Handler, . - HardFault_Handler

A default handler is declared, which performs an infinite loop:

    .globl  Default_Handler
    .type   Default_Handler, %function
Default_Handler:
    b       .
    .size   Default_Handler, . - Default_Handler

All other IRQ handlers are mapped to this default handler. Because the symbols are weak, users can override these handlers with their own implementations as needed.

    .macro  IRQ handler
    .weak   \handler
    .set    \handler, Default_Handler
    .endm


/// ...
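For readers more comfortable in C, the weak-alias pattern can be approximated with compiler attributes. This is a sketch under stated assumptions: it relies on the GCC/Clang weak and alias extensions on an ELF target, the handler names are examples, and the stand-in default handler counts calls instead of spinning so that it can run on a host machine.

```c
/* GCC/Clang on ELF targets only: `weak` and `alias` are compiler extensions. */
static volatile unsigned int default_hits;

/* Stand-in for the real Default_Handler, which spins forever (b .).
 * Counting instead of looping keeps the sketch testable on a host. */
void Default_Handler(void)
{
    default_hits++;
}

/* Rough C equivalent of the IRQ macro: a weak symbol that resolves to
 * Default_Handler unless another object file provides a strong definition. */
#define IRQ(handler) \
    void handler(void) __attribute__((weak, alias("Default_Handler")))

IRQ(TIMER0_IRQHandler);   /* example handler names */
IRQ(RTC0_IRQHandler);
```

If an application defines its own TIMER0_IRQHandler in another source file, the linker picks that strong definition and the weak alias is discarded.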

After the IRQ handlers are supplied, the file ends.


nRF52 System Initialization

The SystemInit function is implemented in system_nrf52840.c (found in the nRF SDK). For a normal application, this file would be modified to suit the platform's requirements. We'll look at the default implementation for our processor.

First, SWO trace functionality is enabled in the processor. If ENABLE_SWO is not defined, the pin is left as normal GPIO.

#if defined (ENABLE_SWO)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
#endif

Next, Trace functionality is enabled in the processor. If ENABLE_TRACE is not defined, the pins are left as normal GPIO.

#if defined (ENABLE_TRACE)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;

// ... more pin configurations in the actual implementation
#endif


Following debug configuration, the system checks for a variety of errata conditions and applies fixes as necessary. Here are a few examples:

/* Workaround for Errata 98 "NFCT: Not able to communicate with the peer"  */
if (errata_98()){
    *(volatile uint32_t *)0x4000568Cul = 0x00038148ul;
}

/* Workaround for Errata 103 "CCM: Wrong reset value of CCM MAXPACKETSIZE"  */
if (errata_103()){
    // ...
}

Following the errata section, the FPU is initialized if the program has been compiled with floating point support. The __FPU_USED macro is supplied by the compiler.

#if (__FPU_USED == 1)
    SCB->CPACR |= (3UL << 20) | (3UL << 22);
#endif
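To unpack the constant in that statement: the Coprocessor Access Control Register holds two access bits per coprocessor, and the FPU is exposed as coprocessors CP10 and CP11. The sketch below works on a plain integer rather than SCB->CPACR so that it can run on a host; the macro and function names are hypothetical.

```c
#include <stdint.h>

/* CP10 uses CPACR bits 21:20 and CP11 uses bits 23:22.
 * A field value of 3 (0b11) grants full access. */
#define CPACR_CP10_FULL (3UL << 20)
#define CPACR_CP11_FULL (3UL << 22)

/* Hypothetical helper: returns the register value with the FPU enabled,
 * leaving all other bits untouched (the same effect as the |= above). */
static uint32_t cpacr_with_fpu_enabled(uint32_t cpacr)
{
    return cpacr | (uint32_t)(CPACR_CP10_FULL | CPACR_CP11_FULL);
}
```

Starting from a register value of 0, the result is 0x00F00000.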

If NFC is not used for an nRF52 platform, the associated NFC pins are configured as normal GPIO.

        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}

The nRF allows a GPIO to be configured as a reset pin. If CONFIG_GPIO_AS_PINRESET is defined, a dedicated GPIO will be configured to act as a reset pin.

        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->PSELRESET[0] = 18;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->PSELRESET[1] = 18;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}

Finally, the system clock is initialized.


Newlib ARM Startup

After data has been relocated and the processor properly initialized, the reset handler calls the _start function. For our GCC ARM application, this function is supplied by Newlib.

The Newlib project is divided into two major parts: newlib and libgloss. The newlib portion is an implementation of libc and libm. The libgloss portion contains platform-specific code, such as startup files, board support packages, and I/O support for the C library.

When exploring the Newlib code base on your own, it is important to note the distinction between libgloss and newlib. The libgloss division happened after the inception of the Newlib project. Many of the same files are found in the newlib folder and the libgloss folder. For platform-specific code, you should prefer the libgloss implementations. These are newer, and the older implementations remain in the newlib folder for backwards compatibility with older targets.


crt0.s

The _start function for the ARM architecture is found in libgloss/arm/crt0.S.

The _start function is quite lengthy, so I will be providing highlights of the full implementation. The startup code presented below has also been simplified from the code found in crt0.S. The full implementation supports semi-hosting, where a debugger handles parts of the standard library functionality. I've removed the monitor-related code to simplify our current review.

Newlib implements a single runtime that supports both ARM and Thumb modes. This can be confusing, since not all operations apply to both modes. Because we are using a Cortex-M processor (the nRF52), the program is compiled entirely in Thumb mode. Some startup code only applies when ARM mode is enabled, and I will highlight this as best as I can.

The file opens with preprocessor definitions, logic for selecting the proper ARM/Thumb architecture, and a declaration of the _start function. The most important preprocessor entry for our current exploration is the HAVE_INITFINI_ARRAY selection logic.

#ifdef HAVE_INITFINI_ARRAY
#define _init   __libc_init_array
#define _fini   __libc_fini_array
#endif

When HAVE_INITFINI_ARRAY is defined, the _init and _fini function calls will be exchanged with __libc_init_array and __libc_fini_array respectively. This macro comes into play for our build, because our ARM program uses the .init_array and .fini_array sections.

We should also note an assembly macro which we will encounter in the startup code: indirect_call.

.macro indirect_call reg
#ifdef HAVE_CALL_INDIRECT
    blx \reg
#else
    mov lr, pc
    mov pc, \reg
#endif
.endm

The indirect_call is used to mimic blx behavior for architectures that do not support that instruction, as described in the summary of the ARM Procedure Call Standard.

We eventually reach the proper beginning of the _start function, which is aliased as _mainCRTStartup:

    FUNC_START  _mainCRTStartup
    FUNC_START  _start
#if defined(__ELF__) && !defined(__USING_SJLJ_EXCEPTIONS__)
    /* Annotation for EABI unwinding tables.  */
    .fnstart
#endif

Stack Setup

The first order of business is to set up the stacks for the various ARM processor modes.

The linker script may provide the stack address with the __stack symbol, which is then made accessible to the assembly via .Lstack:

.Lstack:
    .word   __stack

The stack address is loaded and checked to make sure it is a non-zero value:

ldr r3, .Lstack
cmp r3, #0

If the __stack symbol is not defined, the alternate value provided in the .LC0 variable is used instead:

#ifdef __thumb2__
    it  eq
#endif
#ifdef THUMB1_ONLY
    bne .LC28
    ldr r3, .LC0
.LC28:
#else
    ldreq   r3, .LC0
#endif

Once the stack address is loaded into r3, we work through the various processor modes and set up stacks and stack limits. This operation only applies to programs compiled in ARM mode, because Thumb-only Cortex-M processors do not have these processor modes.

If the processor is already operating in user mode, or if Thumb mode is being used, this section is skipped. Our Cortex-M-based nRF52 only uses Thumb mode, so this section is skipped.

/* Note: This 'mov' is essential when starting in User, and ensures we
         always get *some* sp value for the initial mode, even if we
         have somehow missed it below (in which case it gets the same
         value as FIQ - not ideal, but better than nothing.) */
    mov sp, r3
    /* XXX Fill in stack assignments for interrupt modes.  */
    mrs r2, CPSR
    tst r2, #0x0F   /* Test mode bits - in User of all are 0 */
    beq .LC23       /* "eq" means r2 AND #0x0F is 0 */
    msr     CPSR_c, #0xD1   /* FIRQ mode, interrupts disabled */
    mov     sp, r3
    sub sl, sp, #0x1000 /* This mode also has its own sl (see below) */

    mov r3, sl
    msr     CPSR_c, #0xD7   /* Abort mode, interrupts disabled */
    mov sp, r3
    sub r3, r3, #0x1000

    msr     CPSR_c, #0xDB   /* Undefined mode, interrupts disabled */
    mov sp, r3
    sub r3, r3, #0x1000

    msr     CPSR_c, #0xD2   /* IRQ mode, interrupts disabled */
    mov sp, r3
    sub r3, r3, #0x2000

    msr     CPSR_c, #0xD3   /* Supervisory mode, interrupts disabled */

    mov sp, r3
    sub r3, r3, #0x8000 /* Min size 32k */
    bic r3, r3, #0x00FF /* Align with current 64k block */
    bic r3, r3, #0xFF00

    str r3, [r3, #-4]   /* Move value into user mode sp without */
    ldmdb   r3, {sp}^       /* changing modes, via '^' form of ldm */
    orr r2, r2, #0xC0   /* Back to original mode, presumably SVC, */
    msr CPSR_c, r2  /* with FIQ/IRQ disable bits forced to 1 */
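The arithmetic in this ARM-mode path is easier to follow when written out in C. The sketch below mirrors the sequence of subtractions to show where each mode's stack top lands; the struct and function names are hypothetical, and nothing here touches real processor state.

```c
#include <stdint.h>

/* Stack tops per processor mode, in the order the startup code assigns them. */
struct mode_stacks {
    uint32_t fiq;
    uint32_t abort_mode;
    uint32_t undefined;
    uint32_t irq;
    uint32_t supervisor;
    uint32_t user;
};

/* Hypothetical helper that replays the r3 arithmetic from the listing. */
static struct mode_stacks carve_mode_stacks(uint32_t stack_top)
{
    struct mode_stacks s;
    uint32_t r3 = stack_top;

    s.fiq = r3;             /* FIQ: mov sp, r3 */
    r3 -= 0x1000u;          /* sub sl, sp, #0x1000 ; mov r3, sl */
    s.abort_mode = r3;      /* Abort: mov sp, r3 */
    r3 -= 0x1000u;
    s.undefined = r3;       /* Undefined: mov sp, r3 */
    r3 -= 0x1000u;
    s.irq = r3;             /* IRQ: mov sp, r3 */
    r3 -= 0x2000u;
    s.supervisor = r3;      /* Supervisor: mov sp, r3 */
    r3 -= 0x8000u;          /* Min size 32k */
    r3 &= ~0xFFFFu;         /* the two bic instructions: align to a 64k block */
    s.user = r3;            /* written to the user-mode sp via ldmdb {sp}^ */
    return s;
}
```

With an arbitrary example input of 0x80000, the FIQ, Abort, and Undefined stacks each get 4 kB, the IRQ stack gets 8 kB, the Supervisor stack starts at 0x7B000, and the user stack starts at 0x70000.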

Note that setting up each mode is currently not performed for Thumb code. Only the user mode stack is initialized for Thumb programs. That's why we did not observe this setup code in our disassembly of _start.

The last portion of the stack setup process puts an arbitrary stack limit in place. Unlike the __stack definition which is provided by the linker, the stack limit is an arbitrarily decided value of 64kB. This may be problematic if we have a larger stack or if the stack runs into the heap.

#ifdef THUMB1_ONLY
    movs    r2, #64
    lsls    r2, r2, #10
    subs    r2, r3, r2
    mov sl, r2
#else
    sub sl, r3, #64 << 10   /* Still assumes 256bytes below sl */
#endif

Initialize .bss

Once our stack is set up, the .bss section is cleared. The .bss section start and end addresses are made available through the .LC1 and .LC2 variables:

.LC1:
    .word   __bss_start__
.LC2:
    .word   __bss_end__

The arguments to memset are loaded into registers, and the size is calculated:

/* Zero the memory in the .bss section.  */
    movs    a2, #0          /* Second arg: fill value */
    mov fp, a2          /* Null frame pointer */
    mov r7, a2          /* Null frame pointer for Thumb */

    ldr a1, .LC1        /* First arg: start of memory block */
    ldr a3, .LC2
    subs    a3, a3, a1      /* Third arg: length of block */

Once the arguments are loaded, we call memset (and switch to Thumb mode if appropriate):

#if __thumb__ && !defined(PREFER_THUMB)
    /* Enter Thumb mode.... */
    add a4, pc, #1  /* Get the address of the Thumb block */
    bx  a4      /* Go there and start Thumb decoding  */

    .code 16
    .global __change_mode
    .thumb_func
__change_mode:
#endif

    bl  FUNCTION (memset)
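In C terms, the register setup and call amount to the following sketch. The function name is hypothetical, and the two pointers stand in for the linker-provided __bss_start__ and __bss_end__ symbols.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative equivalent of the .bss clearing step:
 * a1 = start, a2 = 0, a3 = end - start. */
static void zero_region(uint8_t *start, uint8_t *end)
{
    memset(start, 0, (size_t)(end - start));
}
```

Because the length is computed from the two symbols, the linker script alone decides how much memory is cleared; the startup code needs no knowledge of the program's variables.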

Target-Specific Initialization

Once the .bss section is cleared, optional target-specific early initialization is performed.

The startup code supports two weakly-linked functions:

.weak FUNCTION (hardware_init_hook)
    .weak FUNCTION (software_init_hook)

They are weakly-linked because they are optional. If a platform does not require this functionality, the functions will not be defined and a value of 0 will be loaded for the variable. These functions are made available via the .Lhwinit and .Lswinit variables:

.Lhwinit:
    .word   FUNCTION (hardware_init_hook)
.Lswinit:
    .word   FUNCTION (software_init_hook)

The startup code checks whether these functions are defined, and calls them if they are.

    ldr r3, .Lhwinit
    cmp r3, #0
    beq .LC24
    indirect_call r3
.LC24:
    ldr r3, .Lswinit
    cmp r3, #0
    beq .LC25
    indirect_call r3
.LC25:
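The check-then-call pattern can be modeled in C with function pointers that may be null. This is a sketch rather than the Newlib implementation: the parameters stand in for the .Lhwinit and .Lswinit words, and sample_hook is an example hook that only exists so the sketch has something to call.

```c
#include <stddef.h>

typedef void (*init_hook_fn)(void);

/* Each argument models one of the hook words: a function address, or
 * NULL (0) when the weak symbol was never defined.
 * Returns the number of hooks that were called. */
static int run_init_hooks(init_hook_fn hardware_hook, init_hook_fn software_hook)
{
    int calls = 0;

    if (hardware_hook != NULL) {    /* cmp r3, #0 / beq .LC24 */
        hardware_hook();            /* indirect_call r3 */
        calls++;
    }
    if (software_hook != NULL) {    /* cmp r3, #0 / beq .LC25 */
        software_hook();            /* indirect_call r3 */
        calls++;
    }
    return calls;
}

/* Example hook used only to exercise the sketch. */
static int hook_runs;
static void sample_hook(void) { hook_runs++; }
```

A platform that needs early setup defines hardware_init_hook or software_init_hook, and the startup code picks it up without any other change.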

argc and argv Initialization

The Newlib ARM startup code has a simple solution for argc and argv: they are initialized to 0:

    movs    r0, #0      /*  no arguments  */
    movs    r1, #0      /*  no argv either */

Call Global Constructors

Next, we call global constructors. The code is provisioned such that it will work if global constructors are not present. Constructors are enabled in our configuration.

First, we store the values of r0 and r1 to r4 and r5, since we will be calling other functions:

movs    r4, r0
    movs    r5, r1

Next, we register the _fini function (which is actually __libc_fini_array thanks to the preprocessor) with atexit. This ensures that global destructors will be run when exiting the program.

Newlib supports a "light exit" implementation, which is controlled by the _LITE_EXIT compiler definition. For embedded systems, this is a wonderful option. Our programs do not perform normal exit procedures; they simply run until power is removed. Cleaning up after the program is not a requirement, and exit functions can be discarded.

If _LITE_EXIT is enabled, atexit is weakly linked. If atexit is linked in our application, it will be called with __libc_fini_array as an argument. If it is not defined, the global destructors will not be registered. Our current configuration is using _LITE_EXIT without atexit.

#ifdef _LITE_EXIT
    /* Make reference to atexit weak to avoid unconditionally pulling in
       support code.  Refer to comments in __atexit.c for more details.  */
    .weak   FUNCTION(atexit)
    ldr r0, .Latexit
    cmp r0, #0
    beq .Lweak_atexit
#endif
    ldr r0, .Lfini
    bl  FUNCTION (atexit)

After the global destructors are registered, the _init function is invoked (which is actually __libc_init_array thanks to the preprocessor). This function calls the global constructors, and it is always run.

.Lweak_atexit:
    bl  FUNCTION (_init)

Once we have called the global constructors, the values for argc and argv are moved into the function argument registers r0 and r1 so we can call main:

movs    r0, r4
    movs    r1, r5

Call main

With the argc and argv function arguments stored in r0 and r1, we can safely call main:

bl  FUNCTION (main)

Program Exit

After main returns, exit is called with main's return value, which is still in r0. We do not expect exit to return, but if it does then we trap the program in SWI_Exit.

    bl  FUNCTION (exit)     /* Should not return.  */

#if __thumb__ && !defined(PREFER_THUMB)
    /* Come out of Thumb mode.  This code should be redundant.  */

    mov a4, pc
    bx  a4

    .code 32
    .global change_back
change_back:
    /* Halt the execution.  This code should never be executed.  */
    /* With no debug monitor, this probably aborts (eventually).
       With a Demon debug monitor, this halts cleanly.
       With an Angel debug monitor, this will report 'Unknown SWI'.  */
    swi SWI_Exit
#endif

Now that we've looked over the _start function, let's look at the various functions that _start called.


__libc_init_array

The __libc_init_array() function can be found in newlib/libc/misc/init.c.

Depending on the architecture, compiler, and linker, constructors are placed into the .init_array section or the .init section. The Newlib ARM startup code is flexible and can handle any combination of cases. If HAVE_INITFINI_ARRAY is not defined, _start calls _init directly instead of calling __libc_init_array. If HAVE_INITFINI_ARRAY is defined, __libc_init_array calls the constructors in the .preinit_array and .init_array sections. If .init is also present for an architecture, the constructors stored in that section will also be invoked.

ARM code typically uses .init_array instead of .init. In our current case, HAVE_INITFINI_ARRAY is defined and HAVE_INIT_FINI is not.

/* Handle ELF .{pre_init,init,fini}_array sections.  */
#include <sys/types.h>

#ifdef HAVE_INITFINI_ARRAY

/* These magic symbols are provided by the linker.  */
extern void (*__preinit_array_start []) (void) __attribute__((weak));
extern void (*__preinit_array_end []) (void) __attribute__((weak));
extern void (*__init_array_start []) (void) __attribute__((weak));
extern void (*__init_array_end []) (void) __attribute__((weak));

#ifdef HAVE_INIT_FINI
extern void _init (void);
#endif

/* Iterate over all the init routines.  */
void
__libc_init_array (void)
{
  size_t count;
  size_t i;

  count = __preinit_array_end - __preinit_array_start;
  for (i = 0; i < count; i++)
    __preinit_array_start[i] ();

#ifdef HAVE_INIT_FINI
  _init ();
#endif

  count = __init_array_end - __init_array_start;
  for (i = 0; i < count; i++)
    __init_array_start[i] ();
}
#endif
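
To make the call order concrete, here is a minimal host-side sketch that imitates the three phases of __libc_init_array. The tables, handler names, and the call log are hypothetical stand-ins of my own; on a real target the table bounds come from the linker, not from ordinary arrays.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*init_fn) (void);

/* Hypothetical call log so the order can be inspected afterwards. */
static char call_log[8];
static size_t call_count;

static void record (char tag)
{
  assert (call_count < sizeof call_log - 1);
  call_log[call_count++] = tag;
  call_log[call_count] = '\0';
}

static void preinit_fn (void)  { record ('p'); }
static void legacy_init (void) { record ('i'); }  /* plays the role of _init */
static void ctor_a (void)      { record ('a'); }
static void ctor_b (void)      { record ('b'); }

/* Stand-ins for the linker-provided .preinit_array and .init_array tables. */
static init_fn demo_preinit_array[] = { preinit_fn };
static init_fn demo_init_array[]    = { ctor_a, ctor_b };

/* Walk a table from start (inclusive) to end (exclusive). */
static void run_table (init_fn *start, init_fn *end)
{
  size_t count = (size_t) (end - start);
  for (size_t i = 0; i < count; i++)
    start[i] ();
}

/* Runs the three phases in order and returns the call log. */
const char *demo_init_order (void)
{
  call_count = 0;
  call_log[0] = '\0';
  run_table (demo_preinit_array, demo_preinit_array + 1);
  legacy_init ();
  run_table (demo_init_array, demo_init_array + 2);
  return call_log;
}
```

Calling demo_init_order() runs the pre-init entry, then the legacy init function, then the two constructors in table order.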


__libc_fini_array

The __libc_fini_array() function can be found in newlib/libc/misc/fini.c.

Depending on the architecture, compiler, and linker, destructors are placed into the .fini_array section or the .fini section. If the program is configured with full exit support, these functions will be executed before the program exits. In a _LITE_EXIT configuration, the destructors are ignored.

Like __libc_init_array, the functionality is decided by two macros. If HAVE_INITFINI_ARRAY is not defined, _start registers _fini with atexit instead of __libc_fini_array. If HAVE_INITFINI_ARRAY is defined, the __libc_fini_array function is registered. When __libc_fini_array is invoked by exit, it calls the destructors in the .fini_array section in reverse order. If .fini is also present for an architecture, the destructors stored in that section will also be invoked.

ARM code typically uses .fini_array instead of _fini. In our current case, HAVE_INITFINI_ARRAY is defined and HAVE_INIT_FINI is not.

/* Handle ELF .{pre_init,init,fini}_array sections.  */
#include <sys/types.h>

#ifdef HAVE_INITFINI_ARRAY
extern void (*__fini_array_start []) (void) __attribute__((weak));
extern void (*__fini_array_end []) (void) __attribute__((weak));

#ifdef HAVE_INIT_FINI
extern void _fini (void);
#endif

/* Run all the cleanup routines.  */
void
__libc_fini_array (void)
{
  size_t count;
  size_t i;

  count = __fini_array_end - __fini_array_start;
  for (i = count; i > 0; i--)
    __fini_array_start[i-1] ();

#ifdef HAVE_INIT_FINI
  _fini ();
#endif
}
#endif
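
The reverse loop is written as "i = count; i > 0; i--" with an index of i-1 because size_t is unsigned, so a condition like "i >= 0" would never become false. The sketch below shows the same idiom on a hypothetical table of three destructors; the table and trace buffer are illustration-only names.

```c
#include <stddef.h>

typedef void (*fini_fn) (void);

/* Hypothetical trace of which destructor ran, in call order. */
static int demo_trace[4];
static size_t demo_trace_len;

static void dtor_first (void)  { demo_trace[demo_trace_len++] = 1; }
static void dtor_second (void) { demo_trace[demo_trace_len++] = 2; }
static void dtor_third (void)  { demo_trace[demo_trace_len++] = 3; }

/* Stand-in for the linker-provided .fini_array table. */
static fini_fn demo_fini_array[] = { dtor_first, dtor_second, dtor_third };

/* Walks the table from the last entry to the first. */
const int *demo_run_fini (void)
{
  size_t count = sizeof demo_fini_array / sizeof demo_fini_array[0];

  demo_trace_len = 0;
  for (size_t i = count; i > 0; i--)
    demo_fini_array[i - 1] ();
  return demo_trace;
}
```

The last table entry runs first, which mirrors the order in which the matching constructors were run.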

Heap Limit and malloc

The __heap_limit variable set during the _start routine is used by _sbrk, found in libgloss/arm/syscalls.c.

The _sbrk function is used to allocate memory for the platform. For more information on heap allocation and sbrk, read this article about the glibc heap implementation.

While the _sbrk function is not directly used in the startup code, we can see that setting __heap_limit during _start is effectively configuring the program's heap. If the _start routine does not update __heap_limit, it keeps its default sentinel value (0xcafedead) and _sbrk will not detect allocations that reach beyond the heap limit. The only remaining check is the comparison against the stack pointer.

/* Heap limit returned from SYS_HEAPINFO Angel semihost call.  */
uint __heap_limit = 0xcafedead;

void * __attribute__((weak))
_sbrk (ptrdiff_t incr)
{
  extern char end asm ("end"); /* Defined by the linker.  */
  static char * heap_end;
  char * prev_heap_end;

  if (heap_end == NULL)
    heap_end = & end;

  prev_heap_end = heap_end;

  if ((heap_end + incr > stack_ptr)
      /* Honour heap limit if it's valid.  */
      || (__heap_limit != 0xcafedead && heap_end + incr >
         (char *)__heap_limit))
    {
      errno = ENOMEM;
      return (void *) -1;
    }

  heap_end += incr;

  return (void *) prev_heap_end;
}
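
The following host-side sketch imitates the two checks in _sbrk using a small static array instead of real RAM. It is not the libgloss implementation: the arena, the offset bookkeeping, and the demo_ names are my own assumptions, chosen so the logic can be exercised on a desktop machine.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_NO_LIMIT ((size_t) 0xcafedead)  /* sentinel: no limit configured */

static char demo_arena[256];   /* stands in for the RAM between "end" and the stack */
static size_t demo_used;       /* offset of the current program break */
static size_t demo_limit = DEMO_NO_LIMIT;

/* Hypothetical helper: restart the sketch with a given limit in bytes. */
void demo_heap_reset (size_t limit)
{
  demo_used = 0;
  demo_limit = limit;
}

size_t demo_heap_used (void)
{
  return demo_used;
}

void *demo_sbrk (ptrdiff_t incr)
{
  size_t room = sizeof demo_arena - demo_used;
  size_t new_used;
  void *prev = demo_arena + demo_used;

  /* Reject moves past either end of the arena (the "stack collision" check). */
  if (incr > (ptrdiff_t) room || incr < -(ptrdiff_t) demo_used)
    {
      errno = ENOMEM;
      return (void *) -1;
    }

  new_used = (size_t) ((ptrdiff_t) demo_used + incr);

  /* Honour the limit only when it is not the sentinel value. */
  if (demo_limit != DEMO_NO_LIMIT && new_used > demo_limit)
    {
      errno = ENOMEM;
      return (void *) -1;
    }

  demo_used = new_used;
  return prev;
}
```

With the sentinel in place, only the arena bounds stop an allocation. Once a limit of 24 bytes is set, a second 16-byte request fails even though the arena still has room.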

atexit Family

The atexit family of functions is responsible for registering functions to be called when the program exits, including the global destructors. We will explore the following functions:

  • atexit
  • __cxa_atexit
  • __register_exitproc

We don't typically need exit functionality for our embedded platforms. Rarely is there a concept of a program "exit" which requires cleanup of resources. Instead, our programs run until they are terminated by a reset, an off switch, or a loss of power.

Newlib provides for this behavior through the _LITE_EXIT compilation option. This option changes behavior related to the exit-time requirements and reduces our binary size. Our program is technically compiled under _LITE_EXIT, but we will still analyze the normal exit-related behavior for instructional purposes.

The Newlib code comments are helpful in explaining the differences between the two exit configurations. Under normal circumstances, we can expect the following exit call graphs (a -> indicates "invokes"):

Default (without lite exit) call graph is like:
 *  _start -> atexit -> __register_exitproc
 *  _start -> __libc_init_array -> __cxa_atexit -> __register_exitproc
 *  on_exit -> __register_exitproc
 *  _start -> exit -> __call_exitprocs

When lite exit is enabled, the call graph changes. The references to atexit, __register_exitproc, and __call_exitprocs become weak references, so these functions may not be linked into the final program. These function call stacks are modified (a w-> indicates a weak reference):

Lite exit makes some of above calls as weak reference, so that size
expansive  functions __register_exitproc and __call_exitprocs may 
not be linked. These calls are:
 *    _start w-> atexit
 *    __cxa_atexit w-> __register_exitproc
 *    exit w-> __call_exitprocs

Let's look at how these exit functions operate.


atexit

The atexit function is used to register functions that should be invoked when the program exits. Most notably, this call is used to register the function that runs the destructors in .fini or .fini_array during the startup process. If the _LITE_EXIT configuration is used, this registration step may be skipped.

The atexit function is implemented in newlib/libc/stdlib/atexit.c. This implementation forwards the input function argument to __register_exitproc while noting that the call originated from atexit (using the __et_atexit argument).

#include <stdlib.h>
#include "atexit.h"

int
atexit (void (*fn) (void))
{
  return __register_exitproc (__et_atexit, fn, NULL, NULL);
}


__cxa_atexit

The __cxa_atexit call is used similarly to atexit, but often for handling functions to be called when a dynamic library is unloaded. In many implementations, such as this one, atexit and __cxa_atexit share implementations.

The __cxa_atexit function is implemented in newlib/libc/stdlib/cxa_atexit.c. This implementation forwards the input function and arguments to __register_exitproc while indicating that the call originated from __cxa_atexit (using the __et_cxa argument).

If the _LITE_EXIT configuration is used, then __register_exitproc may be weakly linked. In this case, __cxa_atexit will blindly return success (0).

int __cxa_atexit (void (*fn) (void *), void *arg, void *d)
{
#ifdef _LITE_EXIT
  /* Refer to comments in __atexit.c for more details of lite exit.  */
  int __register_exitproc (int, void (*fn) (void), void *, void *)
    __attribute__ ((weak));

  if (!__register_exitproc)
    return 0;
  else
#endif
    return __register_exitproc (__et_cxa, (void (*)(void)) fn, arg, d);
}
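
The weak-reference check is the heart of the lite exit approach, so here is a small sketch of the pattern in isolation. It assumes GCC or Clang targeting an ELF platform, where an undefined weak symbol resolves to a null address. The demo_optional_register hook is hypothetical and is deliberately never defined.

```c
#include <stddef.h>

/* Hypothetical optional hook. The weak attribute lets the program link even
   when no definition exists; the symbol's address is then NULL. */
extern int demo_optional_register (void (*fn) (void)) __attribute__ ((weak));

static void demo_noop (void) { }

/* Reports "success" (0) when the hook was not linked, in the same spirit as
   __cxa_atexit under _LITE_EXIT; otherwise forwards to the hook. */
int demo_register_if_linked (void)
{
  if (!demo_optional_register)
    return 0;
  return demo_optional_register (demo_noop);
}

/* Returns 1 if a definition of the hook was linked in, 0 otherwise. */
int demo_hook_is_linked (void)
{
  return demo_optional_register != NULL;
}
```

Because nothing defines the hook here, the registration request is silently dropped. Linking in an object file that defines demo_optional_register would change the behavior without recompiling this code.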


__register_exitproc

We've seen two uses of __register_exitproc, the common routine that handles all atexit-like functionality. __register_exitproc records functions that will be called when the program exits or when a shared library is unloaded.

The __register_exitproc function is implemented in newlib/libc/stdlib/__atexit.c. This function must support a variety of configurations and behaviors: _LITE_EXIT vs standard exit, single-threaded vs multi-threaded, atexit vs __cxa_atexit. I've stripped out some of the #ifdef blocks to make the code more readable.

The function starts by acquiring a lock if threading is enabled:

int __register_exitproc (int type, void (*fn) (void), void *arg, void *d)
{
  struct _on_exit_args * args;
  register struct _atexit *p;

#ifndef __SINGLE_THREAD__
  /* The atexit lock is acquired here.  */
#endif

And we grab our _GLOBAL_ATEXIT list of functions. If this has not been initialized yet, we assign it to the initial list value.

p = _GLOBAL_ATEXIT;
if (p == NULL)
  _GLOBAL_ATEXIT = p = _GLOBAL_ATEXIT0;

The C standard requires atexit to support registering at least 32 functions, which is the default value of _ATEXIT_SIZE. Newlib handles this by allocating memory in blocks of 32 entries. Once the current block is full, a new block will be allocated and added to the list.

If there is no malloc implementation for the system, or if dynamic allocations for atexit are not allowed, the function will fail and return an error code instead of allocating a new block.

if (p->_ind >= _ATEXIT_SIZE)
  {
#if !defined (_ATEXIT_DYNAMIC_ALLOC) || !defined (MALLOC_PROVIDED)
#ifndef __SINGLE_THREAD__
    /* The atexit lock is released here.  */
#endif
    return -1;
#else
    p = (struct _atexit *) malloc (sizeof *p);
    if (p == NULL)
      {
#ifndef __SINGLE_THREAD__
        /* The atexit lock is released here.  */
#endif
        return -1;
      }
    p->_ind = 0;
    p->_next = _GLOBAL_ATEXIT;
    _GLOBAL_ATEXIT = p;
    p->_on_exit_args_ptr = NULL;
#endif
  }

We observed two different type values for this call: __et_atexit and __et_cxa. If __cxa_atexit was called, additional arguments were provided and need to be stored for future retrieval. The arguments and the function pointer are stored at the current index, which is then incremented.

if (type != __et_atexit)
  {
    args = &p->_on_exit_args;
    args->_fnargs[p->_ind] = arg;
    args->_fntypes |= (1 << p->_ind);
    args->_dso_handle[p->_ind] = d;
    if (type == __et_cxa)
      args->_is_cxa |= (1 << p->_ind);
  }
p->_fns[p->_ind++] = fn;

Once we are done, we can unlock and exit the function:

#ifndef __SINGLE_THREAD__
  /* The atexit lock is released here.  */
#endif
  return 0;
}
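
The bookkeeping is easier to follow in a stripped-down form. The sketch below keeps a single fixed block of 32 slots and two bitmasks, loosely modeled on the fields used above. The struct layout, enum, and demo_ names are simplified assumptions of mine rather than Newlib's actual definitions, and there is no locking or dynamic allocation.

```c
#include <stddef.h>

#define DEMO_SLOTS 32   /* same capacity as one block in the walkthrough */

enum demo_type { DEMO_ATEXIT, DEMO_ONEXIT, DEMO_CXA };

struct demo_block
{
  int ind;                           /* next free slot */
  void (*fns[DEMO_SLOTS]) (void);
  void *fnargs[DEMO_SLOTS];
  unsigned long fntypes;             /* bit n set: slot n takes arguments */
  unsigned long is_cxa;              /* bit n set: slot n used the cxa path */
};

static struct demo_block demo_table;

/* Hypothetical handler used only to have something to register. */
void demo_handler (void) { }

void demo_reset (void)
{
  demo_table.ind = 0;
  demo_table.fntypes = 0;
  demo_table.is_cxa = 0;
}

int demo_register (enum demo_type type, void (*fn) (void), void *arg)
{
  struct demo_block *p = &demo_table;

  if (p->ind >= DEMO_SLOTS)
    return -1;                       /* block full, nothing else to try */

  if (type != DEMO_ATEXIT)
    {
      p->fnargs[p->ind] = arg;
      p->fntypes |= 1UL << p->ind;
      if (type == DEMO_CXA)
        p->is_cxa |= 1UL << p->ind;
    }
  p->fns[p->ind++] = fn;
  return 0;
}

unsigned long demo_fntypes (void) { return demo_table.fntypes; }
unsigned long demo_is_cxa (void)  { return demo_table.is_cxa; }
int demo_count (void)             { return demo_table.ind; }
```

Registering a plain handler, then a cxa-style handler, then an on_exit-style handler leaves bits 1 and 2 set in fntypes and only bit 1 set in is_cxa. That is exactly the information the exit path needs to pick the right calling signature for each slot.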

Automatic Registration of Destructors

One interesting note is that Newlib provides features for registering global destructors (in .fini or .fini_array) within the C library, rather than in startup code. This automatic registration code is provided in newlib/libc/stdlib/__call_atexit.c.

A __libc_fini symbol is declared as a weak reference. You can define __libc_fini to _fini or __libc_fini_array in your linker script, and the C library will handle the registration so that your startup code does not need to call atexit.

extern char __libc_fini __attribute__((weak));

A registration function is defined and marked as a high-priority constructor, which places it into the .init or .init_array section. Since exit-time functions are called in LIFO order, and the .fini and .fini_array functions should run last, the constructor attempts to be the first to register with atexit.

static void register_fini(void) __attribute__((constructor (0)));

The register function checks for a valid __libc_fini symbol and registers the destructors if it's defined.

static void register_fini(void)
{
  if (&__libc_fini) {
#ifdef HAVE_INITFINI_ARRAY
    extern void __libc_fini_array (void);
    atexit (__libc_fini_array);
#else
    extern void _fini (void);
    atexit (_fini);
#endif
  }
}
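
Constructor priorities can be tried out on a desktop GCC or Clang toolchain. In this sketch the function names and the log buffer are hypothetical. I use priorities 101 and 102 because, as I understand the GCC documentation, values up to 100 are reserved for the implementation, which is why a library such as Newlib can claim 0.

```c
#include <stddef.h>

/* Hypothetical log of constructor execution order. Static storage is
   zero-initialized, so the log is a valid string after two entries. */
static char demo_ctor_log[4];
static size_t demo_ctor_count;

/* Lower priority numbers run first, regardless of declaration order. */
static void demo_late (void)  __attribute__ ((constructor (102)));
static void demo_early (void) __attribute__ ((constructor (101)));

static void demo_late (void)  { demo_ctor_log[demo_ctor_count++] = 'L'; }
static void demo_early (void) { demo_ctor_log[demo_ctor_count++] = 'E'; }

/* Returns the order in which the constructors ran before main. */
const char *demo_ctor_order (void)
{
  return demo_ctor_log;
}
```

By the time main runs, demo_early has already executed ahead of demo_late. A constructor that runs first is also the first to reach atexit, so whatever it registers is called last at exit.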

exit Family

To complete our analysis of _start and crt0.s, we'll look at the exit family of functions:

  • exit
  • __call_exitprocs
  • _exit
  • _kill_shared


exit

The exit function is implemented in newlib/libc/stdlib/exit.c.

The Newlib exit function is a wrapper. exit calls all registered exit-time functions via __call_exitprocs. If the _LITE_EXIT configuration is used, __call_exitprocs is only weakly referenced and may not be linked, in which case this step is skipped.

Following the invocation of exit-time destructors, a _GLOBAL_REENT->__cleanup function is called. This function flushes stdio buffers, if necessary.

Once all destruction and cleanup activities are complete, control proceeds to _exit.

void exit (int code)
{
#ifdef _LITE_EXIT
  /* Refer to comments in __atexit.c for more details of lite exit.  */
  void __call_exitprocs (int, void *) __attribute__((weak));
  if (__call_exitprocs)
#endif
    __call_exitprocs (code, NULL);

  if (_GLOBAL_REENT->__cleanup)
    (*_GLOBAL_REENT->__cleanup) (_GLOBAL_REENT);
  _exit (code);
}


__call_exitprocs

The __call_exitprocs function is responsible for calling exit-time destructor routines that were registered with the atexit family of functions. __call_exitprocs is implemented in newlib/libc/stdlib/__call_atexit.c. I've stripped out some of the #ifdef blocks to make the code more readable.

The function starts by acquiring a lock if threading is enabled:

void __call_exitprocs (int code, void *d)
{
  register struct _atexit *p;
  struct _atexit **lastp;
  register struct _on_exit_args * args;
  register int n;
  int i;
  void (*fn) (void);

#ifndef __SINGLE_THREAD__
  /* The atexit lock is acquired here.  */
#endif

Next the linked-list of exit-time functions is accessed. Note the restart label, as it will be referenced later.

 restart:
  p = _GLOBAL_ATEXIT;
  lastp = &_GLOBAL_ATEXIT;

For each entry in the list, the following actions are performed:

  • If unloading a shared library, check that the _dso_handle matches the unloaded library
    • Skip to the next entry if there is a mismatch
  • The function is removed from the list
    • If it was the last entry in the block, the index is decremented
    • Otherwise, the slot is set to NULL
  • Check if the function has been called
    • Skip to the next entry if it has already been called (the slot is NULL)
  • Call the function, passing the stored arguments if the entry was registered with any

The loop also checks the index after calling the destructor. If that function registered new exit-time functions, the loop jumps back to restart in order to preserve the destructor LIFO order.

while (p)
    {
      args = &p->_on_exit_args;
      for (n = p->_ind - 1; n >= 0; n--)
        {
          int ind;

          i = 1 << n;

          /* Skip functions not from this dso.  */
          if (d && (!args || args->_dso_handle[n] != d))
            continue;

          /* Remove the function now to protect against the
             function calling exit recursively.  */
          fn = p->_fns[n];
          if (n == p->_ind - 1)
            p->_ind--;
          else
            p->_fns[n] = NULL;

          /* Skip functions that have already been called.  */
          if (!fn)
            continue;

          ind = p->_ind;

          /* Call the function.  */
          if (!args || (args->_fntypes & i) == 0)
            fn ();
          else if ((args->_is_cxa & i) == 0)
            (*((void (*)(int, void *)) fn))(code, args->_fnargs[n]);
          else
            (*((void (*)(void *)) fn))(args->_fnargs[n]);

          /* The function we called call atexit and registered another
             function (or functions).  Call these new functions before
             continuing with the already registered functions.  */
          if (ind != p->_ind || *lastp != p)
            goto restart;
        } // end of for - while still in effect

At the end of each block of exit-functions, the now-empty block is removed from the list and the memory is freed. If malloc is not provided or dynamic allocations in atexit are disallowed, the function ends after the first block.

// while still in effect
#if !defined (_ATEXIT_DYNAMIC_ALLOC) || !defined (MALLOC_PROVIDED)
      break;
#else
      /* Move to the next block.  Free empty blocks except the last one,
         which is part of _GLOBAL_REENT.  */
      if (p->_ind == 0 && p->_next)
        {
          /* Remove empty block from the list.  */
          *lastp = p->_next;
          free (p);
          p = *lastp;
        }
      else
        {
          lastp = &p->_next;
          p = p->_next;
        }
#endif
    } // end of while

The lock is released, and the function exits.

#ifndef __SINGLE_THREAD__
  /* The atexit lock is released here.  */
#endif
}
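
To see why the restart jump matters, consider a handler that registers another handler while exit processing is already underway. The sketch below is a single-block imitation of the loop with hypothetical handler names. It leaves out locking, arguments, and DSO handles, so treat it as an illustration of the control flow only.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX 8

static void (*demo_fns[DEMO_MAX]) (void);
static int demo_ind;

static char demo_order[DEMO_MAX + 1];
static int demo_order_len;

static int demo_atexit (void (*fn) (void))
{
  if (demo_ind >= DEMO_MAX)
    return -1;
  demo_fns[demo_ind++] = fn;
  return 0;
}

static void note (char c)
{
  assert (demo_order_len < DEMO_MAX);
  demo_order[demo_order_len++] = c;
  demo_order[demo_order_len] = '\0';
}

static void handler_a (void)    { note ('a'); }
static void handler_late (void) { note ('L'); }
/* Registers a new handler while exit processing is in progress. */
static void handler_b (void)    { note ('b'); demo_atexit (handler_late); }
static void handler_c (void)    { note ('c'); }

static void demo_call_exitprocs (void)
{
  int n;

restart:
  for (n = demo_ind - 1; n >= 0; n--)
    {
      void (*fn) (void) = demo_fns[n];
      int ind;

      /* Remove the entry before calling it. */
      if (n == demo_ind - 1)
        demo_ind--;
      else
        demo_fns[n] = NULL;

      if (!fn)
        continue;

      ind = demo_ind;
      fn ();

      /* A new registration changed the index: start over from the top. */
      if (ind != demo_ind)
        goto restart;
    }
}

const char *demo_exit_order (void)
{
  demo_ind = 0;
  demo_order_len = 0;
  demo_order[0] = '\0';
  demo_atexit (handler_a);
  demo_atexit (handler_b);
  demo_atexit (handler_c);
  demo_call_exitprocs ();
  return demo_order;
}
```

The handlers run in the order c, b, L, a. The handler registered by handler_b runs immediately after it and before the older handler_a, which is the LIFO behavior the restart jump is there to protect.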


_exit

The _exit function is found at libgloss/arm/_exit.c. This function is simply a wrapper around _kill_shared.

void _exit (int status)
{
  /* The same SWI is used for both _exit and _kill.
     For _exit, call the SWI with "reason" set to
     ADP_Stopped_ApplicationExit to mark a standard exit.
     Note: The RDI implementation of _kill_shared throws away all its
     arguments and all implementations ignore the first argument.  */
  _kill_shared (-1, status, ADP_Stopped_ApplicationExit);
}


_kill_shared

The _kill_shared function is implemented in libgloss/arm/_kill.c.

When we remove the semihosting / debug monitor support, this function does nothing:

int _kill_shared (int pid, int sig, int reason)
{
  (void) pid; (void) sig;
  __builtin_unreachable();
}


When debug monitor support is included, the __builtin_unreachable() call makes sense, because the debug monitor will trap the code in an SWI handler. If we have compiled without debug monitor support, this function will return up the call stack to crt0.s, and we will invoke the SWI handler anyway:

swi    SWI_Exit

Visual Summary

Startup Activity Checklist

In the first article of this series, we reviewed a broad range of startup activities that occur before main is called.

Here is a checklist of actions that were observed in the Newlib ARM program startup procedures:

  • [x] Early low-level initialization of the processor/hardware
  • [x] Stack initialization
  • [x] Frame pointer initialization
  • [x] C/C++ runtime setup
    • [x] Handle relocations (some sections are copied from flash to RAM)
    • [x] Initialize .bss
    • [x] Call global constructors
    • [x] Prepare argc, argv (set to 0)
    • [ ] Prepare environment variables
    • [x] Heap initialization
    • [ ] stdio initialization
    • [ ] Initialize exception support
    • [x] Register destructors and other exit-time functionality
  • [ ] System scaffolding setup
    • [ ] Threading support
    • [ ] Thread local storage
    • [ ] Buffer overrun detection
    • [ ] Run-time error checks
    • [ ] Locale settings
    • [ ] Math error handling
    • [ ] Math precision
  • [x] Jump to main
  • [x] Exit after main

Related Articles

Demystifying ARM Floating Point Compiler Options

When I first started bringing up new ARM platforms, I was pretty confused by the various floating point options such as -mfloat-abi=softfp or -mfpu=fpv4-sp-d16. I imagine this is confusing to other developers as well, so I'd like to share my ARM floating-point cheat sheet with the world.

An Overview of the ARM Floating-Point Architecture

Before we dive into compiler options, there are a few ARM floating-point details we should familiarize ourselves with: the ARM EABI, VFP, and NEON.


ARM EABI

An ABI is a specification which defines the rules that a generated program must follow to work with a specific platform or interface. The ARM EABI defines the rules for an ARM platform, and your compiler will build your program according to those rules.

The ARM EABI specification defines two incompatible ABIs: one which uses floating-point registers for function arguments, and another which does not. If we are not using hardware floating-point operations, we can simply build our program without using the floating-point compatible ABI.

Since ARM defines a standard floating-point instruction set, we can still utilize the floating-point ABI even if our chip does not support the actual hardware. If floating-point hardware is not present, the instructions will be trapped and executed by a floating-point emulation module instead. The only real difference in functionality is slower execution speed when using software emulation.

Since the ABI defines interfaces for our programs, we must compile and link all of our components and libraries using the same ABI.

Vector Floating Point (VFP)

Vector Floating Point (VFP) is the name for ARM's floating-point extension. Prior to ARMv8, VFP was implemented as a coprocessor extension. The VFP coprocessor supports both single and double-precision floating point operations according to the IEEE 754 standard. For practical purposes, VFP is not useful for vector operations and should be considered a normal scalar floating-point unit (FPU). VFP has been replaced with NEON as of ARMv8.

The VFP extensions are optional parts of the ARM architecture, though the majority of Cortex-A processors do provide a floating-point unit. Some Cortex-A8 devices may utilize a reduced VFPLite module instead of a full VFP module. This VFPLite module requires roughly a 10x increase in clock cycles per floating-point operation.

VFP Versions

Here's a high-level summary of the different VFP versions that have been released throughout the years.

  • VFPv1
    • Obsoleted by ARM
  • VFPv2
    • 16 64-bit FPU registers
    • Optional extension to the ARM instruction set in the ARMv5TE, ARMv5TEJ, ARMv6, and ARMv6K architectures
    • Optional extension to the ARM and Thumb instruction set in the ARMv6T2 architecture
    • Supports standard FPU arithmetic (add, sub, neg, mul, div), full square root
  • VFPv3
    • Backwards compatible with VFPv2, except that it cannot trap floating-point exceptions
    • Adds VCVT instructions to convert between scalar, float and double
    • Adds immediate mode to VMOV such that constants can be loaded into FPU registers.
    • VFPv3-D32
      • 32 64-bit FPU registers
      • Implemented on most Cortex-A8 and A9 ARMv7 processors
    • VFPv3-D16
      • 16 64-bit FPU registers
      • Implemented on Cortex-R4 and R5 processors and the Tegra 2 (Cortex-A9).
    • VFPv3-F16
      • Uncommon
      • Supports IEEE754-2008 half-precision (16-bit) floating point as a storage format
    • VFPv3U
      • A variant of VFPv3 that supports the trapping of floating-point exceptions to support code.
      • Can support single- or half-precision floating point
  • VFPv4
    • Built on VFPv3
    • Adds half-precision support as a storage format
    • Adds fused multiply-accumulate instructions
    • VFPv4-D32
      • 32 64-bit FPU registers
      • Implemented on the Cortex-A12 and A15 ARMv7 processors
      • Cortex-A7 optionally has VFPv4-D32 (in the case of an FPU with NEON)
    • VFPv4-D16
      • 16 64-bit FPU registers
      • Implemented on Cortex-A5 and A7 processors (in case of an FPU without NEON)
    • VFPv4U
      • A variant of VFPv4 that supports the trapping of floating-point exceptions to support code
      • Can support single- or half-precision floating point
  • VFPv5
    • Implemented on Cortex-M7 when single and double-precision floating-point core option exists


NEON

NEON, the "Advanced Single Instruction Multiple Data (SIMD) Extension", is ARM's successor to the VFP coprocessor. NEON is a VFP extension which allows for efficient matrix and vector data manipulation and is commonly used in signal-processing applications. Prior to ARMv8, the ARM architecture distinguished between VFP and NEON floating-point support. NEON was not fully IEEE 754 compliant, and there were instructions that VFP supported which NEON did not. These issues have been resolved with ARMv8.

NEON sports a combined 64- and 128-bit SIMD instruction set and shares the same floating-point registers as used in VFP. Some devices, such as the Cortex-A8 and Cortex-A9 lines, support 128-bit vectors but operate on 64 bits at a time. Newer processors such as the Cortex-A15 can operate on 128 bits at a time.

NEON remains an optional part of the ARM architecture. However, NEON is included in all Cortex-A8 devices.


SVE

The Scalable Vector Extension (SVE) is the next-generation ARM SIMD instruction set. Currently it is only targeting ARMv8-A and the aarch64 ISA.

Compiler Options

Now that we have a high-level understanding of ARM floating-point technologies, let's take a look at the compiler options we can use. I will be providing information relevant to the GNU and clang toolchains. For more information on the ARM compiler options, please see this reference documentation.

Let's dive into the two major compilation options: -mfloat-abi and -mfpu.


-mfloat-abi

The -mfloat-abi=<name> option is used to select which ARM ABI is used. This option also controls whether floating-point instructions may be used.

Here are your float-abi options:

  • soft: full software floating-point support
  • softfp: Allows use of floating-point instructions but maintains compatibility with the soft-float ABI
  • hard: Uses floating-point instructions and the floating-point ABI.

Each target architecture has a default value which is used if no option is supplied.

Note well: the two ARM ABIs (hard-float and soft-float) are not link-compatible. Your entire program must be compiled using the same ABIs. If a pre-compiled library is not supplied with your target floating-point ABI, you will need to recompile it for your own purposes.


soft

The soft option enables full software floating-point support. The compiler will not generate FPU instructions in soft mode. Instead, the compiler generates library calls to handle floating-point operations. The compiler also generates prologue and epilogue functions to pass floating-point arguments (float, double) into integer registers (one for float, two for double).

When using the soft option, the -mfpu flag is ignored.
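
The "one register for float, two for double" rule follows directly from the size of each type. The sketch below, which can be run on any host with IEEE 754 floats, shows the raw 32-bit words that would travel in integer registers. The demo_ function names are hypothetical, and the sketch does not model which registers the ABI actually assigns.

```c
#include <stdint.h>
#include <string.h>

/* This sketch assumes the usual 32-bit float and 64-bit double. */
_Static_assert (sizeof (float) == sizeof (uint32_t), "float must be 32 bits");
_Static_assert (sizeof (double) == sizeof (uint64_t), "double must be 64 bits");

/* A float fits in a single 32-bit integer register. */
uint32_t demo_float_bits (float value)
{
  uint32_t bits;
  memcpy (&bits, &value, sizeof bits);
  return bits;
}

static uint64_t demo_double_bits (double value)
{
  uint64_t bits;
  memcpy (&bits, &value, sizeof bits);
  return bits;
}

/* A double needs two 32-bit words, and therefore two integer registers. */
uint32_t demo_double_low_word (double value)
{
  return (uint32_t) demo_double_bits (value);
}

uint32_t demo_double_high_word (double value)
{
  return (uint32_t) (demo_double_bits (value) >> 32);
}
```

For example, 1.0f is the single word 0x3F800000, while the double 1.0 splits into a high word of 0x3FF00000 and a low word of 0.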


softfp

The softfp option is a hybrid between hard and soft. The compiler is allowed to generate hardware floating-point instructions, but it still uses the soft-float ABI. Like with soft, the compiler generates functions to pass floating-point arguments to integer registers. Depending on the chosen FPU (-mfpu), the compiler can choose when to use emulated or hardware floating-point instructions.

Since both soft and softfp use the same soft-float ABI, code built with either option can be linked together. However, when copying data from integer to floating-point registers, a pipeline stall is incurred for every copy. This additional overhead can impact the performance of your application, since data is being copied back-and-forth from the FPU registers when using floating-point arguments.


hard

The hard option enables full hardware floating-point support. The compiler generates floating-point instructions and uses the floating-point ABI. Floating-point function arguments are passed directly into FPU registers. Since there are no function prologue or epilogue requirements, no pipeline stalls are incurred with floating-point arguments. The hard float option will provide you with the highest performance, but does limit your compiled binary to the selected FPU.

When using the hard option, you must define an FPU using -mfpu.
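
If you are unsure which float ABI a given build is using, the compiler's predefined macros can tell you. This sketch relies on __arm__, __ARM_PCS_VFP, and __SOFTFP__ as I understand GCC and Clang to define them for 32-bit ARM targets. The demo_float_abi function itself is a hypothetical helper, and it is worth confirming the macros with your own toolchain (for example with arm-none-eabi-gcc -dM -E).

```c
/* Reports which floating-point calling convention this translation unit was
   built for, based on compiler-predefined macros. */
const char *demo_float_abi (void)
{
#if !defined (__arm__)
  return "not a 32-bit ARM target";
#elif defined (__ARM_PCS_VFP)
  return "hard";     /* FP arguments travel in FPU registers */
#elif defined (__SOFTFP__)
  return "soft";     /* no FP instructions are generated */
#else
  return "softfp";   /* FP instructions, soft-float calling convention */
#endif
}
```

Compiling the same file with different -mfloat-abi values and printing the result is a quick way to confirm that every library in a project agrees on the ABI.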


-mfpu

When using the hard or softfp float-abi, you should specify the FPU type using the -mfpu compiler flag. This flag specifies what floating-point hardware (or emulation) is available on your target architecture. When using the soft-float ABI, -mfpu determines the format of the floating-point values.

The -mfpu=<name> option supports the following FPU types: vfp, vfpv3, vfpv3-fp16, vfpv3-d16, vfpv3-d16-fp16, vfpv3xd, vfpv3xd-fp16, neon, neon-fp16, vfpv4, vfpv4-d16, fpv4-sp-d16, neon-vfpv4, fp-armv8, neon-fp-armv8, and crypto-neon-fp-armv8.

Each of the FPU options corresponds to the floating-point architectures described above, and some options represent supersets. If you don't care about the specific VFP type, you can select supersets (vfp, neon). You can also generalize VFP versions as supersets (vfpv3, vfpv4).


-mfp16-format

The -mfp16-format=<name> option allows you to specify the format of the half-precision floating-point type (__fp16). Valid options are none, ieee, and alternative. The default option is none, meaning __fp16 is not defined.

For more information, see the GNU Half-Precision Floating Point documentation.

Performance Impacts

In general, applications relying on floating-point operations will benefit from using the hard-float ABI.

Debian has some notes on VFP performance improvements and cites a proof-of-concept Ubuntu build which noted significant performance improvements with floating-point heavy libraries.

Further Reading

Silicon Labs Blue Gecko Starter Kit

Silicon Labs provides a Blue Gecko Starter Kit to support Bluetooth 5 development. The Blue Gecko kit is built around the EFR32 SoC line. The starter kit is modularized to support a wide variety of radio daughter boards for easy prototyping and chip comparisons. This kit provides a "mainboard" with two radio daughter boards: EFR32BG13 and the EFR32BG1. Only the EFR32BG13 radio board supports the new Bluetooth 5 LE Coded and LE 2M PHYs.

The starter kit contains a few push buttons and a coin cell battery holder, but does not include other on-board peripherals. A wide variety of headers are supplied for your prototyping needs.

More on the EFR32 Blue Gecko Starter Kit:


About the EFR32 Line

Silicon Labs offers Bluetooth 5 support in the EFR32 Blue Gecko line of SoCs. Similar to the Nordic nRF52810, the EFR32 series is built upon a Cortex-M4 processor. The EFR32 line sports a whopping +19dBm of programmable output power in their beefiest configuration.

Unlike Nordic's nRF52 line, the EFR32 line has many different chip configurations. Also, not all EFR32 chips support the new 2M PHY and LE Coded PHY, so be sure to include those features in your search. Silicon Labs provides a full list of EFR32 SoCs, so you can find one that fits your needs exactly.

If you wish to evaluate other radio chips in the EFR32 line, Silicon Labs likely provides a module that interfaces with the mainboard.

Sample EFR32 Specifications using maximum values:

  • ARM Cortex-M4 Processor (up to 40MHz)
  • Up to 1MB of flash
  • Up to 256kB SRAM
  • Up to +19dBm output power
  • AES256/128 hardware accelerator
  • 12-bit ADC
  • Current DAC (4-bit)
  • Up to 4x analog comparators
  • Low-energy UART
  • Up to 4x USART (SPI, UART, I2S, IrDA)
  • Up to 2x I2C
  • Up to 65 GPIOs
  • On-chip balun

EFR32BG12P632F512FM38 Specifications (Blue Gecko Starter Kit):

  • ARM Cortex-M4 40 MHz Processor
  • 512kB Flash + 64kB SRAM
  • +10dBm output power
  • -103.3dBm receiver sensitivity
  • AES-128/256 hardware accelerator
  • 12-bit ADC
  • Current DAC (4-bit)
  • Up to 4x analog comparators
  • 4x UART Ports
  • 3x USART ports (SPI, UART, I2C)
  • 2x I2C ports
  • 31 GPIOs

More on EFR32: