Exploring Startup Implementations: Newlib (ARM)

Updated: 20190909

For most programmers, a C or C++ program's life begins at the main function. They are blissfully unaware of the hidden steps that happen between invoking a program and executing main. Depending on the program and the compiler, there are all kinds of interesting functions that get run before main, automatically inserted by the compiler and linker and invisible to casual observers.

Unfortunately for programmers who are curious about the program startup process, the literature on what happens before main is quite sparse.

Embedded Artistry has been hard at working creating a C++ embedded framework. The final piece of the puzzle was implementing program startup code. To aid in the design of our framework's boot process, I performed an exploratory survey of existing program startup implementations. My goal is to identify a general program startup model. I also want to provide a more comprehensive look into how our programs get to main.

In this six-part series, we will be investigating what it takes to get to main:

  1. A General Overview of What Happens Before main()
  2. Exploring Startup Implementations: Newlib (ARM)
  3. Exploring Startup Implementations: OS X
  4. Exploring Startup Implementations: Custom Embedded System with ThreadX
  5. Abstracting a Generic Flow for Getting to main
  6. Implementing our Generic Startup Flow

Now that we have a high-level understanding of how our programs get to main, we can explore real-world implementations of program startup code.

Today's analysis focuses on Newlib. If you build embedded applications for ARM using the GNU arm-none-eabi toolchain, your program is linked with Newlib startup code by default. Newlib supports multiple architectures, but we will focus exclusively on the ARM startup path.

If you are interested in exploring Newlib startup routines on your own, you can download the Newlib source code or browse the source code online.

The boot flow is quite complicated, and it's easy to get mentally lost. You can refer to the Visual Summary throughout the article for a visual representation of the startup procedure and call stack.

Table of Contents:

  1. ARM Procedure Call Standard
  2. System Configuration
  3. Initial Exploration
    1. Boot Path
    2. _start Disassembly
  4. nRF52 Initial Boot
    1. Load from Flash to RAM
    2. Optional: Clear .bss
    3. SystemInit
    4. Call start
    5. IRQ Handlers
  5. nRF52 System Initialization
  6. Newlib ARM Startup
    1. crt0.s
      1. Stack Setup
      2. Initialize .bss
      3. Target-Specific Initialization
      4. argc and argv Initialization
      5. Call Global Constructors
    2. __libc_init_array
    3. __libc_fini_array
    4. Heap Limit and malloc
    5. atexit Family
      1. atexit
      2. __cxa_atexit
      3. __register_exitproc
      4. Automatic Registration of Destructors
    6. exit Family
      1. exit
      2. _exit
      3. __call_exitprocs
      4. _kill
  7. Visual Summary
  8. Startup Activity Checklist
  9. Further Reading

ARM Procedure Call Standard

Since we are going to look at ARM assembly, we will need to familiarize ourselves with the basics of the Procedure Call Standard for ARM Applications.

There are sixteen 32-bit registers and a status register (CPSR) in the ARM and Thumb instruction sets:

  • r0 (aka a1) is Argument register 1 and a result register
  • r1 (aka a2) is Argument register 2 and a result register
  • r2 (aka a3) is Argument register 3
  • r3 (aka a4) is Argument register 4
  • r4 (aka v1) is Variable register 1
  • r5 (aka v2) is Variable register 2
  • r6 (aka v3) is Variable register 3
  • r7 (aka v4) is Variable register 4
  • r8 (aka v5) is Variable register 5
  • r9 usage changes depending on the platform
  • r10 (aka v7) is Variable register 7
  • r11 (aka v7) is Variable register 8
  • r12 is the IP special purpose register (intra-procedure-call scratch register)
  • r2` is the SP special register (stack pointer)
  • r14 is the LR special register (link register)
  • r15 is the PC special register (program counter)

The standard says the following for the argument registers (r0-r3):

The first four registers r0-r3 (a1-a4) are used to pass argument values into a subroutine and to return a result value from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls).

We have multiple registers to hold the value of local variables:

Typically, the registers r4-r8, r10 and r11 (v1-v5, v7 and v8) are used to hold the values of a routine’s local variables. Of these, only v1-v4 can be used uniformly by the whole Thumb instruction set, but the AAPCS does not require that Thumb code only use those registers.

We must preserve specific registers when calling functions:

A subroutine must preserve the contents of the registers r4-r8, r10, r11 and SP (and r9 in PCS variants that designate r9 as v6)

ARM specifies that the stack pointer (SP) must always be aligned to a word boundary (i.e., sp % 4 == 0). For public interfaces, the stack must be aligned to a double-word boundary (i.e., sp % 8 == 0).

The least significant bit of a function address is an ARM/Thumb flag (1 == ARM, 0 == Thumb). This bit is set by the linker.

When we want to call a subroutine, we need to preserve the current function's persistent registers on the stack, store the return address in the LR register (so we know how to get back from our function), and change the PC to the subroutine address. ARM provides branching instructions which handle this process for us (e.g., bl, blx,bx`), although the process may still be performed manually.

Now, there are many details that we did not cover, but this basic overview provides enough details to understand some of the assembly that we will be analyzing. Particularly important to keep in mind: values put into r0-r3 represent arguments to functions, and values put into r4-r11 represent variables used in our current subroutine.

System Configuration

For this exploration, I used a Nordic nRF52840 Development Kit. The development kit has several examples provided by Nordic; I used the blinky program. I compiled and linked the program with the GNU ARM toolchain (version 8-2018-q4-major). The Nordic blinky program links against the Newlib libraries provided by the GNU ARM toolchain.

Because this is a Cortex-M processor, the program is compiled entirely in Thumb mode. We will discuss some aspects of the boot process which apply to Cortex-A processors that use ARM instructions.

Initial Exploration

Before we start blindly looking through the Newlib code base, we should do some initial exploration with our debugger as described in the last article.

To begin the investigation, I compiled the blinky example for the nRF52840 Development Kit (PCA10056 in the SDK parlance) in the "blank" configuration using the armgcc Makefile. I flashed the binary to the board with the nRF Connect Programmer,

First, let's start with a backtrace from main in an example program so we can see what code is run. Then we will look at the disassembly for the _start function that is provided by Newlib.

nRF52840 Boot

Our backtrace showed us that our journey begins in the Reset_Handler function in gcc_startup_nrf52840.S (found in the nRF SDK).

The file begins by providing for stack storage:

.section .stack
#if defined(__STARTUP_CONFIG)
    .align __STARTUP_CONFIG_STACK_ALIGNEMENT
    .equ    Stack_Size, __STARTUP_CONFIG_STACK_SIZE
#elif defined(__STACK_SIZE)
    .align 3
    .equ    Stack_Size, __STACK_SIZE
#else
    .align 3
    .equ    Stack_Size, 8192
#endif
    .globl __StackTop
    .globl __StackLimit
__StackLimit:
    .space Stack_Size
    .size __StackLimit, . - __StackLimit
__StackTop:
    .size __StackTop, . - __StackTop

There are also provisions for heap storage:

.section .heap
    .align 3
#if defined(__STARTUP_CONFIG)
    .equ Heap_Size, __STARTUP_CONFIG_HEAP_SIZE
#elif defined(__HEAP_SIZE)
    .equ Heap_Size, __HEAP_SIZE
#else
    .equ Heap_Size, 8192
#endif
    .globl __HeapBase
    .globl __HeapLimit
__HeapBase:
    .if Heap_Size
    .space Heap_Size
    .endif
    .size __HeapBase, . - __HeapBase
__HeapLimit:
    .size __HeapLimit, . - __HeapLimit

This file also contains a declaration of all interrupt vectors and their associated handlers. A small sample is shown:

.section .isr_vector
    .align 2
    .globl __isr_vector
__isr_vector:
    .long   __StackTop                  /* Top of Stack */
    .long   Reset_Handler
    .long   NMI_Handler
    .long   HardFault_Handler
    .long   MemoryManagement_Handler
    .long   BusFault_Handler
    .long   UsageFault_Handler

    /// ...

    .size __isr_vector, . - __isr_vector

We then find the declaration of Reset_Handler:

.text
    .thumb
    .thumb_func
    .align 1
    .globl Reset_Handler
    .type Reset_Handler, %function
Reset_Handler:

Load from Flash to RAM

First, the reset handler copies data from flash to RAM.

The data is copied from the address of the __etext symbol, which represents the end of the .text section in flash storage. The data is copied to the address indicated by the __data_start__ symbol, and the number of bytes copied is calculated by subtracting the __data_start__ address from __bss_start__, which indicates the beginning of the next section. As the nRF startup code explains, __bss_start__ is used so users can insert their own initialized data section before the .bss section. Using this logic, it will be copied to RAM without any changes from the user.

ldr r1, =__etext
    ldr r2, =__data_start__
    ldr r3, =__bss_start__

    subs r3, r3, r2
    ble .L_loop1_done

.L_loop1:
    subs r3, r3, #4
    ldr r0, [r1,r3]
    str r0, [r2,r3]
    bgt .L_loop1

Optional: Clear .bss

Once the .data section contents are copied to RAM, there is an optional step for initializing the .bss section contents to 0. In our case, this code is not compiled. Newlib handles .bss initialization.

.L_loop1_done:
/* This part of work usually is done in C library startup code. Otherwise,
 * define __STARTUP_CLEAR_BSS to enable it in this startup. This section
 * clears the RAM where BSS data is located.
 *
 * The BSS section is specified by following symbols
 *    __bss_start__: start of the BSS section.
 *    __bss_end__: end of the BSS section.
 *
 * All addresses must be aligned to 4 bytes boundary.
 */
#ifdef __STARTUP_CLEAR_BSS
    ldr r1, =__bss_start__
    ldr r2, =__bss_end__

    movs r0, 0

    subs r2, r2, r1
    ble .L_loop3_done

.L_loop3:
    subs r2, r2, #4
    str r0, [r1, r2]
    bgt .L_loop3

.L_loop3_done:
#endif /* __STARTUP_CLEAR_BSS */

SystemInit

Before invoking the C runtime startup routine, a SystemInit function is called. This function, which we will look at next, is responsible for initializing the processor and applying behavioral fixes for relevant errata.

bl SystemInit

Call _start

Once the processor is initialized, we call the _start function to initialize the C runtime. Note that the nRF startup code allows you to define a custom entry point with a compiler definition.

/* Call _start function provided by libraries.  If those libraries 
 * are not accessible, define __START as your entry point. */
#ifndef __START
#define __START _start
#endif
    bl __START

IRQ Handlers

The gcc_startup_nrf52840.S also contains dummy exception handler function definitions. For example:

.weak   NMI_Handler
    .type   NMI_Handler, %function
NMI_Handler:
    b       .
    .size   NMI_Handler, . - NMI_Handler

    .weak   HardFault_Handler
    .type   HardFault_Handler, %function
HardFault_Handler:
    b       .
    .size   HardFault_Handler, . - HardFault_Handler

A default handler is declared, which performs an infinite loop:

.globl  Default_Handler
    .type   Default_Handler, %function
Default_Handler:
    b       .
    .size   Default_Handler, . - Default_Handler

All other IRQ handlers are mapped to this default handler. Users are able to overwrite these handlers with their own implementations as needed.

.macro  IRQ handler
.weak   \handler
.set    \handler, Default_Handler
.endm

IRQ  POWER_CLOCK_IRQHandler
IRQ  RADIO_IRQHandler
IRQ  UARTE0_UART0_IRQHandler

/// ...

After the IRQ handlers are supplied, the file ends.

.end

nRF52 System Initialization

The SystemInit function is implemented in system_nrf52840.c (found in the nRF SDK). For a normal application, this file would be modified to suit the platform's requirements. We'll look at the default implementation for our processor.

First, SWO trace functionality is enabled in the processor. If ENABLE_SWO is not defined, the pin is left as normal GPIO.

#if defined (ENABLE_SWO)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    NRF_CLOCK->TRACECONFIG |= CLOCK_TRACECONFIG_TRACEMUX_Serial << CLOCK_TRACECONFIG_TRACEMUX_Pos;
    NRF_P1->PIN_CNF[0] = (GPIO_PIN_CNF_DRIVE_H0H1 << GPIO_PIN_CNF_DRIVE_Pos) | (GPIO_PIN_CNF_INPUT_Connect << GPIO_PIN_CNF_INPUT_Pos) | (GPIO_PIN_CNF_DIR_Output << GPIO_PIN_CNF_DIR_Pos);
#endif

Next, Trace functionality is enabled in the processor. If ENABLE_TRACE is not defined, the pins are left as normal GPIO.

#if defined (ENABLE_TRACE)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    NRF_CLOCK->TRACECONFIG |= CLOCK_TRACECONFIG_TRACEMUX_Parallel << CLOCK_TRACECONFIG_TRACEMUX_Pos;
    NRF_P0->PIN_CNF[7]  = (GPIO_PIN_CNF_DRIVE_H0H1 << GPIO_PIN_CNF_DRIVE_Pos) | (GPIO_PIN_CNF_INPUT_Connect << GPIO_PIN_CNF_INPUT_Pos) | (GPIO_PIN_CNF_DIR_Output << GPIO_PIN_CNF_DIR_Pos);

// ... more pin configurations in the actual implementation

#endif

Following debug configuration, the system checks for a variety of errata conditions and applies fixes as necessary. Here are a few examples:

/* Workaround for Errata 98 "NFCT: Not able to communicate with the peer"  */
if (errata_98()){
    *(volatile uint32_t *)0x4000568Cul = 0x00038148ul;
}

/* Workaround for Errata 103 "CCM: Wrong reset value of CCM MAXPACKETSIZE"  */
if (errata_103()){
    NRF_CCM->MAXPACKETSIZE = 0xFBul;
}

Following the errata section, the FPU is initialized if the program has been compiled with floating point support. The __FPU_USED macro is supplied by the compiler.

#if (__FPU_USED == 1)
    SCB->CPACR |= (3UL << 20) | (3UL << 22);
    __DSB();
    __ISB();
#endif

If NFC is not used for an nRF52 platform, the associated NFC pins are configured as normal GPIO.

#if defined (CONFIG_NFCT_PINS_AS_GPIOS)
    if ((NRF_UICR->NFCPINS & UICR_NFCPINS_PROTECT_Msk) == (UICR_NFCPINS_PROTECT_NFC << UICR_NFCPINS_PROTECT_Pos)){
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Wen << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->NFCPINS &= ~UICR_NFCPINS_PROTECT_Msk;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Ren << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NVIC_SystemReset();
    }
#endif

The nRF allows a GPIO to be configured as a reset pin. If CONFIG_GPIO_AS_PINRESET is defined, a dedicated GPIO will be configured to act as a reset pin.

#if defined (CONFIG_GPIO_AS_PINRESET)
    if (((NRF_UICR->PSELRESET[0] & UICR_PSELRESET_CONNECT_Msk) != (UICR_PSELRESET_CONNECT_Connected << UICR_PSELRESET_CONNECT_Pos)) ||
        ((NRF_UICR->PSELRESET[1] & UICR_PSELRESET_CONNECT_Msk) != (UICR_PSELRESET_CONNECT_Connected << UICR_PSELRESET_CONNECT_Pos))){
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Wen << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->PSELRESET[0] = 18;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->PSELRESET[1] = 18;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Ren << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NVIC_SystemReset();
    }
#endif

Finally, the system clock is initialized:

SystemCoreClockUpdate();

Newlib ARM Startup

After data has been relocated and the processor properly initialized, the reset handler calls the _start function. For our GCC ARM application, this function is supplied by Newlib.

The Newlib project is divided into two major parts: newlib and libgloss. The newlib portion is an implementation of libc and libm. The libgloss portion contains platform-specific code, such as startup files, board support packages, and I/O support for the C library.

When exploring the Newlib code base on your own, it is important to note the distinction between libgloss and newlib. The libgloss division happened after the inception of the Newlib project. Many of the same files are found in the newlib folder and the libgloss folder. For platform-specific code, you should prefer the libgloss implementations. These are newer, and the older implementations remain in the newlib folder for backwards compatibility with older targets.

Visual Summary


nRF52840 Boot

Our backtrace showed us that our journey begins in the Reset_Handler function in gcc_startup_nrf52840.S (found in the nRF SDK).

The file begins by providing for stack storage:

.section .stack
#if defined(__STARTUP_CONFIG)
    .align __STARTUP_CONFIG_STACK_ALIGNEMENT
    .equ    Stack_Size, __STARTUP_CONFIG_STACK_SIZE
#elif defined(__STACK_SIZE)
    .align 3
    .equ    Stack_Size, __STACK_SIZE
#else
    .align 3
    .equ    Stack_Size, 8192
#endif
    .globl __StackTop
    .globl __StackLimit
__StackLimit:
    .space Stack_Size
    .size __StackLimit, . - __StackLimit
__StackTop:
    .size __StackTop, . - __StackTop

There are also provisions for heap storage:

.section .heap
    .align 3
#if defined(__STARTUP_CONFIG)
    .equ Heap_Size, __STARTUP_CONFIG_HEAP_SIZE
#elif defined(__HEAP_SIZE)
    .equ Heap_Size, __HEAP_SIZE
#else
    .equ Heap_Size, 8192
#endif
    .globl __HeapBase
    .globl __HeapLimit
__HeapBase:
    .if Heap_Size
    .space Heap_Size
    .endif
    .size __HeapBase, . - __HeapBase
__HeapLimit:
    .size __HeapLimit, . - __HeapLimit

This file also contains a declaration of all interrupt vectors and their associated handlers. A small sample is shown:

.section .isr_vector
    .align 2
    .globl __isr_vector
__isr_vector:
    .long   __StackTop                  /* Top of Stack */
    .long   Reset_Handler
    .long   NMI_Handler
    .long   HardFault_Handler
    .long   MemoryManagement_Handler
    .long   BusFault_Handler
    .long   UsageFault_Handler

    /// ...

    .size __isr_vector, . - __isr_vector

We then find the declaration of Reset_Handler:

.text
    .thumb
    .thumb_func
    .align 1
    .globl Reset_Handler
    .type Reset_Handler, %function
Reset_Handler:

Load from Flash to RAM

First, the reset handler copies data from flash to RAM.

The data is copied from the address of the __etext symbol, which represents the end of the .text section in flash storage. The data is copied to the address indicated by the __data_start__ symbol, and the number of bytes copied is calculated by subtracting the __data_start__ address from __bss_start__, which indicates the beginning of the next section. As the nRF startup code explains, __bss_start__ is used so users can insert their own initialized data section before the .bss section. Using this logic, it will be copied to RAM without any changes from the user.

ldr r1, =__etext
    ldr r2, =__data_start__
    ldr r3, =__bss_start__

    subs r3, r3, r2
    ble .L_loop1_done

.L_loop1:
    subs r3, r3, #4
    ldr r0, [r1,r3]
    str r0, [r2,r3]
    bgt .L_loop1

Optional: Clear .bss

Once the .data section contents are copied to RAM, there is an optional step for initializing the .bss section contents to 0. In our case, this code is not compiled. Newlib handles .bss initialization.

.L_loop1_done:
/* This part of work usually is done in C library startup code. Otherwise,
 * define __STARTUP_CLEAR_BSS to enable it in this startup. This section
 * clears the RAM where BSS data is located.
 *
 * The BSS section is specified by following symbols
 *    __bss_start__: start of the BSS section.
 *    __bss_end__: end of the BSS section.
 *
 * All addresses must be aligned to 4 bytes boundary.
 */
#ifdef __STARTUP_CLEAR_BSS
    ldr r1, =__bss_start__
    ldr r2, =__bss_end__

    movs r0, 0

    subs r2, r2, r1
    ble .L_loop3_done

.L_loop3:
    subs r2, r2, #4
    str r0, [r1, r2]
    bgt .L_loop3

.L_loop3_done:
#endif /* __STARTUP_CLEAR_BSS */

SystemInit

Before invoking the C runtime startup routine, a SystemInit function is called. This function, which we will look at next, is responsible for initializing the processor and applying behavioral fixes for relevant errata.

bl SystemInit

Call _start

Once the processor is initialized, we call the _start function to initialize the C runtime. Note that the nRF startup code allows you to define a custom entry point with a compiler definition.

/* Call _start function provided by libraries.  If those libraries 
 * are not accessible, define __START as your entry point. */
#ifndef __START
#define __START _start
#endif
    bl __START

IRQ Handlers

The gcc_startup_nrf52840.S also contains dummy exception handler function definitions. For example:

.weak   NMI_Handler
    .type   NMI_Handler, %function
NMI_Handler:
    b       .
    .size   NMI_Handler, . - NMI_Handler

    .weak   HardFault_Handler
    .type   HardFault_Handler, %function
HardFault_Handler:
    b       .
    .size   HardFault_Handler, . - HardFault_Handler

A default handler is declared, which performs an infinite loop:

.globl  Default_Handler
    .type   Default_Handler, %function
Default_Handler:
    b       .
    .size   Default_Handler, . - Default_Handler

All other IRQ handlers are mapped to this default handler. Users are able to overwrite these handlers with their own implementations as needed.

.macro  IRQ handler
.weak   \handler
.set    \handler, Default_Handler
.endm

IRQ  POWER_CLOCK_IRQHandler
IRQ  RADIO_IRQHandler
IRQ  UARTE0_UART0_IRQHandler

/// ...

After the IRQ handlers are supplied, the file ends.

.end

nRF52 System Initialization

The SystemInit function is implemented in system_nrf52840.c (found in the nRF SDK). For a normal application, this file would be modified to suit the platform's requirements. We'll look at the default implementation for our processor.

First, SWO trace functionality is enabled in the processor. If ENABLE_SWO is not defined, the pin is left as normal GPIO.

#if defined (ENABLE_SWO)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    NRF_CLOCK->TRACECONFIG |= CLOCK_TRACECONFIG_TRACEMUX_Serial << CLOCK_TRACECONFIG_TRACEMUX_Pos;
    NRF_P1->PIN_CNF[0] = (GPIO_PIN_CNF_DRIVE_H0H1 << GPIO_PIN_CNF_DRIVE_Pos) | (GPIO_PIN_CNF_INPUT_Connect << GPIO_PIN_CNF_INPUT_Pos) | (GPIO_PIN_CNF_DIR_Output << GPIO_PIN_CNF_DIR_Pos);
#endif

Next, Trace functionality is enabled in the processor. If ENABLE_TRACE is not defined, the pins are left as normal GPIO.

#if defined (ENABLE_TRACE)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    NRF_CLOCK->TRACECONFIG |= CLOCK_TRACECONFIG_TRACEMUX_Parallel << CLOCK_TRACECONFIG_TRACEMUX_Pos;
    NRF_P0->PIN_CNF[7]  = (GPIO_PIN_CNF_DRIVE_H0H1 << GPIO_PIN_CNF_DRIVE_Pos) | (GPIO_PIN_CNF_INPUT_Connect << GPIO_PIN_CNF_INPUT_Pos) | (GPIO_PIN_CNF_DIR_Output << GPIO_PIN_CNF_DIR_Pos);

// ... more pin configurations in the actual implementation

#endif

Following debug configuration, the system checks for a variety of errata conditions and applies fixes as necessary. Here are a few examples:

/* Workaround for Errata 98 "NFCT: Not able to communicate with the peer"  */
if (errata_98()){
    *(volatile uint32_t *)0x4000568Cul = 0x00038148ul;
}

/* Workaround for Errata 103 "CCM: Wrong reset value of CCM MAXPACKETSIZE"  */
if (errata_103()){
    NRF_CCM->MAXPACKETSIZE = 0xFBul;
}

Following the errata section, the FPU is initialized if the program has been compiled with floating point support. The __FPU_USED macro is supplied by the compiler.

#if (__FPU_USED == 1)
    SCB->CPACR |= (3UL << 20) | (3UL << 22);
    __DSB();
    __ISB();
#endif

If NFC is not used for an nRF52 platform, the associated NFC pins are configured as normal GPIO.

#if defined (CONFIG_NFCT_PINS_AS_GPIOS)
    if ((NRF_UICR->NFCPINS & UICR_NFCPINS_PROTECT_Msk) == (UICR_NFCPINS_PROTECT_NFC << UICR_NFCPINS_PROTECT_Pos)){
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Wen << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->NFCPINS &= ~UICR_NFCPINS_PROTECT_Msk;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Ren << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NVIC_SystemReset();
    }
#endif

The nRF allows a GPIO to be configured as a reset pin. If CONFIG_GPIO_AS_PINRESET is defined, a dedicated GPIO will be configured to act as a reset pin.

#if defined (CONFIG_GPIO_AS_PINRESET)
    if (((NRF_UICR->PSELRESET[0] & UICR_PSELRESET_CONNECT_Msk) != (UICR_PSELRESET_CONNECT_Connected << UICR_PSELRESET_CONNECT_Pos)) ||
        ((NRF_UICR->PSELRESET[1] & UICR_PSELRESET_CONNECT_Msk) != (UICR_PSELRESET_CONNECT_Connected << UICR_PSELRESET_CONNECT_Pos))){
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Wen << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->PSELRESET[0] = 18;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->PSELRESET[1] = 18;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Ren << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NVIC_SystemReset();
    }
#endif

Finally, the system clock is initialized:

SystemCoreClockUpdate();

Updated: 20190909

For most programmers, a C or C++ program's life begins at the main function. They are blissfully unaware of the hidden steps that happen between invoking a program and executing main. Depending on the program and the compiler, there are all kinds of interesting functions that get run before main, automatically inserted by the compiler and linker and invisible to casual observers.

Unfortunately for programmers who are curious about the program startup process, the literature on what happens before main is quite sparse.

Embedded Artistry has been hard at working creating a C++ embedded framework. The final piece of the puzzle was implementing program startup code. To aid in the design of our framework's boot process, I performed an exploratory survey of existing program startup implementations. My goal is to identify a general program startup model. I also want to provide a more comprehensive look into how our programs get to main.

In this six-part series, we will be investigating what it takes to get to main:

  1. A General Overview of What Happens Before main()
  2. Exploring Startup Implementations: Newlib (ARM)
  3. Exploring Startup Implementations: OS X
  4. Exploring Startup Implementations: Custom Embedded System with ThreadX
  5. Abstracting a Generic Flow for Getting to main
  6. Implementing our Generic Startup Flow

Now that we have a high-level understanding of how our programs get to main, we can explore real-world implementations of program startup code.

Today's analysis focuses on Newlib. If you build embedded applications for ARM using the GNU arm-none-eabi toolchain, your program is linked with Newlib startup code by default. Newlib supports multiple architectures, but we will focus exclusively on the ARM startup path.

If you are interested in exploring Newlib startup routines on your own, you can download the Newlib source code or browse the source code online.

The boot flow is quite complicated, and it's easy to get mentally lost. You can refer to the Visual Summary throughout the article for a visual representation of the startup procedure and call stack.

Table of Contents:

  1. ARM Procedure Call Standard
  2. System Configuration
  3. Initial Exploration
    1. Boot Path
    2. _start Disassembly
  4. nRF52 Initial Boot
    1. Load from Flash to RAM
    2. Optional: Clear .bss
    3. SystemInit
    4. Call start
    5. IRQ Handlers
  5. nRF52 System Initialization
  6. Newlib ARM Startup
    1. crt0.s
      1. Stack Setup
      2. Initialize .bss
      3. Target-Specific Initialization
      4. argc and argv Initialization
      5. Call Global Constructors
    2. __libc_init_array
    3. __libc_fini_array
    4. Heap Limit and malloc
    5. atexit Family
      1. atexit
      2. __cxa_atexit
      3. __register_exitproc
      4. Automatic Registration of Destructors
    6. exit Family
      1. exit
      2. _exit
      3. __call_exitprocs
      4. _kill
  7. Visual Summary
  8. Startup Activity Checklist
  9. Further Reading

ARM Procedure Call Standard

Since we are going to look at ARM assembly, we will need to familiarize ourselves with the basics of the Procedure Call Standard for ARM Applications.

There are sixteen 32-bit registers and a status register (CPSR) in the ARM and Thumb instruction sets:

  • r0 (aka a1) is Argument register 1 and a result register
  • r1 (aka a2) is Argument register 2 and a result register
  • r2 (aka a3) is Argument register 3
  • r3 (aka a4) is Argument register 4
  • r4 (aka v1) is Variable register 1
  • r5 (aka v2) is Variable register 2
  • r6 (aka v3) is Variable register 3
  • r7 (aka v4) is Variable register 4
  • r8 (aka v5) is Variable register 5
  • r9 usage changes depending on the platform
  • r10 (aka v7) is Variable register 7
  • r11 (aka v7) is Variable register 8
  • r12 is the IP special purpose register (intra-procedure-call scratch register)
  • r2` is the SP special register (stack pointer)
  • r14 is the LR special register (link register)
  • r15 is the PC special register (program counter)

The standard says the following for the argument registers (r0-r3):

The first four registers r0-r3 (a1-a4) are used to pass argument values into a subroutine and to return a result value from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls).

We have multiple registers to hold the value of local variables:

Typically, the registers r4-r8, r10 and r11 (v1-v5, v7 and v8) are used to hold the values of a routine’s local variables. Of these, only v1-v4 can be used uniformly by the whole Thumb instruction set, but the AAPCS does not require that Thumb code only use those registers.

We must preserve specific registers when calling functions:

A subroutine must preserve the contents of the registers r4-r8, r10, r11 and SP (and r9 in PCS variants that designate r9 as v6)

ARM specifies that the stack pointer (SP) must always be aligned to a word boundary (i.e., sp % 4 == 0). For public interfaces, the stack must be aligned to a double-word boundary (i.e., sp % 8 == 0).

The least significant bit of a function address is an ARM/Thumb flag (1 == ARM, 0 == Thumb). This bit is set by the linker.

When we want to call a subroutine, we need to preserve the current function's persistent registers on the stack, store the return address in the LR register (so we know how to get back from our function), and change the PC to the subroutine address. ARM provides branching instructions which handle this process for us (e.g., bl, blx,bx`), although the process may still be performed manually.

Now, there are many details that we did not cover, but this basic overview provides enough details to understand some of the assembly that we will be analyzing. Particularly important to keep in mind: values put into r0-r3 represent arguments to functions, and values put into r4-r11 represent variables used in our current subroutine.

System Configuration

For this exploration, I used a Nordic nRF52840 Development Kit. The development kit has several examples provided by Nordic; I used the blinky program. I compiled and linked the program with the GNU ARM toolchain (version 8-2018-q4-major). The Nordic blinky program links against the Newlib libraries provided by the GNU ARM toolchain.

Because this is a Cortex-M processor, the program is compiled entirely in Thumb mode. We will discuss some aspects of the boot process which apply to Cortex-A processors that use ARM instructions.

Initial Exploration

Before we start blindly looking through the Newlib code base, we should do some initial exploration with our debugger as described in the last article.

To begin the investigation, I compiled the blinky example for the nRF52840 Development Kit (PCA10056 in the SDK parlance) in the "blank" configuration using the armgcc Makefile. I flashed the binary to the board with the nRF Connect Programmer,

First, let's start with a backtrace from main in an example program so we can see what code is run. Then we will look at the disassembly for the _start function that is provided by Newlib.

nRF52840 Boot

Our backtrace showed us that our journey begins in the Reset_Handler function in gcc_startup_nrf52840.S (found in the nRF SDK).

The file begins by providing for stack storage:

.section .stack
#if defined(__STARTUP_CONFIG)
    .align __STARTUP_CONFIG_STACK_ALIGNEMENT
    .equ    Stack_Size, __STARTUP_CONFIG_STACK_SIZE
#elif defined(__STACK_SIZE)
    .align 3
    .equ    Stack_Size, __STACK_SIZE
#else
    .align 3
    .equ    Stack_Size, 8192
#endif
    .globl __StackTop
    .globl __StackLimit
__StackLimit:
    .space Stack_Size
    .size __StackLimit, . - __StackLimit
__StackTop:
    .size __StackTop, . - __StackTop

There are also provisions for heap storage:

.section .heap
    .align 3
#if defined(__STARTUP_CONFIG)
    .equ Heap_Size, __STARTUP_CONFIG_HEAP_SIZE
#elif defined(__HEAP_SIZE)
    .equ Heap_Size, __HEAP_SIZE
#else
    .equ Heap_Size, 8192
#endif
    .globl __HeapBase
    .globl __HeapLimit
__HeapBase:
    .if Heap_Size
    .space Heap_Size
    .endif
    .size __HeapBase, . - __HeapBase
__HeapLimit:
    .size __HeapLimit, . - __HeapLimit

This file also contains a declaration of all interrupt vectors and their associated handlers. A small sample is shown:

.section .isr_vector
    .align 2
    .globl __isr_vector
__isr_vector:
    .long   __StackTop                  /* Top of Stack */
    .long   Reset_Handler
    .long   NMI_Handler
    .long   HardFault_Handler
    .long   MemoryManagement_Handler
    .long   BusFault_Handler
    .long   UsageFault_Handler

    /// ...

    .size __isr_vector, . - __isr_vector

We then find the declaration of Reset_Handler:

.text
    .thumb
    .thumb_func
    .align 1
    .globl Reset_Handler
    .type Reset_Handler, %function
Reset_Handler:

Load from Flash to RAM

First, the reset handler copies data from flash to RAM.

The data is copied from the address of the __etext symbol, which represents the end of the .text section in flash storage. The data is copied to the address indicated by the __data_start__ symbol, and the number of bytes copied is calculated by subtracting the __data_start__ address from __bss_start__, which indicates the beginning of the next section. As the nRF startup code explains, __bss_start__ is used so users can insert their own initialized data section before the .bss section. Using this logic, it will be copied to RAM without any changes from the user.

ldr r1, =__etext
    ldr r2, =__data_start__
    ldr r3, =__bss_start__

    subs r3, r3, r2
    ble .L_loop1_done

.L_loop1:
    subs r3, r3, #4
    ldr r0, [r1,r3]
    str r0, [r2,r3]
    bgt .L_loop1

Optional: Clear .bss

Once the .data section contents are copied to RAM, there is an optional step for initializing the .bss section contents to 0. In our case, this code is not compiled. Newlib handles .bss initialization.

.L_loop1_done:
/* This part of work usually is done in C library startup code. Otherwise,
 * define __STARTUP_CLEAR_BSS to enable it in this startup. This section
 * clears the RAM where BSS data is located.
 *
 * The BSS section is specified by following symbols
 *    __bss_start__: start of the BSS section.
 *    __bss_end__: end of the BSS section.
 *
 * All addresses must be aligned to 4 bytes boundary.
 */
#ifdef __STARTUP_CLEAR_BSS
    ldr r1, =__bss_start__
    ldr r2, =__bss_end__

    movs r0, 0

    subs r2, r2, r1
    ble .L_loop3_done

.L_loop3:
    subs r2, r2, #4
    str r0, [r1, r2]
    bgt .L_loop3

.L_loop3_done:
#endif /* __STARTUP_CLEAR_BSS */

SystemInit

Before invoking the C runtime startup routine, a SystemInit function is called. This function, which we will look at next, is responsible for initializing the processor and applying behavioral fixes for relevant errata.

bl SystemInit

Call _start

Once the processor is initialized, we call the _start function to initialize the C runtime. Note that the nRF startup code allows you to define a custom entry point with a compiler definition.

/* Call _start function provided by libraries.  If those libraries 
 * are not accessible, define __START as your entry point. */
#ifndef __START
#define __START _start
#endif
    bl __START

IRQ Handlers

The gcc_startup_nrf52840.S also contains dummy exception handler function definitions. For example:

.weak   NMI_Handler
    .type   NMI_Handler, %function
NMI_Handler:
    b       .
    .size   NMI_Handler, . - NMI_Handler

    .weak   HardFault_Handler
    .type   HardFault_Handler, %function
HardFault_Handler:
    b       .
    .size   HardFault_Handler, . - HardFault_Handler

A default handler is declared, which performs an infinite loop:

.globl  Default_Handler
    .type   Default_Handler, %function
Default_Handler:
    b       .
    .size   Default_Handler, . - Default_Handler

All other IRQ handlers are mapped to this default handler. Users are able to overwrite these handlers with their own implementations as needed.

.macro  IRQ handler
.weak   \handler
.set    \handler, Default_Handler
.endm

IRQ  POWER_CLOCK_IRQHandler
IRQ  RADIO_IRQHandler
IRQ  UARTE0_UART0_IRQHandler

/// ...

After the IRQ handlers are supplied, the file ends.

.end

nRF52 System Initialization

The SystemInit function is implemented in system_nrf52840.c (found in the nRF SDK). For a normal application, this file would be modified to suit the platform's requirements. We'll look at the default implementation for our processor.

First, SWO trace functionality is enabled in the processor. If ENABLE_SWO is not defined, the pin is left as normal GPIO.

#if defined (ENABLE_SWO)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    NRF_CLOCK->TRACECONFIG |= CLOCK_TRACECONFIG_TRACEMUX_Serial << CLOCK_TRACECONFIG_TRACEMUX_Pos;
    NRF_P1->PIN_CNF[0] = (GPIO_PIN_CNF_DRIVE_H0H1 << GPIO_PIN_CNF_DRIVE_Pos) | (GPIO_PIN_CNF_INPUT_Connect << GPIO_PIN_CNF_INPUT_Pos) | (GPIO_PIN_CNF_DIR_Output << GPIO_PIN_CNF_DIR_Pos);
#endif

Next, Trace functionality is enabled in the processor. If ENABLE_TRACE is not defined, the pins are left as normal GPIO.

#if defined (ENABLE_TRACE)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    NRF_CLOCK->TRACECONFIG |= CLOCK_TRACECONFIG_TRACEMUX_Parallel << CLOCK_TRACECONFIG_TRACEMUX_Pos;
    NRF_P0->PIN_CNF[7]  = (GPIO_PIN_CNF_DRIVE_H0H1 << GPIO_PIN_CNF_DRIVE_Pos) | (GPIO_PIN_CNF_INPUT_Connect << GPIO_PIN_CNF_INPUT_Pos) | (GPIO_PIN_CNF_DIR_Output << GPIO_PIN_CNF_DIR_Pos);

// ... more pin configurations in the actual implementation

#endif

Following debug configuration, the system checks for a variety of errata conditions and applies fixes as necessary. Here are a few examples:

/* Workaround for Errata 98 "NFCT: Not able to communicate with the peer"  */
if (errata_98()){
    *(volatile uint32_t *)0x4000568Cul = 0x00038148ul;
}

/* Workaround for Errata 103 "CCM: Wrong reset value of CCM MAXPACKETSIZE"  */
if (errata_103()){
    NRF_CCM->MAXPACKETSIZE = 0xFBul;
}

Following the errata section, the FPU is initialized if the program has been compiled with floating point support. The __FPU_USED macro is supplied by the compiler.

#if (__FPU_USED == 1)
    SCB->CPACR |= (3UL << 20) | (3UL << 22);
    __DSB();
    __ISB();
#endif

If NFC is not used for an nRF52 platform, the associated NFC pins are configured as normal GPIO.

#if defined (CONFIG_NFCT_PINS_AS_GPIOS)
    if ((NRF_UICR->NFCPINS & UICR_NFCPINS_PROTECT_Msk) == (UICR_NFCPINS_PROTECT_NFC << UICR_NFCPINS_PROTECT_Pos)){
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Wen << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->NFCPINS &= ~UICR_NFCPINS_PROTECT_Msk;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Ren << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NVIC_SystemReset();
    }
#endif

The nRF allows a GPIO to be configured as a reset pin. If CONFIG_GPIO_AS_PINRESET is defined, a dedicated GPIO will be configured to act as a reset pin.

#if defined (CONFIG_GPIO_AS_PINRESET)
    if (((NRF_UICR->PSELRESET[0] & UICR_PSELRESET_CONNECT_Msk) != (UICR_PSELRESET_CONNECT_Connected << UICR_PSELRESET_CONNECT_Pos)) ||
        ((NRF_UICR->PSELRESET[1] & UICR_PSELRESET_CONNECT_Msk) != (UICR_PSELRESET_CONNECT_Connected << UICR_PSELRESET_CONNECT_Pos))){
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Wen << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->PSELRESET[0] = 18;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->PSELRESET[1] = 18;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Ren << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NVIC_SystemReset();
    }
#endif

Finally, the system clock is initialized:

SystemCoreClockUpdate();

Newlib ARM Startup

After data has been relocated and the processor properly initialized, the reset handler calls the _start function. For our GCC ARM application, this function is supplied by Newlib.

The Newlib project is divided into two major parts: newlib and libgloss. The newlib portion is an implementation of libc and libm. The libgloss portion contains platform-specific code, such as startup files, board support packages, and I/O support for the C library.

When exploring the Newlib code base on your own, it is important to note the distinction between libgloss and newlib. The libgloss division happened after the inception of the Newlib project. Many of the same files are found in the newlib folder and the libgloss folder. For platform-specific code, you should prefer the libgloss implementations. These are newer, and the older implementations remain in the newlib folder for backwards compatibility with older targets.

Visual Summary


Visual Summary


Startup Activity Checklist

In the first article of this series, we reviewed a broad range of startup activities that occur before main is called.

Here is a checklist of actions that were observed in the Newlib ARM program startup procedures:

  • [x] Early low-level initialization of the processor/hardware
  • [x] Stack initialization
  • [x] Frame pointer initialization
  • [x] C/C++ runtime setup
    • [x] Handle relocations (some sections are copied from flash to RAM)
    • [x] Initialize .bss
    • [x] Call global constructors
    • [x] Prepare argc, argv (set to 0)
    • [ ] Prepare environment variables
    • [x] Heap initialization
    • [ ] stdio initialization
    • [ ] Initialize exception support
    • [x] Register destructors and other exit-time functionality
  • [ ] System scaffolding setup
    • [ ] Threading support
    • [ ] Thread local storage
    • [ ] Buffer overrun detection
    • [ ] Run-time error checks
    • [ ] Locale settings
    • [ ] Math error handling
    • [ ] Math precision
  • [x] Jump to main
  • [x] Exit after main

Further Reading

Change Log

  • 20190909:
    • Added links to a great Matt Godbolt talk
    • Added links to Memfault's "Zero to Main()" series

Related Articles