Programmers: Let's Study Source Code Classics

Updated: 20190729

Contemporary programmers are lucky: we live in a world where historical and influential program source code is available for us to review. However, most programmers only learn and study the programs they have worked on themselves. We rarely take the time to study historical works, and programming courses don’t typically spend any time on the subject.

We believe that software developers should review influential source code. This is similar to architects studying influential building designs (and critiques of those designs). Rather than repeating the same mistakes over-and-over, we should study the great works that preceded us and learn from and build upon their lessons.

Ideally, we would study great source code along with a commentary or critique which provides us information about the project’s context, successes, and failures. Such commentaries are rare, but here are a few excellent starting points:

You can also find a program you've used in the past and review the source code. It's important to start with a program you are familiar with, so you can anchor the functional behavior to the source code. Here are resources you can use to find and browse historical source code:

Change Log

  • 20190729:
    • Added additional Apollo 11 guidance computer links
  • 20190627:
    • Added links to Wolfenstein and DOOM source code + reviews

Exploring Startup Implementations: Newlib (ARM)

Updated: 20190909

For most programmers, a C or C++ program's life begins at the main function. They are blissfully unaware of the hidden steps that happen between invoking a program and executing main. Depending on the program and the compiler, there are all kinds of interesting functions that get run before main, automatically inserted by the compiler and linker and invisible to casual observers.

Unfortunately for programmers who are curious about the program startup process, the literature on what happens before main is quite sparse.

Embedded Artistry has been hard at working creating a C++ embedded framework. The final piece of the puzzle was implementing program startup code. To aid in the design of our framework's boot process, I performed an exploratory survey of existing program startup implementations. My goal is to identify a general program startup model. I also want to provide a more comprehensive look into how our programs get to main.

In this six-part series, we will be investigating what it takes to get to main:

  1. A General Overview of What Happens Before main()
  2. Exploring Startup Implementations: Newlib (ARM)
  3. Exploring Startup Implementations: OS X
  4. Exploring Startup Implementations: Custom Embedded System with ThreadX
  5. Abstracting a Generic Flow for Getting to main
  6. Implementing our Generic Startup Flow

Now that we have a high-level understanding of how our programs get to main, we can explore real-world implementations of program startup code.

Today's analysis focuses on Newlib. If you build embedded applications for ARM using the GNU arm-none-eabi toolchain, your program is linked with Newlib startup code by default. Newlib supports multiple architectures, but we will focus exclusively on the ARM startup path.

If you are interested in exploring Newlib startup routines on your own, you can download the Newlib source code or browse the source code online.

The boot flow is quite complicated, and it's easy to get mentally lost. You can refer to the Visual Summary throughout the article for a visual representation of the startup procedure and call stack.

Table of Contents:

  1. ARM Procedure Call Standard
  2. System Configuration
  3. Initial Exploration
    1. Boot Path
    2. _start Disassembly
  4. nRF52 Initial Boot
    1. Load from Flash to RAM
    2. Optional: Clear .bss
    3. SystemInit
    4. Call start
    5. IRQ Handlers
  5. nRF52 System Initialization
  6. Newlib ARM Startup
    1. crt0.s
      1. Stack Setup
      2. Initialize .bss
      3. Target-Specific Initialization
      4. argc and argv Initialization
      5. Call Global Constructors
    2. __libc_init_array
    3. __libc_fini_array
    4. Heap Limit and malloc
    5. atexit Family
      1. atexit
      2. __cxa_atexit
      3. __register_exitproc
      4. Automatic Registration of Destructors
    6. exit Family
      1. exit
      2. _exit
      3. __call_exitprocs
      4. _kill
  7. Visual Summary
  8. Startup Activity Checklist
  9. Further Reading

ARM Procedure Call Standard

Since we are going to look at ARM assembly, we will need to familiarize ourselves with the basics of the Procedure Call Standard for ARM Applications.

There are sixteen 32-bit registers and a status register (CPSR) in the ARM and Thumb instruction sets:

  • r0 (aka a1) is Argument register 1 and a result register
  • r1 (aka a2) is Argument register 2 and a result register
  • r2 (aka a3) is Argument register 3
  • r3 (aka a4) is Argument register 4
  • r4 (aka v1) is Variable register 1
  • r5 (aka v2) is Variable register 2
  • r6 (aka v3) is Variable register 3
  • r7 (aka v4) is Variable register 4
  • r8 (aka v5) is Variable register 5
  • r9 usage changes depending on the platform
  • r10 (aka v7) is Variable register 7
  • r11 (aka v7) is Variable register 8
  • r12 is the IP special purpose register (intra-procedure-call scratch register)
  • r2` is the SP special register (stack pointer)
  • r14 is the LR special register (link register)
  • r15 is the PC special register (program counter)

The standard says the following for the argument registers (r0-r3):

The first four registers r0-r3 (a1-a4) are used to pass argument values into a subroutine and to return a result value from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls).

We have multiple registers to hold the value of local variables:

Typically, the registers r4-r8, r10 and r11 (v1-v5, v7 and v8) are used to hold the values of a routine’s local variables. Of these, only v1-v4 can be used uniformly by the whole Thumb instruction set, but the AAPCS does not require that Thumb code only use those registers.

We must preserve specific registers when calling functions:

A subroutine must preserve the contents of the registers r4-r8, r10, r11 and SP (and r9 in PCS variants that designate r9 as v6)

ARM specifies that the stack pointer (SP) must always be aligned to a word boundary (i.e., sp % 4 == 0). For public interfaces, the stack must be aligned to a double-word boundary (i.e., sp % 8 == 0).

The least significant bit of a function address is an ARM/Thumb flag (1 == ARM, 0 == Thumb). This bit is set by the linker.

When we want to call a subroutine, we need to preserve the current function's persistent registers on the stack, store the return address in the LR register (so we know how to get back from our function), and change the PC to the subroutine address. ARM provides branching instructions which handle this process for us (e.g., bl, blx,bx`), although the process may still be performed manually.

Now, there are many details that we did not cover, but this basic overview provides enough details to understand some of the assembly that we will be analyzing. Particularly important to keep in mind: values put into r0-r3 represent arguments to functions, and values put into r4-r11 represent variables used in our current subroutine.

System Configuration

For this exploration, I used a Nordic nRF52840 Development Kit. The development kit has several examples provided by Nordic; I used the blinky program. I compiled and linked the program with the GNU ARM toolchain (version 8-2018-q4-major). The Nordic blinky program links against the Newlib libraries provided by the GNU ARM toolchain.

Because this is a Cortex-M processor, the program is compiled entirely in Thumb mode. We will discuss some aspects of the boot process which apply to Cortex-A processors that use ARM instructions.

Initial Exploration

Before we start blindly looking through the Newlib code base, we should do some initial exploration with our debugger as described in the last article.

To begin the investigation, I compiled the blinky example for the nRF52840 Development Kit (PCA10056 in the SDK parlance) in the "blank" configuration using the armgcc Makefile. I flashed the binary to the board with the nRF Connect Programmer,

First, let's start with a backtrace from main in an example program so we can see what code is run. Then we will look at the disassembly for the _start function that is provided by Newlib.

nRF52840 Boot

Our backtrace showed us that our journey begins in the Reset_Handler function in gcc_startup_nrf52840.S (found in the nRF SDK).

The file begins by providing for stack storage:

.section .stack
#if defined(__STARTUP_CONFIG)
    .align __STARTUP_CONFIG_STACK_ALIGNEMENT
    .equ    Stack_Size, __STARTUP_CONFIG_STACK_SIZE
#elif defined(__STACK_SIZE)
    .align 3
    .equ    Stack_Size, __STACK_SIZE
#else
    .align 3
    .equ    Stack_Size, 8192
#endif
    .globl __StackTop
    .globl __StackLimit
__StackLimit:
    .space Stack_Size
    .size __StackLimit, . - __StackLimit
__StackTop:
    .size __StackTop, . - __StackTop

There are also provisions for heap storage:

.section .heap
    .align 3
#if defined(__STARTUP_CONFIG)
    .equ Heap_Size, __STARTUP_CONFIG_HEAP_SIZE
#elif defined(__HEAP_SIZE)
    .equ Heap_Size, __HEAP_SIZE
#else
    .equ Heap_Size, 8192
#endif
    .globl __HeapBase
    .globl __HeapLimit
__HeapBase:
    .if Heap_Size
    .space Heap_Size
    .endif
    .size __HeapBase, . - __HeapBase
__HeapLimit:
    .size __HeapLimit, . - __HeapLimit

This file also contains a declaration of all interrupt vectors and their associated handlers. A small sample is shown:

.section .isr_vector
    .align 2
    .globl __isr_vector
__isr_vector:
    .long   __StackTop                  /* Top of Stack */
    .long   Reset_Handler
    .long   NMI_Handler
    .long   HardFault_Handler
    .long   MemoryManagement_Handler
    .long   BusFault_Handler
    .long   UsageFault_Handler

    /// ...

    .size __isr_vector, . - __isr_vector

We then find the declaration of Reset_Handler:

.text
    .thumb
    .thumb_func
    .align 1
    .globl Reset_Handler
    .type Reset_Handler, %function
Reset_Handler:

Load from Flash to RAM

First, the reset handler copies data from flash to RAM.

The data is copied from the address of the __etext symbol, which represents the end of the .text section in flash storage. The data is copied to the address indicated by the __data_start__ symbol, and the number of bytes copied is calculated by subtracting the __data_start__ address from __bss_start__, which indicates the beginning of the next section. As the nRF startup code explains, __bss_start__ is used so users can insert their own initialized data section before the .bss section. Using this logic, it will be copied to RAM without any changes from the user.

ldr r1, =__etext
    ldr r2, =__data_start__
    ldr r3, =__bss_start__

    subs r3, r3, r2
    ble .L_loop1_done

.L_loop1:
    subs r3, r3, #4
    ldr r0, [r1,r3]
    str r0, [r2,r3]
    bgt .L_loop1

Optional: Clear .bss

Once the .data section contents are copied to RAM, there is an optional step for initializing the .bss section contents to 0. In our case, this code is not compiled. Newlib handles .bss initialization.

.L_loop1_done:
/* This part of work usually is done in C library startup code. Otherwise,
 * define __STARTUP_CLEAR_BSS to enable it in this startup. This section
 * clears the RAM where BSS data is located.
 *
 * The BSS section is specified by following symbols
 *    __bss_start__: start of the BSS section.
 *    __bss_end__: end of the BSS section.
 *
 * All addresses must be aligned to 4 bytes boundary.
 */
#ifdef __STARTUP_CLEAR_BSS
    ldr r1, =__bss_start__
    ldr r2, =__bss_end__

    movs r0, 0

    subs r2, r2, r1
    ble .L_loop3_done

.L_loop3:
    subs r2, r2, #4
    str r0, [r1, r2]
    bgt .L_loop3

.L_loop3_done:
#endif /* __STARTUP_CLEAR_BSS */

SystemInit

Before invoking the C runtime startup routine, a SystemInit function is called. This function, which we will look at next, is responsible for initializing the processor and applying behavioral fixes for relevant errata.

bl SystemInit

Call _start

Once the processor is initialized, we call the _start function to initialize the C runtime. Note that the nRF startup code allows you to define a custom entry point with a compiler definition.

/* Call _start function provided by libraries.  If those libraries 
 * are not accessible, define __START as your entry point. */
#ifndef __START
#define __START _start
#endif
    bl __START

IRQ Handlers

The gcc_startup_nrf52840.S also contains dummy exception handler function definitions. For example:

.weak   NMI_Handler
    .type   NMI_Handler, %function
NMI_Handler:
    b       .
    .size   NMI_Handler, . - NMI_Handler

    .weak   HardFault_Handler
    .type   HardFault_Handler, %function
HardFault_Handler:
    b       .
    .size   HardFault_Handler, . - HardFault_Handler

A default handler is declared, which performs an infinite loop:

.globl  Default_Handler
    .type   Default_Handler, %function
Default_Handler:
    b       .
    .size   Default_Handler, . - Default_Handler

All other IRQ handlers are mapped to this default handler. Users are able to overwrite these handlers with their own implementations as needed.

.macro  IRQ handler
.weak   \handler
.set    \handler, Default_Handler
.endm

IRQ  POWER_CLOCK_IRQHandler
IRQ  RADIO_IRQHandler
IRQ  UARTE0_UART0_IRQHandler

/// ...

After the IRQ handlers are supplied, the file ends.

.end

nRF52 System Initialization

The SystemInit function is implemented in system_nrf52840.c (found in the nRF SDK). For a normal application, this file would be modified to suit the platform's requirements. We'll look at the default implementation for our processor.

First, SWO trace functionality is enabled in the processor. If ENABLE_SWO is not defined, the pin is left as normal GPIO.

#if defined (ENABLE_SWO)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    NRF_CLOCK->TRACECONFIG |= CLOCK_TRACECONFIG_TRACEMUX_Serial << CLOCK_TRACECONFIG_TRACEMUX_Pos;
    NRF_P1->PIN_CNF[0] = (GPIO_PIN_CNF_DRIVE_H0H1 << GPIO_PIN_CNF_DRIVE_Pos) | (GPIO_PIN_CNF_INPUT_Connect << GPIO_PIN_CNF_INPUT_Pos) | (GPIO_PIN_CNF_DIR_Output << GPIO_PIN_CNF_DIR_Pos);
#endif

Next, Trace functionality is enabled in the processor. If ENABLE_TRACE is not defined, the pins are left as normal GPIO.

#if defined (ENABLE_TRACE)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    NRF_CLOCK->TRACECONFIG |= CLOCK_TRACECONFIG_TRACEMUX_Parallel << CLOCK_TRACECONFIG_TRACEMUX_Pos;
    NRF_P0->PIN_CNF[7]  = (GPIO_PIN_CNF_DRIVE_H0H1 << GPIO_PIN_CNF_DRIVE_Pos) | (GPIO_PIN_CNF_INPUT_Connect << GPIO_PIN_CNF_INPUT_Pos) | (GPIO_PIN_CNF_DIR_Output << GPIO_PIN_CNF_DIR_Pos);

// ... more pin configurations in the actual implementation

#endif

Following debug configuration, the system checks for a variety of errata conditions and applies fixes as necessary. Here are a few examples:

/* Workaround for Errata 98 "NFCT: Not able to communicate with the peer"  */
if (errata_98()){
    *(volatile uint32_t *)0x4000568Cul = 0x00038148ul;
}

/* Workaround for Errata 103 "CCM: Wrong reset value of CCM MAXPACKETSIZE"  */
if (errata_103()){
    NRF_CCM->MAXPACKETSIZE = 0xFBul;
}

Following the errata section, the FPU is initialized if the program has been compiled with floating point support. The __FPU_USED macro is supplied by the compiler.

#if (__FPU_USED == 1)
    SCB->CPACR |= (3UL << 20) | (3UL << 22);
    __DSB();
    __ISB();
#endif

If NFC is not used for an nRF52 platform, the associated NFC pins are configured as normal GPIO.

#if defined (CONFIG_NFCT_PINS_AS_GPIOS)
    if ((NRF_UICR->NFCPINS & UICR_NFCPINS_PROTECT_Msk) == (UICR_NFCPINS_PROTECT_NFC << UICR_NFCPINS_PROTECT_Pos)){
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Wen << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->NFCPINS &= ~UICR_NFCPINS_PROTECT_Msk;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Ren << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NVIC_SystemReset();
    }
#endif

The nRF allows a GPIO to be configured as a reset pin. If CONFIG_GPIO_AS_PINRESET is defined, a dedicated GPIO will be configured to act as a reset pin.

#if defined (CONFIG_GPIO_AS_PINRESET)
    if (((NRF_UICR->PSELRESET[0] & UICR_PSELRESET_CONNECT_Msk) != (UICR_PSELRESET_CONNECT_Connected << UICR_PSELRESET_CONNECT_Pos)) ||
        ((NRF_UICR->PSELRESET[1] & UICR_PSELRESET_CONNECT_Msk) != (UICR_PSELRESET_CONNECT_Connected << UICR_PSELRESET_CONNECT_Pos))){
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Wen << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->PSELRESET[0] = 18;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->PSELRESET[1] = 18;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Ren << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NVIC_SystemReset();
    }
#endif

Finally, the system clock is initialized:

SystemCoreClockUpdate();

Newlib ARM Startup

After data has been relocated and the processor properly initialized, the reset handler calls the _start function. For our GCC ARM application, this function is supplied by Newlib.

The Newlib project is divided into two major parts: newlib and libgloss. The newlib portion is an implementation of libc and libm. The libgloss portion contains platform-specific code, such as startup files, board support packages, and I/O support for the C library.

When exploring the Newlib code base on your own, it is important to note the distinction between libgloss and newlib. The libgloss division happened after the inception of the Newlib project. Many of the same files are found in the newlib folder and the libgloss folder. For platform-specific code, you should prefer the libgloss implementations. These are newer, and the older implementations remain in the newlib folder for backwards compatibility with older targets.

Visual Summary


nRF52840 Boot

Our backtrace showed us that our journey begins in the Reset_Handler function in gcc_startup_nrf52840.S (found in the nRF SDK).

The file begins by providing for stack storage:

.section .stack
#if defined(__STARTUP_CONFIG)
    .align __STARTUP_CONFIG_STACK_ALIGNEMENT
    .equ    Stack_Size, __STARTUP_CONFIG_STACK_SIZE
#elif defined(__STACK_SIZE)
    .align 3
    .equ    Stack_Size, __STACK_SIZE
#else
    .align 3
    .equ    Stack_Size, 8192
#endif
    .globl __StackTop
    .globl __StackLimit
__StackLimit:
    .space Stack_Size
    .size __StackLimit, . - __StackLimit
__StackTop:
    .size __StackTop, . - __StackTop

There are also provisions for heap storage:

.section .heap
    .align 3
#if defined(__STARTUP_CONFIG)
    .equ Heap_Size, __STARTUP_CONFIG_HEAP_SIZE
#elif defined(__HEAP_SIZE)
    .equ Heap_Size, __HEAP_SIZE
#else
    .equ Heap_Size, 8192
#endif
    .globl __HeapBase
    .globl __HeapLimit
__HeapBase:
    .if Heap_Size
    .space Heap_Size
    .endif
    .size __HeapBase, . - __HeapBase
__HeapLimit:
    .size __HeapLimit, . - __HeapLimit

This file also contains a declaration of all interrupt vectors and their associated handlers. A small sample is shown:

.section .isr_vector
    .align 2
    .globl __isr_vector
__isr_vector:
    .long   __StackTop                  /* Top of Stack */
    .long   Reset_Handler
    .long   NMI_Handler
    .long   HardFault_Handler
    .long   MemoryManagement_Handler
    .long   BusFault_Handler
    .long   UsageFault_Handler

    /// ...

    .size __isr_vector, . - __isr_vector

We then find the declaration of Reset_Handler:

.text
    .thumb
    .thumb_func
    .align 1
    .globl Reset_Handler
    .type Reset_Handler, %function
Reset_Handler:

Load from Flash to RAM

First, the reset handler copies data from flash to RAM.

The data is copied from the address of the __etext symbol, which represents the end of the .text section in flash storage. The data is copied to the address indicated by the __data_start__ symbol, and the number of bytes copied is calculated by subtracting the __data_start__ address from __bss_start__, which indicates the beginning of the next section. As the nRF startup code explains, __bss_start__ is used so users can insert their own initialized data section before the .bss section. Using this logic, it will be copied to RAM without any changes from the user.

ldr r1, =__etext
    ldr r2, =__data_start__
    ldr r3, =__bss_start__

    subs r3, r3, r2
    ble .L_loop1_done

.L_loop1:
    subs r3, r3, #4
    ldr r0, [r1,r3]
    str r0, [r2,r3]
    bgt .L_loop1

Optional: Clear .bss

Once the .data section contents are copied to RAM, there is an optional step for initializing the .bss section contents to 0. In our case, this code is not compiled. Newlib handles .bss initialization.

.L_loop1_done:
/* This part of work usually is done in C library startup code. Otherwise,
 * define __STARTUP_CLEAR_BSS to enable it in this startup. This section
 * clears the RAM where BSS data is located.
 *
 * The BSS section is specified by following symbols
 *    __bss_start__: start of the BSS section.
 *    __bss_end__: end of the BSS section.
 *
 * All addresses must be aligned to 4 bytes boundary.
 */
#ifdef __STARTUP_CLEAR_BSS
    ldr r1, =__bss_start__
    ldr r2, =__bss_end__

    movs r0, 0

    subs r2, r2, r1
    ble .L_loop3_done

.L_loop3:
    subs r2, r2, #4
    str r0, [r1, r2]
    bgt .L_loop3

.L_loop3_done:
#endif /* __STARTUP_CLEAR_BSS */

SystemInit

Before invoking the C runtime startup routine, a SystemInit function is called. This function, which we will look at next, is responsible for initializing the processor and applying behavioral fixes for relevant errata.

bl SystemInit

Call _start

Once the processor is initialized, we call the _start function to initialize the C runtime. Note that the nRF startup code allows you to define a custom entry point with a compiler definition.

/* Call _start function provided by libraries.  If those libraries 
 * are not accessible, define __START as your entry point. */
#ifndef __START
#define __START _start
#endif
    bl __START

IRQ Handlers

The gcc_startup_nrf52840.S also contains dummy exception handler function definitions. For example:

.weak   NMI_Handler
    .type   NMI_Handler, %function
NMI_Handler:
    b       .
    .size   NMI_Handler, . - NMI_Handler

    .weak   HardFault_Handler
    .type   HardFault_Handler, %function
HardFault_Handler:
    b       .
    .size   HardFault_Handler, . - HardFault_Handler

A default handler is declared, which performs an infinite loop:

.globl  Default_Handler
    .type   Default_Handler, %function
Default_Handler:
    b       .
    .size   Default_Handler, . - Default_Handler

All other IRQ handlers are mapped to this default handler. Users are able to overwrite these handlers with their own implementations as needed.

.macro  IRQ handler
.weak   \handler
.set    \handler, Default_Handler
.endm

IRQ  POWER_CLOCK_IRQHandler
IRQ  RADIO_IRQHandler
IRQ  UARTE0_UART0_IRQHandler

/// ...

After the IRQ handlers are supplied, the file ends.

.end

nRF52 System Initialization

The SystemInit function is implemented in system_nrf52840.c (found in the nRF SDK). For a normal application, this file would be modified to suit the platform's requirements. We'll look at the default implementation for our processor.

First, SWO trace functionality is enabled in the processor. If ENABLE_SWO is not defined, the pin is left as normal GPIO.

#if defined (ENABLE_SWO)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    NRF_CLOCK->TRACECONFIG |= CLOCK_TRACECONFIG_TRACEMUX_Serial << CLOCK_TRACECONFIG_TRACEMUX_Pos;
    NRF_P1->PIN_CNF[0] = (GPIO_PIN_CNF_DRIVE_H0H1 << GPIO_PIN_CNF_DRIVE_Pos) | (GPIO_PIN_CNF_INPUT_Connect << GPIO_PIN_CNF_INPUT_Pos) | (GPIO_PIN_CNF_DIR_Output << GPIO_PIN_CNF_DIR_Pos);
#endif

Next, Trace functionality is enabled in the processor. If ENABLE_TRACE is not defined, the pins are left as normal GPIO.

#if defined (ENABLE_TRACE)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    NRF_CLOCK->TRACECONFIG |= CLOCK_TRACECONFIG_TRACEMUX_Parallel << CLOCK_TRACECONFIG_TRACEMUX_Pos;
    NRF_P0->PIN_CNF[7]  = (GPIO_PIN_CNF_DRIVE_H0H1 << GPIO_PIN_CNF_DRIVE_Pos) | (GPIO_PIN_CNF_INPUT_Connect << GPIO_PIN_CNF_INPUT_Pos) | (GPIO_PIN_CNF_DIR_Output << GPIO_PIN_CNF_DIR_Pos);

// ... more pin configurations in the actual implementation

#endif

Following debug configuration, the system checks for a variety of errata conditions and applies fixes as necessary. Here are a few examples:

/* Workaround for Errata 98 "NFCT: Not able to communicate with the peer"  */
if (errata_98()){
    *(volatile uint32_t *)0x4000568Cul = 0x00038148ul;
}

/* Workaround for Errata 103 "CCM: Wrong reset value of CCM MAXPACKETSIZE"  */
if (errata_103()){
    NRF_CCM->MAXPACKETSIZE = 0xFBul;
}

Following the errata section, the FPU is initialized if the program has been compiled with floating point support. The __FPU_USED macro is supplied by the compiler.

#if (__FPU_USED == 1)
    SCB->CPACR |= (3UL << 20) | (3UL << 22);
    __DSB();
    __ISB();
#endif

If NFC is not used for an nRF52 platform, the associated NFC pins are configured as normal GPIO.

#if defined (CONFIG_NFCT_PINS_AS_GPIOS)
    if ((NRF_UICR->NFCPINS & UICR_NFCPINS_PROTECT_Msk) == (UICR_NFCPINS_PROTECT_NFC << UICR_NFCPINS_PROTECT_Pos)){
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Wen << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->NFCPINS &= ~UICR_NFCPINS_PROTECT_Msk;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Ren << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NVIC_SystemReset();
    }
#endif

The nRF allows a GPIO to be configured as a reset pin. If CONFIG_GPIO_AS_PINRESET is defined, a dedicated GPIO will be configured to act as a reset pin.

#if defined (CONFIG_GPIO_AS_PINRESET)
    if (((NRF_UICR->PSELRESET[0] & UICR_PSELRESET_CONNECT_Msk) != (UICR_PSELRESET_CONNECT_Connected << UICR_PSELRESET_CONNECT_Pos)) ||
        ((NRF_UICR->PSELRESET[1] & UICR_PSELRESET_CONNECT_Msk) != (UICR_PSELRESET_CONNECT_Connected << UICR_PSELRESET_CONNECT_Pos))){
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Wen << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->PSELRESET[0] = 18;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->PSELRESET[1] = 18;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Ren << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NVIC_SystemReset();
    }
#endif

Finally, the system clock is initialized:

SystemCoreClockUpdate();

Updated: 20190909

For most programmers, a C or C++ program's life begins at the main function. They are blissfully unaware of the hidden steps that happen between invoking a program and executing main. Depending on the program and the compiler, there are all kinds of interesting functions that get run before main, automatically inserted by the compiler and linker and invisible to casual observers.

Unfortunately for programmers who are curious about the program startup process, the literature on what happens before main is quite sparse.

Embedded Artistry has been hard at working creating a C++ embedded framework. The final piece of the puzzle was implementing program startup code. To aid in the design of our framework's boot process, I performed an exploratory survey of existing program startup implementations. My goal is to identify a general program startup model. I also want to provide a more comprehensive look into how our programs get to main.

In this six-part series, we will be investigating what it takes to get to main:

  1. A General Overview of What Happens Before main()
  2. Exploring Startup Implementations: Newlib (ARM)
  3. Exploring Startup Implementations: OS X
  4. Exploring Startup Implementations: Custom Embedded System with ThreadX
  5. Abstracting a Generic Flow for Getting to main
  6. Implementing our Generic Startup Flow

Now that we have a high-level understanding of how our programs get to main, we can explore real-world implementations of program startup code.

Today's analysis focuses on Newlib. If you build embedded applications for ARM using the GNU arm-none-eabi toolchain, your program is linked with Newlib startup code by default. Newlib supports multiple architectures, but we will focus exclusively on the ARM startup path.

If you are interested in exploring Newlib startup routines on your own, you can download the Newlib source code or browse the source code online.

The boot flow is quite complicated, and it's easy to get mentally lost. You can refer to the Visual Summary throughout the article for a visual representation of the startup procedure and call stack.

Table of Contents:

  1. ARM Procedure Call Standard
  2. System Configuration
  3. Initial Exploration
    1. Boot Path
    2. _start Disassembly
  4. nRF52 Initial Boot
    1. Load from Flash to RAM
    2. Optional: Clear .bss
    3. SystemInit
    4. Call start
    5. IRQ Handlers
  5. nRF52 System Initialization
  6. Newlib ARM Startup
    1. crt0.s
      1. Stack Setup
      2. Initialize .bss
      3. Target-Specific Initialization
      4. argc and argv Initialization
      5. Call Global Constructors
    2. __libc_init_array
    3. __libc_fini_array
    4. Heap Limit and malloc
    5. atexit Family
      1. atexit
      2. __cxa_atexit
      3. __register_exitproc
      4. Automatic Registration of Destructors
    6. exit Family
      1. exit
      2. _exit
      3. __call_exitprocs
      4. _kill
  7. Visual Summary
  8. Startup Activity Checklist
  9. Further Reading

ARM Procedure Call Standard

Since we are going to look at ARM assembly, we will need to familiarize ourselves with the basics of the Procedure Call Standard for ARM Applications.

There are sixteen 32-bit registers and a status register (CPSR) in the ARM and Thumb instruction sets:

  • r0 (aka a1) is Argument register 1 and a result register
  • r1 (aka a2) is Argument register 2 and a result register
  • r2 (aka a3) is Argument register 3
  • r3 (aka a4) is Argument register 4
  • r4 (aka v1) is Variable register 1
  • r5 (aka v2) is Variable register 2
  • r6 (aka v3) is Variable register 3
  • r7 (aka v4) is Variable register 4
  • r8 (aka v5) is Variable register 5
  • r9 usage changes depending on the platform
  • r10 (aka v7) is Variable register 7
  • r11 (aka v7) is Variable register 8
  • r12 is the IP special purpose register (intra-procedure-call scratch register)
  • r2` is the SP special register (stack pointer)
  • r14 is the LR special register (link register)
  • r15 is the PC special register (program counter)

The standard says the following for the argument registers (r0-r3):

The first four registers r0-r3 (a1-a4) are used to pass argument values into a subroutine and to return a result value from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls).

We have multiple registers to hold the value of local variables:

Typically, the registers r4-r8, r10 and r11 (v1-v5, v7 and v8) are used to hold the values of a routine’s local variables. Of these, only v1-v4 can be used uniformly by the whole Thumb instruction set, but the AAPCS does not require that Thumb code only use those registers.

We must preserve specific registers when calling functions:

A subroutine must preserve the contents of the registers r4-r8, r10, r11 and SP (and r9 in PCS variants that designate r9 as v6)

ARM specifies that the stack pointer (SP) must always be aligned to a word boundary (i.e., sp % 4 == 0). For public interfaces, the stack must be aligned to a double-word boundary (i.e., sp % 8 == 0).

The least significant bit of a function address is an ARM/Thumb flag (1 == ARM, 0 == Thumb). This bit is set by the linker.

When we want to call a subroutine, we need to preserve the current function's persistent registers on the stack, store the return address in the LR register (so we know how to get back from our function), and change the PC to the subroutine address. ARM provides branching instructions which handle this process for us (e.g., bl, blx,bx`), although the process may still be performed manually.

Now, there are many details that we did not cover, but this basic overview provides enough details to understand some of the assembly that we will be analyzing. Particularly important to keep in mind: values put into r0-r3 represent arguments to functions, and values put into r4-r11 represent variables used in our current subroutine.

System Configuration

For this exploration, I used a Nordic nRF52840 Development Kit. The development kit has several examples provided by Nordic; I used the blinky program. I compiled and linked the program with the GNU ARM toolchain (version 8-2018-q4-major). The Nordic blinky program links against the Newlib libraries provided by the GNU ARM toolchain.

Because this is a Cortex-M processor, the program is compiled entirely in Thumb mode. We will discuss some aspects of the boot process which apply to Cortex-A processors that use ARM instructions.

Initial Exploration

Before we start blindly looking through the Newlib code base, we should do some initial exploration with our debugger as described in the last article.

To begin the investigation, I compiled the blinky example for the nRF52840 Development Kit (PCA10056 in the SDK parlance) in the "blank" configuration using the armgcc Makefile. I flashed the binary to the board with the nRF Connect Programmer,

First, let's start with a backtrace from main in an example program so we can see what code is run. Then we will look at the disassembly for the _start function that is provided by Newlib.

nRF52840 Boot

Our backtrace showed us that our journey begins in the Reset_Handler function in gcc_startup_nrf52840.S (found in the nRF SDK).

The file begins by providing for stack storage:

.section .stack
#if defined(__STARTUP_CONFIG)
    .align __STARTUP_CONFIG_STACK_ALIGNEMENT
    .equ    Stack_Size, __STARTUP_CONFIG_STACK_SIZE
#elif defined(__STACK_SIZE)
    .align 3
    .equ    Stack_Size, __STACK_SIZE
#else
    .align 3
    .equ    Stack_Size, 8192
#endif
    .globl __StackTop
    .globl __StackLimit
__StackLimit:
    .space Stack_Size
    .size __StackLimit, . - __StackLimit
__StackTop:
    .size __StackTop, . - __StackTop

There are also provisions for heap storage:

.section .heap
    .align 3
#if defined(__STARTUP_CONFIG)
    .equ Heap_Size, __STARTUP_CONFIG_HEAP_SIZE
#elif defined(__HEAP_SIZE)
    .equ Heap_Size, __HEAP_SIZE
#else
    .equ Heap_Size, 8192
#endif
    .globl __HeapBase
    .globl __HeapLimit
__HeapBase:
    .if Heap_Size
    .space Heap_Size
    .endif
    .size __HeapBase, . - __HeapBase
__HeapLimit:
    .size __HeapLimit, . - __HeapLimit

This file also contains a declaration of all interrupt vectors and their associated handlers. A small sample is shown:

.section .isr_vector
    .align 2
    .globl __isr_vector
__isr_vector:
    .long   __StackTop                  /* Top of Stack */
    .long   Reset_Handler
    .long   NMI_Handler
    .long   HardFault_Handler
    .long   MemoryManagement_Handler
    .long   BusFault_Handler
    .long   UsageFault_Handler

    /// ...

    .size __isr_vector, . - __isr_vector

We then find the declaration of Reset_Handler:

.text
    .thumb
    .thumb_func
    .align 1
    .globl Reset_Handler
    .type Reset_Handler, %function
Reset_Handler:

Load from Flash to RAM

First, the reset handler copies data from flash to RAM.

The data is copied from the address of the __etext symbol, which represents the end of the .text section in flash storage. The data is copied to the address indicated by the __data_start__ symbol, and the number of bytes copied is calculated by subtracting the __data_start__ address from __bss_start__, which indicates the beginning of the next section. As the nRF startup code explains, __bss_start__ is used so users can insert their own initialized data section before the .bss section. Using this logic, it will be copied to RAM without any changes from the user.

ldr r1, =__etext
    ldr r2, =__data_start__
    ldr r3, =__bss_start__

    subs r3, r3, r2
    ble .L_loop1_done

.L_loop1:
    subs r3, r3, #4
    ldr r0, [r1,r3]
    str r0, [r2,r3]
    bgt .L_loop1

Optional: Clear .bss

Once the .data section contents are copied to RAM, there is an optional step for initializing the .bss section contents to 0. In our case, this code is not compiled. Newlib handles .bss initialization.

.L_loop1_done:
/* This part of work usually is done in C library startup code. Otherwise,
 * define __STARTUP_CLEAR_BSS to enable it in this startup. This section
 * clears the RAM where BSS data is located.
 *
 * The BSS section is specified by following symbols
 *    __bss_start__: start of the BSS section.
 *    __bss_end__: end of the BSS section.
 *
 * All addresses must be aligned to 4 bytes boundary.
 */
#ifdef __STARTUP_CLEAR_BSS
    ldr r1, =__bss_start__
    ldr r2, =__bss_end__

    movs r0, 0

    subs r2, r2, r1
    ble .L_loop3_done

.L_loop3:
    subs r2, r2, #4
    str r0, [r1, r2]
    bgt .L_loop3

.L_loop3_done:
#endif /* __STARTUP_CLEAR_BSS */

SystemInit

Before invoking the C runtime startup routine, a SystemInit function is called. This function, which we will look at next, is responsible for initializing the processor and applying behavioral fixes for relevant errata.

bl SystemInit

Call _start

Once the processor is initialized, we call the _start function to initialize the C runtime. Note that the nRF startup code allows you to define a custom entry point with a compiler definition.

/* Call _start function provided by libraries.  If those libraries 
 * are not accessible, define __START as your entry point. */
#ifndef __START
#define __START _start
#endif
    bl __START

IRQ Handlers

The gcc_startup_nrf52840.S also contains dummy exception handler function definitions. For example:

.weak   NMI_Handler
    .type   NMI_Handler, %function
NMI_Handler:
    b       .
    .size   NMI_Handler, . - NMI_Handler

    .weak   HardFault_Handler
    .type   HardFault_Handler, %function
HardFault_Handler:
    b       .
    .size   HardFault_Handler, . - HardFault_Handler

A default handler is declared, which performs an infinite loop:

.globl  Default_Handler
    .type   Default_Handler, %function
Default_Handler:
    b       .
    .size   Default_Handler, . - Default_Handler

All other IRQ handlers are mapped to this default handler. Users are able to overwrite these handlers with their own implementations as needed.

.macro  IRQ handler
.weak   \handler
.set    \handler, Default_Handler
.endm

IRQ  POWER_CLOCK_IRQHandler
IRQ  RADIO_IRQHandler
IRQ  UARTE0_UART0_IRQHandler

/// ...

After the IRQ handlers are supplied, the file ends.

.end

nRF52 System Initialization

The SystemInit function is implemented in system_nrf52840.c (found in the nRF SDK). For a normal application, this file would be modified to suit the platform's requirements. We'll look at the default implementation for our processor.

First, SWO trace functionality is enabled in the processor. If ENABLE_SWO is not defined, the pin is left as normal GPIO.

#if defined (ENABLE_SWO)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    NRF_CLOCK->TRACECONFIG |= CLOCK_TRACECONFIG_TRACEMUX_Serial << CLOCK_TRACECONFIG_TRACEMUX_Pos;
    NRF_P1->PIN_CNF[0] = (GPIO_PIN_CNF_DRIVE_H0H1 << GPIO_PIN_CNF_DRIVE_Pos) | (GPIO_PIN_CNF_INPUT_Connect << GPIO_PIN_CNF_INPUT_Pos) | (GPIO_PIN_CNF_DIR_Output << GPIO_PIN_CNF_DIR_Pos);
#endif

Next, Trace functionality is enabled in the processor. If ENABLE_TRACE is not defined, the pins are left as normal GPIO.

#if defined (ENABLE_TRACE)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    NRF_CLOCK->TRACECONFIG |= CLOCK_TRACECONFIG_TRACEMUX_Parallel << CLOCK_TRACECONFIG_TRACEMUX_Pos;
    NRF_P0->PIN_CNF[7]  = (GPIO_PIN_CNF_DRIVE_H0H1 << GPIO_PIN_CNF_DRIVE_Pos) | (GPIO_PIN_CNF_INPUT_Connect << GPIO_PIN_CNF_INPUT_Pos) | (GPIO_PIN_CNF_DIR_Output << GPIO_PIN_CNF_DIR_Pos);

// ... more pin configurations in the actual implementation

#endif

Following debug configuration, the system checks for a variety of errata conditions and applies fixes as necessary. Here are a few examples:

/* Workaround for Errata 98 "NFCT: Not able to communicate with the peer"  */
if (errata_98()){
    *(volatile uint32_t *)0x4000568Cul = 0x00038148ul;
}

/* Workaround for Errata 103 "CCM: Wrong reset value of CCM MAXPACKETSIZE"  */
if (errata_103()){
    NRF_CCM->MAXPACKETSIZE = 0xFBul;
}

Following the errata section, the FPU is initialized if the program has been compiled with floating point support. The __FPU_USED macro is supplied by the compiler.

#if (__FPU_USED == 1)
    SCB->CPACR |= (3UL << 20) | (3UL << 22);
    __DSB();
    __ISB();
#endif

If NFC is not used for an nRF52 platform, the associated NFC pins are configured as normal GPIO.

#if defined (CONFIG_NFCT_PINS_AS_GPIOS)
    if ((NRF_UICR->NFCPINS & UICR_NFCPINS_PROTECT_Msk) == (UICR_NFCPINS_PROTECT_NFC << UICR_NFCPINS_PROTECT_Pos)){
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Wen << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->NFCPINS &= ~UICR_NFCPINS_PROTECT_Msk;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Ren << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NVIC_SystemReset();
    }
#endif

The nRF allows a GPIO to be configured as a reset pin. If CONFIG_GPIO_AS_PINRESET is defined, a dedicated GPIO will be configured to act as a reset pin.

#if defined (CONFIG_GPIO_AS_PINRESET)
    if (((NRF_UICR->PSELRESET[0] & UICR_PSELRESET_CONNECT_Msk) != (UICR_PSELRESET_CONNECT_Connected << UICR_PSELRESET_CONNECT_Pos)) ||
        ((NRF_UICR->PSELRESET[1] & UICR_PSELRESET_CONNECT_Msk) != (UICR_PSELRESET_CONNECT_Connected << UICR_PSELRESET_CONNECT_Pos))){
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Wen << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->PSELRESET[0] = 18;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->PSELRESET[1] = 18;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Ren << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NVIC_SystemReset();
    }
#endif

Finally, the system clock is initialized:

SystemCoreClockUpdate();

Newlib ARM Startup

After data has been relocated and the processor properly initialized, the reset handler calls the _start function. For our GCC ARM application, this function is supplied by Newlib.

The Newlib project is divided into two major parts: newlib and libgloss. The newlib portion is an implementation of libc and libm. The libgloss portion contains platform-specific code, such as startup files, board support packages, and I/O support for the C library.

When exploring the Newlib code base on your own, it is important to note the distinction between libgloss and newlib. The libgloss division happened after the inception of the Newlib project. Many of the same files are found in the newlib folder and the libgloss folder. For platform-specific code, you should prefer the libgloss implementations. These are newer, and the older implementations remain in the newlib folder for backwards compatibility with older targets.

Visual Summary


Visual Summary


Startup Activity Checklist

In the first article of this series, we reviewed a broad range of startup activities that occur before main is called.

Here is a checklist of actions that were observed in the Newlib ARM program startup procedures:

  • [x] Early low-level initialization of the processor/hardware
  • [x] Stack initialization
  • [x] Frame pointer initialization
  • [x] C/C++ runtime setup
    • [x] Handle relocations (some sections are copied from flash to RAM)
    • [x] Initialize .bss
    • [x] Call global constructors
    • [x] Prepare argc, argv (set to 0)
    • [ ] Prepare environment variables
    • [x] Heap initialization
    • [ ] stdio initialization
    • [ ] Initialize exception support
    • [x] Register destructors and other exit-time functionality
  • [ ] System scaffolding setup
    • [ ] Threading support
    • [ ] Thread local storage
    • [ ] Buffer overrun detection
    • [ ] Run-time error checks
    • [ ] Locale settings
    • [ ] Math error handling
    • [ ] Math precision
  • [x] Jump to main
  • [x] Exit after main

Further Reading

Change Log

  • 20190909:
    • Added links to a great Matt Godbolt talk
    • Added links to Memfault's "Zero to Main()" series

Related Articles

A General Overview of What Happens Before main()

Updated: 20190909

For most programmers, a C or C++ program's life begins at the main function. They are blissfully unaware of the hidden steps that happen between invoking a program and executing main. Depending on the program and the compiler, there are all kinds of interesting functions that get run before main, automatically inserted by the compiler and linker and invisible to casual observers.

Unfortunately for programmers who are curious about the program startup process, the literature on what happens before main is quite sparse.

Embedded Artistry has been hard at working creating a C++ embedded framework. The final piece of the puzzle was implementing program startup code. To aid in the design of our framework's boot process, I performed an exploratory survey of existing program startup implementations. My goal is to identify a general program startup model. I also want to provide a more comprehensive look into how our programs get to main.

In this six-part series, we will be investigating what it takes to get to main:

  1. What Happens Before main()?
  2. Exploring Startup Implementations: Newlib (ARM)
  3. Exploring Startup Implementations: OS X
  4. Exploring Startup Implementations: Custom Embedded System with ThreadX
  5. Abstracting a Generic Flow for Getting to main
  6. Implementing our Generic Startup Flow

To begin our investigation, we will provide a summary of what happens in a program before main. The steps and responsibilities we describe are generalized so that they apply to most systems. We will supplement the general theory in the following articles with an analysis of real-world implementations.

Table of Contents:

  1. Getting to Main: A General Overviewy
    1. The _start Function
    2. Runtime Setup
    3. Other Scaffolding
    4. Jumping to main
    5. Returning from main
  2. How Do We Get to _start?
    1. Baremetal: reset vector
    2. Bootloader launches application
    3. OS Calls an exec function
  3. Exploring On Your Own
  4. Further Reading

Getting to Main: A General Overview

Before we dive into our exploration of how existing systems get to main, we should develop a hypothesis about what generally happens. Since others have already explored program startup, we can start with a clear idea of what happens before main.

How Do We Get to _start?

Now that we know how our program gets to main by way of the _start function, you may wonder how the program gets to _start.

There are three common paths:

  1. Baremetal: reset vector
  2. Bootloader launches application
  3. OS Calls an exec function

Baremetal: Reset Vector

A baremetal embedded application represents the simplest path to _start.

Consider a baremetal platform with a binary stored in flash memory. When power is applied to the processor, the processor will copy the program from flash and store it in RAM1. Once the program is loaded into memory, the processor jumps to the reset interrupt vector address.

The embedded program's reset interrupt handler initializes the system after power-on or reset. The reset handler typically performs an initial configuration of the processor registers and critical hardware components (such as external RAM, caches, or MMU). Once this initial configuration is complete, the reset handler jumps to _start.

Some systems do not utilize the C standard library, and in that case _start will not be called. Instead, the reset handler will invoke other setup functions or will directly execute necessary program setup steps.

1: If the chip supports execute-in-place (XIP), the processor will skip the copy step and run the program directly from flash memory.

Bootloader Launches Application

Many embedded applications are composed of multiple distinct images which run sequentially during the boot process.

Many systems use a bootloader or hypervisor, which runs before loading and executing the main application. Bootloaders perform a wide range of activities, including initializing system hardware, decryption, decompression, checking that a firmware image is valid before loading it, selecting a firmware image to boot, or determining whether to enter firmware update mode. Bootloader complexity depends on the system's requirements; not of the listed tasks tasks will be performed.

Other systems require an incremental boot process, especially when the main application is larger than the processor's internal RAM capacity. The first boot stage is typically a small image which fits into the processor's internal memory. This image will initialize external RAM and load the main application from flash into the external RAM. The first stage boot may perform additional steps, such as processor vector remapping or MMU configuration. Once the main application is loaded, the first stage boot invokes the reset vector of the main application.

Multi-stage boot scenarios complicate the program startup model. Each boot stage is technically a standalone program. However, not every stage will run through the full program startup process. Simple boot stages may only need to clear the .bss section to perform their duties, while complex bootloaders need a fully initialized C/C++ runtime. Program startup activities may be distributed across the boot process, with each stage handling specific tasks.

OS Calls an exec function

The most complex scenario is running a program on a host machine with a fully-fledged OS.

When you launch a program, your shell or GUI invokes a program loader. The loader is responsible for copying the application image from the hard drive into memory and configuring the environment that the program will run in. On Linux or OS X, the loader is a function in the exec() family typically execve() or execvp(). For Windows, the loader is the LdrInitializeThunk function in ntdll.dll.

Loaders will often perform the following actions:

  • Check permissions
  • Allocate space for the program's stack
  • Allocate space for the program's heap
  • Initialize registers (e.g., stack pointer)
  • Push argc, argv, and envp onto the program stack
  • Map virtual address spaces
  • Dynamic linking
  • Relocations
  • Call pre-initialization functions

Once the loader has configured the program environment, it calls the program's _start function.

Exploring On Your Own

In the next three articles, we will review a selection of startup procedures which differ greatly in terms of process and style:

  1. Newlib (ARM)
  2. OS X
  3. Custom Embedded System with ThreadX

We won't be reviewing Linux program startup, because there are already high-quality articles on that topic. For detailed descriptions about how Linux programs start, we recommend these articles:

  1. How Programs Get Run
  2. How Programs Get Run: ELF Binaries
  3. Linux x86 Program Start Up or - How the heck do we get to main()?

The startup code that your system runs is supplied by your libc implementation and system libraries, and the implementations will also vary depending on the target architecture. Don't be surprised if you find a different startup process than those described in this series and in other articles around the web.

You can explore your own program's startup behavior using objdump or a debugger (I.e. gdb, lldb). You can use debugging tools to tackle the problem from a variety of directions:

  1. Set a breakpoint at main() and run a backtrace to see the function call stack
  2. Set a breakpoint at _start() (or whatever entry point your backtrace shows) and step through the execution
  3. Dump the assembly output for the program using objdump

As Daniel Näslund pointed out in the comments, your debugger may be configured to suppress backtraces that go past the main function. For gdb, you can run the following command:

(gdb) set backtrace past-main on

Further Reading

Change Log

  • 20190909:
    • Added links to a great Matt Godbolt talk
    • Added links to Memfault's "Zero to Main()" series

Related Articles