Exploring Startup Implementations: Newlib (ARM)

For most programmers, a C or C++ program's life begins at the main function. They are blissfully unaware of the hidden steps that happen between invoking a program and executing main. Depending on the program and the compiler, there are all kinds of interesting functions that get run before main, automatically inserted by the compiler and linker and invisible to casual observers.

Unfortunately for programmers who are curious about the program startup process, the literature on what happens before main is quite sparse.

Embedded Artistry has been hard at working creating a C++ embedded framework. The final piece of the puzzle was implementing program startup code. To aid in the design of our framework's boot process, I performed an exploratory survey of existing program startup implementations. My goal is to identify a general program startup model. I also want to provide a more comprehensive look into how our programs get to main.

In this six-part series, we will be investigating what it takes to get to main:

  1. A General Overview of What Happens Before main()
  2. Exploring Startup Implementations: Newlib (ARM)
  3. Exploring Startup Implementations: OS X
  4. Exploring Startup Implementations: Custom Embedded System with ThreadX
  5. Abstracting a Generic Flow for Getting to main
  6. Implementing our Generic Startup Flow

Now that we have a high-level understanding of how our programs get to main, we can explore real-world implementations of program startup code.

Today's analysis focuses on Newlib. If you build embedded applications for ARM using the GNU arm-none-eabi toolchain, your program is linked with Newlib startup code by default. Newlib supports multiple architectures, but we will focus exclusively on the ARM startup path.

If you are interested in exploring Newlib startup routines on your own, you can download the Newlib source code or browse the source code online.

The boot flow is quite complicated, and it's easy to get mentally lost. You can refer to the Visual Summary throughout the article for a visual representation of the startup procedure and call stack.

Table of Contents:

  1. ARM Procedure Call Standard
  2. System Configuration
  3. Initial Exploration
    1. Boot Path
    2. _start Disassembly
  4. nRF52 Initial Boot
    1. Load from Flash to RAM
    2. Optional: Clear .bss
    3. SystemInit
    4. Call start
    5. IRQ Handlers
  5. nRF52 System Initialization
  6. Newlib ARM Startup
    1. crt0.s
      1. Stack Setup
      2. Initialize .bss
      3. Target-Specific Initialization
      4. argc and argv Initialization
      5. Call Global Constructors
    2. __libc_init_array
    3. __libc_fini_array
    4. Heap Limit and malloc
    5. atexit Family
      1. atexit
      2. __cxa_atexit
      3. __register_exitproc
      4. Automatic Registration of Destructors
    6. exit Family
      1. exit
      2. _exit
      3. __call_exitprocs
      4. _kill
  7. Visual Summary
  8. Startup Activity Checklist
  9. Further Reading

ARM Procedure Call Standard

Since we are going to look at ARM assembly, we will need to familiarize ourselves with the basics of the Procedure Call Standard for ARM Applications.

There are sixteen 32-bit registers and a status register (CPSR) in the ARM and Thumb instruction sets:

  • r0 (aka a1) is Argument register 1 and a result register
  • r1 (aka a2) is Argument register 2 and a result register
  • r2 (aka a3) is Argument register 3
  • r3 (aka a4) is Argument register 4
  • r4 (aka v1) is Variable register 1
  • r5 (aka v2) is Variable register 2
  • r6 (aka v3) is Variable register 3
  • r7 (aka v4) is Variable register 4
  • r8 (aka v5) is Variable register 5
  • r9 usage changes depending on the platform
  • r10 (aka v7) is Variable register 7
  • r11 (aka v7) is Variable register 8
  • r12 is the IP special purpose register (intra-procedure-call scratch register)
  • r2` is the SP special register (stack pointer)
  • r14 is the LR special register (link register)
  • r15 is the PC special register (program counter)

The standard says the following for the argument registers (r0-r3):

The first four registers r0-r3 (a1-a4) are used to pass argument values into a subroutine and to return a result value from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls).

We have multiple registers to hold the value of local variables:

Typically, the registers r4-r8, r10 and r11 (v1-v5, v7 and v8) are used to hold the values of a routine’s local variables. Of these, only v1-v4 can be used uniformly by the whole Thumb instruction set, but the AAPCS does not require that Thumb code only use those registers.

We must preserve specific registers when calling functions:

A subroutine must preserve the contents of the registers r4-r8, r10, r11 and SP (and r9 in PCS variants that designate r9 as v6)

ARM specifies that the stack pointer (SP) must always be aligned to a word boundary (i.e., sp % 4 == 0). For public interfaces, the stack must be aligned to a double-word boundary (i.e., sp % 8 == 0).

The least significant bit of a function address is an ARM/Thumb flag (1 == ARM, 0 == Thumb). This bit is set by the linker.

When we want to call a subroutine, we need to preserve the current function's persistent registers on the stack, store the return address in the LR register (so we know how to get back from our function), and change the PC to the subroutine address. ARM provides branching instructions which handle this process for us (e.g., bl, blx,bx`), although the process may still be performed manually.

Now, there are many details that we did not cover, but this basic overview provides enough details to understand some of the assembly that we will be analyzing. Particularly important to keep in mind: values put into r0-r3 represent arguments to functions, and values put into r4-r11 represent variables used in our current subroutine.

System Configuration

For this exploration, I used a Nordic nRF52840 Development Kit. The development kit has several examples provided by Nordic; I used the blinky program. I compiled and linked the program with the GNU ARM toolchain (version 8-2018-q4-major). The Nordic blinky program links against the Newlib libraries provided by the GNU ARM toolchain.

Because this is a Cortex-M processor, the program is compiled entirely in Thumb mode. We will discuss some aspects of the boot process which apply to Cortex-A processors that use ARM instructions.

Initial Exploration

Before we start blindly looking through the Newlib code base, we should do some initial exploration with our debugger as described in the last article.

To begin the investigation, I compiled the blinky example for the nRF52840 Development Kit (PCA10056 in the SDK parlance) in the "blank" configuration using the armgcc Makefile. I flashed the binary to the board with the nRF Connect Programmer,

First, let's start with a backtrace from main in an example program so we can see what code is run. Then we will look at the disassembly for the _start function that is provided by Newlib.

Boot Path

To investigate the path our program takes to get to main, we'll use gdb. The nRF52 DK has USB connection with an on-board debugging chip. I fired up a Jlink gdb server and connected to my board usingarm-none-eabi-gdb`.

Once the board is connected, we load the symbols for our application:

(gdb) file _build/nrf52840_xxaa.out
A program is being debugged already.
Are you sure you want to change the file? (y or n) y
Reading symbols from _build/nrf52840_xxaa.out...

Set the breakpoint for main:

(gdb) b main
Breakpoint 1 at 0x380: file ../../../main.c, line 62.

Enable backtraces to extend past main:

(gdb) set backtrace past-main on

Then restart and run the program:

(gdb) mon reset
Resetting target
(gdb) c
Continuing.

Breakpoint 1, main () at ../../../main.c:62
62      bsp_board_init(BSP_INIT_LEDS);

Our initial backtrace shows a corrupt frame prior to _start:

(gdb) bt
#0  main () at ../../../main.c:62
#1  0x0000028e in _start ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

This can happen when the _start routine is messing with stacks or frame pointers to set up the program according to the library and ABI requirements. We can confirm this by setting a breakpoint at _start and re-starting the program. This will allow us to look at the state of the program before stack modifications.

(gdb) b _start
Breakpoint 2 at 0x258
(gdb) mon reset
Resetting target
(gdb) c
Continuing.

Breakpoint 2, 0x00000258 in _start ()
(gdb) bt
#0  0x00000258 in _start ()
#1  0x000002ce in Reset_Handler () at ../../../../../../modules/nrfx/mdk/gcc_startup_nrf52840.S:280

Our program receives control at the Reset_Handler function in our processor's startup code. This is expected for an embedded platform, since the processor loads our program from memory and begins execution at the reset vector address.

Now we know that there are two areas to investigate for startup, and gdb helpfully provided the path to the gcc_startup_nrf52840.S file, which is where our investigation of the source code will begin.

_start Disassembly

Before we dive into the source code, let's look at the disassembly for the _start function with gdb.

(gdb) disass /m _start
Dump of assembler code for function _start:
0x00001240 <+0>: ldr r3, [pc, #84] ; (0x1298 <_start+88>) 
0x00001242 <+2>: cmp r3, #0 
0x00001244 <+4>: it eq 
0x00001246 <+6>: ldreq r3, [pc, #76] ; (0x1294 <_start+84>) 
0x00001248 <+8>: mov sp, r3 
0x0000124a <+10>: sub.w r10, r3, #65536 ; 0x10000 
0x0000124e <+14>: movs r1, #0 
0x00001250 <+16>: mov r11, r1 
0x00001252 <+18>: mov r7, r1 
0x00001254 <+20>: ldr r0, [pc, #76] ; (0x12a4 <_start+100>) 
0x00001256 <+22>: ldr r2, [pc, #80] ; (0x12a8 <_start+104>) 
0x00001258 <+24>: subs r2, r2, r0 
0x0000125a <+26>: bl 0x330c <memset> 
0x0000125e <+30>: ldr r3, [pc, #60] ; (0x129c <_start+92>) 
0x00001260 <+32>: cmp r3, #0 
0x00001262 <+34>: beq.n 0x1266 <_start+38> 
0x00001264 <+36>: blx r3 
0x00001266 <+38>: ldr r3, [pc, #56] ; (0x12a0 <_start+96>) 
0x00001268 <+40>: cmp r3, #0 
0x0000126a <+42>: beq.n 0x126e <_start+46> 
0x0000126c <+44>: blx r3 
0x0000126e <+46>: movs r0, #0 
0x00001270 <+48>: movs r1, #0 
0x00001272 <+50>: movs r4, r0 
0x00001274 <+52>: movs r5, r1 
0x00001276 <+54>: ldr r0, [pc, #52] ; (0x12ac <_start+108>) 
0x00001278 <+56>: cmp r0, #0 
0x0000127a <+58>: beq.n 0x1282 <_start+66> 
0x0000127c <+60>: ldr r0, [pc, #48] ; (0x12b0 <_start+112>) 
0x0000127e <+62>: nop.w 
0x00001282 <+66>: bl 0x32b4 <__libc_init_array> 
0x00001286 <+70>: movs r0, r4 
0x00001288 <+72>: movs r1, r5 
0x0000128a <+74>: bl 0x1554 <main()> 
0x0000128e <+78>: bl 0x3268 <exit> 
0x00001292 <+82>: nop 
0x00001294 <+84>: movs r0, r0 
0x00001296 <+86>: movs r0, r1 
0x00001298 <+88>: movs r0, r0 
0x0000129a <+90>: movs r0, #4 
0x0000129c <+92>: movs r0, r0 
0x0000129e <+94>: movs r0, r0 
0x000012a0 <+96>: movs r0, r0 
0x000012a2 <+98>: movs r0, r0 
0x000012a4 <+100>: lsls r0, r4, #3 
0x000012a6 <+102>: movs r0, #0 
0x000012a8 <+104>: lsls r4, r4, #10 
0x000012aa <+106>: movs r0, #0 
0x000012ac <+108>: movs r0, r0 
0x000012ae <+110>: movs r0, r0 
0x000012b0 <+112>: movs r0, r0 
0x000012b2 <+114>: movs r0, r0

Disassembly Highlights

We won't reconstruct the entire process from disassembly, but we can quickly note some highlights.

First, the routine sets up the stack pointer using the r3 register:

0x00001248 <+8>: mov sp, r3

The Newlib _start function handles initializing the .bss section contents (which holds uninitialized global and static data) to 0. Note the call to memset: r1 holds the value we are setting ('0'); r0 holds the start address of the .bss section; r2 is loaded with the end address of the .bss section, and then the start address is subtracted from it, giving us the size of the section.

0000124e <+14>: movs r1, #0 
[...]
0x00001254 <+20>: ldr r0, [pc, #76] ; (0x12a4 <_start+100>) 
0x00001256 <+22>: ldr r2, [pc, #80] ; (0x12a8 <_start+104>) 
0x00001258 <+24>: subs r2, r2, r0 
0x0000125a <+26>: bl 0x330c <memset>

From the disassembly, I don't immediately understand what's happening after memset, but I do notice some function calls (blx instructions). I'm also guessing that _start initializes argc and argv to 0, then preserves those in r4-r5. Looking at the commented and non-optimized source will clarify this part of the process.

I do recognize the next function call, which is conveniently named. This call will initialize the global constructors:

0x00001282 <+66>: bl 0x32b4 <__libc_init_array>

After we've called the global constructors, we put the (presumed) argc and argv values into our argument registers, and then call main:

0x00001286 <+70>: movs r0, r4 
0x00001288 <+72>: movs r1, r5 
0x0000128a <+74>: bl 0x1554 <main()>

Since the r0 register holds the value that main returns, we can invoke exit without needing to modify the argument registers:

0x0000128e <+78>: bl 0x3268 <exit>

The assembly instructions following exit is a mystery to me from this view. Let's see what the source investigation reveals.


nRF52840 Boot

Our backtrace showed us that our journey begins in the Reset_Handler function in gcc_startup_nrf52840.S (found in the nRF SDK).

The file begins by providing for stack storage:

.section .stack
#if defined(__STARTUP_CONFIG)
    .align __STARTUP_CONFIG_STACK_ALIGNEMENT
    .equ    Stack_Size, __STARTUP_CONFIG_STACK_SIZE
#elif defined(__STACK_SIZE)
    .align 3
    .equ    Stack_Size, __STACK_SIZE
#else
    .align 3
    .equ    Stack_Size, 8192
#endif
    .globl __StackTop
    .globl __StackLimit
__StackLimit:
    .space Stack_Size
    .size __StackLimit, . - __StackLimit
__StackTop:
    .size __StackTop, . - __StackTop

There are also provisions for heap storage:

.section .heap
    .align 3
#if defined(__STARTUP_CONFIG)
    .equ Heap_Size, __STARTUP_CONFIG_HEAP_SIZE
#elif defined(__HEAP_SIZE)
    .equ Heap_Size, __HEAP_SIZE
#else
    .equ Heap_Size, 8192
#endif
    .globl __HeapBase
    .globl __HeapLimit
__HeapBase:
    .if Heap_Size
    .space Heap_Size
    .endif
    .size __HeapBase, . - __HeapBase
__HeapLimit:
    .size __HeapLimit, . - __HeapLimit

This file also contains a declaration of all interrupt vectors and their associated handlers. A small sample is shown:

.section .isr_vector
    .align 2
    .globl __isr_vector
__isr_vector:
    .long   __StackTop                  /* Top of Stack */
    .long   Reset_Handler
    .long   NMI_Handler
    .long   HardFault_Handler
    .long   MemoryManagement_Handler
    .long   BusFault_Handler
    .long   UsageFault_Handler

    /// ...

    .size __isr_vector, . - __isr_vector

We then find the declaration of Reset_Handler:

.text
    .thumb
    .thumb_func
    .align 1
    .globl Reset_Handler
    .type Reset_Handler, %function
Reset_Handler:

Load from Flash to RAM

First, the reset handler copies data from flash to RAM.

The data is copied from the address of the __etext symbol, which represents the end of the .text section in flash storage. The data is copied to the address indicated by the __data_start__ symbol, and the number of bytes copied is calculated by subtracting the __data_start__ address from __bss_start__, which indicates the beginning of the next section. As the nRF startup code explains, __bss_start__ is used so users can insert their own initialized data section before the .bss section. Using this logic, it will be copied to RAM without any changes from the user.

ldr r1, =__etext
    ldr r2, =__data_start__
    ldr r3, =__bss_start__

    subs r3, r3, r2
    ble .L_loop1_done

.L_loop1:
    subs r3, r3, #4
    ldr r0, [r1,r3]
    str r0, [r2,r3]
    bgt .L_loop1

Optional: Clear .bss

Once the .data section contents are copied to RAM, there is an optional step for initializing the .bss section contents to 0. In our case, this code is not compiled. Newlib handles .bss initialization.

.L_loop1_done:
/* This part of work usually is done in C library startup code. Otherwise,
 * define __STARTUP_CLEAR_BSS to enable it in this startup. This section
 * clears the RAM where BSS data is located.
 *
 * The BSS section is specified by following symbols
 *    __bss_start__: start of the BSS section.
 *    __bss_end__: end of the BSS section.
 *
 * All addresses must be aligned to 4 bytes boundary.
 */
#ifdef __STARTUP_CLEAR_BSS
    ldr r1, =__bss_start__
    ldr r2, =__bss_end__

    movs r0, 0

    subs r2, r2, r1
    ble .L_loop3_done

.L_loop3:
    subs r2, r2, #4
    str r0, [r1, r2]
    bgt .L_loop3

.L_loop3_done:
#endif /* __STARTUP_CLEAR_BSS */

SystemInit

Before invoking the C runtime startup routine, a SystemInit function is called. This function, which we will look at next, is responsible for initializing the processor and applying behavioral fixes for relevant errata.

bl SystemInit

Call _start

Once the processor is initialized, we call the _start function to initialize the C runtime. Note that the nRF startup code allows you to define a custom entry point with a compiler definition.

/* Call _start function provided by libraries.  If those libraries 
 * are not accessible, define __START as your entry point. */
#ifndef __START
#define __START _start
#endif
    bl __START

IRQ Handlers

The gcc_startup_nrf52840.S also contains dummy exception handler function definitions. For example:

.weak   NMI_Handler
    .type   NMI_Handler, %function
NMI_Handler:
    b       .
    .size   NMI_Handler, . - NMI_Handler

    .weak   HardFault_Handler
    .type   HardFault_Handler, %function
HardFault_Handler:
    b       .
    .size   HardFault_Handler, . - HardFault_Handler

A default handler is declared, which performs an infinite loop:

.globl  Default_Handler
    .type   Default_Handler, %function
Default_Handler:
    b       .
    .size   Default_Handler, . - Default_Handler

All other IRQ handlers are mapped to this default handler. Users are able to overwrite these handlers with their own implementations as needed.

.macro  IRQ handler
.weak   \handler
.set    \handler, Default_Handler
.endm

IRQ  POWER_CLOCK_IRQHandler
IRQ  RADIO_IRQHandler
IRQ  UARTE0_UART0_IRQHandler

/// ...

After the IRQ handlers are supplied, the file ends.

.end

nRF52 System Initialization

The SystemInit function is implemented in system_nrf52840.c (found in the nRF SDK). For a normal application, this file would be modified to suit the platform's requirements. We'll look at the default implementation for our processor.

First, SWO trace functionality is enabled in the processor. If ENABLE_SWO is not defined, the pin is left as normal GPIO.

#if defined (ENABLE_SWO)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    NRF_CLOCK->TRACECONFIG |= CLOCK_TRACECONFIG_TRACEMUX_Serial << CLOCK_TRACECONFIG_TRACEMUX_Pos;
    NRF_P1->PIN_CNF[0] = (GPIO_PIN_CNF_DRIVE_H0H1 << GPIO_PIN_CNF_DRIVE_Pos) | (GPIO_PIN_CNF_INPUT_Connect << GPIO_PIN_CNF_INPUT_Pos) | (GPIO_PIN_CNF_DIR_Output << GPIO_PIN_CNF_DIR_Pos);
#endif

Next, Trace functionality is enabled in the processor. If ENABLE_TRACE is not defined, the pins are left as normal GPIO.

#if defined (ENABLE_TRACE)
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    NRF_CLOCK->TRACECONFIG |= CLOCK_TRACECONFIG_TRACEMUX_Parallel << CLOCK_TRACECONFIG_TRACEMUX_Pos;
    NRF_P0->PIN_CNF[7]  = (GPIO_PIN_CNF_DRIVE_H0H1 << GPIO_PIN_CNF_DRIVE_Pos) | (GPIO_PIN_CNF_INPUT_Connect << GPIO_PIN_CNF_INPUT_Pos) | (GPIO_PIN_CNF_DIR_Output << GPIO_PIN_CNF_DIR_Pos);

// ... more pin configurations in the actual implementation

#endif

Following debug configuration, the system checks for a variety of errata conditions and applies fixes as necessary. Here are a few examples:

/* Workaround for Errata 98 "NFCT: Not able to communicate with the peer"  */
if (errata_98()){
    *(volatile uint32_t *)0x4000568Cul = 0x00038148ul;
}

/* Workaround for Errata 103 "CCM: Wrong reset value of CCM MAXPACKETSIZE"  */
if (errata_103()){
    NRF_CCM->MAXPACKETSIZE = 0xFBul;
}

Following the errata section, the FPU is initialized if the program has been compiled with floating point support. The __FPU_USED macro is supplied by the compiler.

#if (__FPU_USED == 1)
    SCB->CPACR |= (3UL << 20) | (3UL << 22);
    __DSB();
    __ISB();
#endif

If NFC is not used for an nRF52 platform, the associated NFC pins are configured as normal GPIO.

#if defined (CONFIG_NFCT_PINS_AS_GPIOS)
    if ((NRF_UICR->NFCPINS & UICR_NFCPINS_PROTECT_Msk) == (UICR_NFCPINS_PROTECT_NFC << UICR_NFCPINS_PROTECT_Pos)){
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Wen << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->NFCPINS &= ~UICR_NFCPINS_PROTECT_Msk;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Ren << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NVIC_SystemReset();
    }
#endif

The nRF allows a GPIO to be configured as a reset pin. If CONFIG_GPIO_AS_PINRESET is defined, a dedicated GPIO will be configured to act as a reset pin.

#if defined (CONFIG_GPIO_AS_PINRESET)
    if (((NRF_UICR->PSELRESET[0] & UICR_PSELRESET_CONNECT_Msk) != (UICR_PSELRESET_CONNECT_Connected << UICR_PSELRESET_CONNECT_Pos)) ||
        ((NRF_UICR->PSELRESET[1] & UICR_PSELRESET_CONNECT_Msk) != (UICR_PSELRESET_CONNECT_Connected << UICR_PSELRESET_CONNECT_Pos))){
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Wen << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->PSELRESET[0] = 18;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_UICR->PSELRESET[1] = 18;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Ren << NVMC_CONFIG_WEN_Pos;
        while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
        NVIC_SystemReset();
    }
#endif

Finally, the system clock is initialized:

SystemCoreClockUpdate();

Newlib ARM Startup

After data has been relocated and the processor properly initialized, the reset handler calls the _start function. For our GCC ARM application, this function is supplied by Newlib.

The Newlib project is divided into two major parts: newlib and libgloss. The newlib portion is an implementation of libc and libm. The libgloss portion contains platform-specific code, such as startup files, board support packages, and I/O support for the C library.

When exploring the Newlib code base on your own, it is important to note the distinction between libgloss and newlib. The libgloss division happened after the inception of the Newlib project. Many of the same files are found in the newlib folder and the libgloss folder. For platform-specific code, you should prefer the libgloss implementations. These are newer, and the older implementations remain in the newlib folder for backwards compatibility with older targets.

crt0.s

The _start function for the ARM architecture is found in libgloss/arm/crt0.S

The _start function is quite lengthy, so I will be providing highlights of the full implementation. The startup code presented below has also simplified from the code found in crt0.S. The full implementation supports semi-hosting, where a debugger handles parts of the standard library functionality. I've removed the monitor-related code to simplify our current review.

Newlib implements a single runtime that supports both ARM and Thumb modes. This can be confusing, since not all operations apply to both modes. Because we are using a Cortex-M processor (the nRF52), the program is compiled entirely in Thumb mode. Some startup code only applies when ARM mode is enabled, and I will highlight this as best as I can.

The file opens with preprocessor definitions, logic for selecting the proper ARM/Thumb architecture, and a declaration of the _start`` function. The most important preprocessor entry for our current exploration isHAVE_INITFINI_ARRAY` selection logic.

#ifdef HAVE_INITFINI_ARRAY
#define _init   __libc_init_array
#define _fini   __libc_fini_array
#endif

When HAVE_INITFINI_ARRAY is defined, the _init and _fini function calls will be exchanged with __libc_init_array and __libc_fini_array respectively. This macro comes into play - our ARM program uses the .init_array and .fini_array sections.

We should also note an assembly macro which we will encounter in the startup code: indirect_call.

.macro indirect_call reg
#ifdef HAVE_CALL_INDIRECT
    blx \reg
#else
    mov lr, pc
    mov pc, \reg
#endif
.endm

The indirect_call is used to mimic blx behavior for architectures that do not support that instruction, as described in the summary of the ARM Procedure Call Standard.

We eventually reach the proper beginning of the _start function, which is aliased as _mainCRTStartup:

FUNC_START  _mainCRTStartup
    FUNC_START  _start
#if defined(__ELF__) && !defined(__USING_SJLJ_EXCEPTIONS__)
    /* Annotation for EABI unwinding tables.  */
    .fnstart
#endif

Stack Setup

The first order of business is to set up the stacks for the various ARM processor modes.

The linker script may provide the stack address with the __stack symbol, which is then made accessible to the assembly via .Lstack:

.Lstack:
    .word   __stack

The stack address is loaded and checked to make sure it is a non-zero value:

ldr r3, .Lstack
cmp r3, #0

If the __stack symbol is not defined, the alternate value provided in the .LC0 variable is used instead:

#ifdef __thumb2__
    it  eq
#endif
#ifdef THUMB1_ONLY
    bne .LC28
    ldr r3, .LC0
.LC28:
#else
    ldreq   r3, .LC0
#endif

Once the stack address is loaded into r3, we work through the various user modes and set up stacks and stack limits. This operation only applies to programs compiled in ARM mode, bceause Thumb has no concept of user modes.

If the processor is already operating in user mode, or if Thumb mode is being used, this section is skipped. Our Cortex-M-based nRF52 only uses Thumb mode, so this section is skipped.

/* Note: This 'mov' is essential when starting in User, and ensures we
         always get *some* sp value for the initial mode, even if we
         have somehow missed it below (in which case it gets the same
         value as FIQ - not ideal, but better than nothing.) */
    mov sp, r3
#ifdef PREFER_THUMB
    /* XXX Fill in stack assignments for interrupt modes.  */
#else
    mrs r2, CPSR
    tst r2, #0x0F   /* Test mode bits - in User of all are 0 */
    beq .LC23       /* "eq" means r2 AND #0x0F is 0 */
    msr     CPSR_c, #0xD1   /* FIRQ mode, interrupts disabled */
    mov     sp, r3
    sub sl, sp, #0x1000 /* This mode also has its own sl (see below) */

    mov r3, sl
    msr     CPSR_c, #0xD7   /* Abort mode, interrupts disabled */
    mov sp, r3
    sub r3, r3, #0x1000

    msr     CPSR_c, #0xDB   /* Undefined mode, interrupts disabled */
    mov sp, r3
    sub r3, r3, #0x1000

    msr     CPSR_c, #0xD2   /* IRQ mode, interrupts disabled */
    mov sp, r3
    sub r3, r3, #0x2000

    msr     CPSR_c, #0xD3   /* Supervisory mode, interrupts disabled */

    mov sp, r3
    sub r3, r3, #0x8000 /* Min size 32k */
    bic r3, r3, #0x00FF /* Align with current 64k block */
    bic r3, r3, #0xFF00

    str r3, [r3, #-4]   /* Move value into user mode sp without */
    ldmdb   r3, {sp}^       /* changing modes, via '^' form of ldm */
    orr r2, r2, #0xC0   /* Back to original mode, presumably SVC, */
    msr CPSR_c, r2  /* with FIQ/IRQ disable bits forced to 1 */
#endif

Note that setting up each mode is currently not performed for Thumb code. Only the user mode stack is initialized for thumb programs. That's why we did not observe this setup code in our disassembly of _start.

The last portion of the stack setup process puts an arbitrary stack limit in place. Unlike the __stack definition which is provided by the linker, the stack limit is an arbitrarily decided value of 64kB. This may be problematic if we have a larger stack or if the stack runs into the heap.

.LC23:
#ifdef THUMB1_ONLY
    movs    r2, #64
    lsls    r2, r2, #10
    subs    r2, r3, r2
    mov sl, r2
#else
    sub sl, r3, #64 << 10   /* Still assumes 256bytes below sl */
#endif

Initialize .bss

Once our stack is set up, the .bss sections I cleared. The .bss section start and end addresses are made available through the .LC1 and .LC2 variables:

.LC1:
    .word   __bss_start__
.LC2:
    .word   __bss_end__

The arguments to memset are loaded into registers, and the size is calculated:

/* Zero the memory in the .bss section.  */
    movs    a2, #0          /* Second arg: fill value */
    mov fp, a2          /* Null frame pointer */
    mov r7, a2          /* Null frame pointer for Thumb */

    ldr a1, .LC1        /* First arg: start of memory block */
    ldr a3, .LC2
    subs    a3, a3, a1      /* Third arg: length of block */

Once the arguments are loaded, we call memset (and switch to Thumb mode if appropriate):

#if __thumb__ && !defined(PREFER_THUMB)
    /* Enter Thumb mode.... */
    add a4, pc, #1  /* Get the address of the Thumb block */
    bx  a4      /* Go there and start Thumb decoding  */

    .code 16
    .global __change_mode
    .thumb_func
__change_mode:
#endif

    bl  FUNCTION (memset)

Target-Specific Initialization

Once the .bss section is cleared, optional target-specific early initialization is performed.

The startup code supports two weakly-linked functions:

.weak FUNCTION (hardware_init_hook)
    .weak FUNCTION (software_init_hook)

They are weakly-linked because they are optional. If a platform does not require this functionality the functions will not be defined and a value of 0 will be loaded for the variable. These functions are made available via the .Lhwinit and .Lswinit variables:

.Lhwinit:
    .word   FUNCTION (hardware_init_hook)
.Lswinit:
    .word   FUNCTION (software_init_hook)

The startup code checks whether these functions are defined, and calls them if they are.

ldr r3, .Lhwinit
    cmp r3, #0
    beq .LC24
    indirect_call r3
.LC24:
    ldr r3, .Lswinit
    cmp r3, #0
    beq .LC25
    indirect_call r3

argc and argv Initialization

The Newlib ARM startup code has a simple solution for argc and argv: they are initialized to 0:

.LC25:
    movs    r0, #0      /*  no arguments  */
    movs    r1, #0      /*  no argv either */

Call Global Constructors

Next, we call global constructors. The code is provisioned such that it will work If global constructors are not present. Constructors are enabled in our configuration.

First, we store the values of r0 and r1 to r4 and r5, since we will be calling other functions:

movs    r4, r0
    movs    r5, r1

First, we will register the _fini function (which is actually __libc_fini_array thanks to the preprocessor) with atexit. This ensures that global destructors will be run when exiting the program.

Newlib supports a "light exit" implementation, which is controlled by the _LITE_EXIT compiler definition. For embedded systems, this is a wonderful option. Our programs do not perform normal exit procedures; they simply run until power is removed. Cleaning up after the program is not a requirement, and exit functions can be discarded.

If _LITE_EXIT is enabled, atexit is weakly linked. If atexit is linked in our application, it will be called with __libc_fini_array as an argument. If it is not defined, the global destructors will not be registered. Our current configuration is using _LITE_EXIT without atexit.

#ifdef _LITE_EXIT
    /* Make reference to atexit weak to avoid unconditionally pulling in
       support code.  Refer to comments in __atexit.c for more details.  */
    .weak   FUNCTION(atexit)
    ldr r0, .Latexit
    cmp r0, #0
    beq .Lweak_atexit
#endif
    ldr r0, .Lfini
    bl  FUNCTION (atexit)

After the global destructors are registered, the _init function is invoked (which is actually __libc_init_array thanks to the preprocessor). This function calls the global constructors, and it is always run.

.Lweak_atexit:
    bl  FUNCTION (_init)

Once we have called the global constructors, the values for argc and argv are moved into the function argument registers r0 and r1 so we can call main:

movs    r0, r4
    movs    r1, r5

Call main

With the argc and argv function arguments stored in r0 and r1, we can safely call main:

bl  FUNCTION (main)

Program Exit

After main returns, exit is called using its return code. We do not expect exit to return, but if it does then we trap the program in SWI_Exit.

bl  FUNCTION (exit)     /* Should not return.  */

#if __thumb__ && !defined(PREFER_THUMB)
    /* Come out of Thumb mode.  This code should be redundant.  */

    mov a4, pc
    bx  a4

    .code 32
    .global change_back
change_back:
    /* Halt the execution.  This code should never be executed.  */
    /* With no debug monitor, this probably aborts (eventually).
       With a Demon debug monitor, this halts cleanly.
       With an Angel debug monitor, this will report 'Unknown SWI'.  */
    swi SWI_Exit
#endif

Now that we've looked over the _start function, let's look at the various functions that _start called.

__libc_init_array

The __libc_init_array() function can be found in newlib/libc/misc/init.c.

Depending on the architecture, compiler, and linker, constructors are placed into the .init_array section or the .init section. The Newlib ARM startup code is flexible and can handle any combination of cases. If HAVE_INITFINI_ARRAY is not defined, _start calls _init directly instead of calling __libc_init_array. If HAVE_INITFINI_ARRAY is defined, __libc_init_array calls the constructors in the .preinit_array and .init_array sections. If .init is also present for an architecture, the constructors stored in that section will also be invoked.

ARM code typically uses the __init_array instead of _init. In our current case, HAVE_INITFINI_ARRAY is defined and HAVE_INIT_FINI is not.

/* Handle ELF .{pre_init,init,fini}_array sections.  */
#include <sys/types.h>

#ifdef HAVE_INITFINI_ARRAY

/* These magic symbols are provided by the linker.  */
extern void (*__preinit_array_start []) (void) __attribute__((weak));
extern void (*__preinit_array_end []) (void) __attribute__((weak));
extern void (*__init_array_start []) (void) __attribute__((weak));
extern void (*__init_array_end []) (void) __attribute__((weak));

#ifdef HAVE_INIT_FINI
extern void _init (void);
#endif

/* Iterate over all the init routines.  */
void
__libc_init_array (void)
{
  size_t count;
  size_t i;

  count = __preinit_array_end - __preinit_array_start;
  for (i = 0; i < count; i++)
    __preinit_array_start[i] ();

#ifdef HAVE_INIT_FINI
  _init ();
#endif

  count = __init_array_end - __init_array_start;
  for (i = 0; i < count; i++)
    __init_array_start[i] ();
}
#endif

__libc_fini_array

The __libc_fini_array() function can be found in newlib/libc/misc/fini.c.

Depending on the architecture, compiler, and linker, destructors are placed into the .fini_array section or the .fini section. If the program is configured with full exit support, these functions will be executed before the program exits. In a LITE_EXIT configuration, the destructors are ignored.

Like __libc_init_array, the functionality is decided by two macros. If HAVE_INITFINI_ARRAY is not defined, _start registers _fini with atexit instead of __libc_fini_array. If HAVE_INITFINI_ARRAY is defined, the __libc_fini_array function is registered. When __libc_fini_array is invoked by exit, it calls the destructors in the .fini_array section. If .fini is also present for an architecture, the constructors stored in that section will also be invoked.

ARM code typically uses the __fini_array instead of _fini. In our current case, HAVE_INITFINI_ARRAY is defined and HAVE_INIT_FINI is not.

/* Handle ELF .{pre_init,init,fini}_array sections.  */
#include <sys/types.h>

#ifdef HAVE_INITFINI_ARRAY
extern void (*__fini_array_start []) (void) __attribute__((weak));
extern void (*__fini_array_end []) (void) __attribute__((weak));

#ifdef HAVE_INIT_FINI
extern void _fini (void);
#endif

/* Run all the cleanup routines.  */
void
__libc_fini_array (void)
{
  size_t count;
  size_t i;

  count = __fini_array_end - __fini_array_start;
  for (i = count; i > 0; i--)
    __fini_array_start[i-1] ();

#ifdef HAVE_INIT_FINI
  _fini ();
#endif
}
#endif

Heap Limit and malloc

The __heap_limit variable set during the _start routine is used by _sbrk, found in libgloss/arm/syscalls.c.

The _sbrk function is used to allocate memory for the platform. For more information heap allocation and sbrk, read this article about the glibc heap implementation.

While the _sbrk function is not directly used in the startup code, we can see that setting __heap_limit during _start is effectively configuring the program's heap. If the _start routine does not update __heap_limit, the default value is recognized and there will be no detection for allocations reaching beyond the heap limit.

/* Heap limit returned from SYS_HEAPINFO Angel semihost call.  */
uint __heap_limit = 0xcafedead;

void * __attribute__((weak))
_sbrk (ptrdiff_t incr)
{
  extern char end asm ("end"); /* Defined by the linker.  */
  static char * heap_end;
  char * prev_heap_end;

  if (heap_end == NULL)
    heap_end = & end;

  prev_heap_end = heap_end;

  if ((heap_end + incr > stack_ptr)
      /* Honour heap limit if it's valid.  */
      || (__heap_limit != 0xcafedead && heap_end + incr >
         (char *)__heap_limit))
    {
      errno = ENOMEM;
      return (void *) -1;
    }

  heap_end += incr;

  return (void *) prev_heap_end;
}

atexit Family

The atexit family of functions is responsible for registering functions to be called when the program exits, including the global destructors. We will explore the following functions:

We don't typically need exit functionality for our embedded platforms. Rarely is there a concept of a program "exit" which requires cleanup of resources. Instead, our programs run until they are terminated by a reset, off switch, or our of power.

Newlib provides for this behavior through the _LITE_EXIT compilation option. This option changes behavior related to the exit-time requirements and reduces our binary size. Our program is technically compiled under _LITE_EXIT, but we will still analyze the normal exit-related behavior for instructional purposes.

The Newlib code comments are helpful in explaining the differences between the two exit configurations. Under normal circumstances, we can expect the following exit call graphs ( an -> indicates "invokes"):

Default (without lite exit) call graph is like:
 *  _start -> atexit -> __register_exitproc
 *  _start -> __libc_init_array -> __cxa_atexit -> __register_exitproc
 *  on_exit -> __register_exitproc
 *  _start -> exit -> __call_exitprocs

When lite exit is enabled, the call graph changes. The atexit, __register_exitproc, and __call_exitprocs functions are changed to weak symbols, which may not be linked by the final program. These function call stacks are modified:

Lite exit makes some of above calls as weak reference, so that size
expansive  functions __register_exitproc and __call_exitprocs may 
not be linked. These calls are:
 *    _start w-> atexit
 *    __cxa_atexit w-> __register_exitproc
 *    exit w-> __call_exitprocs

Let's look at how these exit functions operate.

atexit

The atexit function is used to register calls that should be invoked when the program exits. Most notably, this call is used to register the function in .fini or .fini_array during the startup process. If the _LITE_EXIT configuration is used, this function step will be avoided.

The atexit function is implemented in newlib/libc/stdlib/atexit.c. This implementation forwards the input function argument to __register_exitproc while noting that the call originated from atexit (using the __et_atexit argument).

#include <stdlib.h>
#include "atexit.h"

int
atexit (void (*fn) (void))
{
  return __register_exitproc (__et_atexit, fn, NULL, NULL);
}

__cxa_atexit

The __cxa_atexit call is used similarly to atexit, but often for handling functions to be called when a dynamic library is unloaded. In many implementations, such as this one, atexit and __cxa_atexit share implementations.

The __cxa_atexit function is implemented in newlib/libc/stdlib/cxa_atexit.c. This implementation forwards the input function and arguments to __register_exitproc while indicating that the call originated from __cxa_atexit (using the __et_cxa argument).

If the _LITE_EXIT configuration is used, then __register_exitproc may be weakly linked. In this case, __cxa_atexit will blindly return success (0).

int __cxa_atexit (void (*fn) (void *), void *arg, void *d)
{
#ifdef _LITE_EXIT
  /* Refer to comments in __atexit.c for more details of lite exit.  */
  int __register_exitproc (int, void (*fn) (void), void *, void *)
    __attribute__ ((weak));

  if (!__register_exitproc)
    return 0;
  else
#endif
    return __register_exitproc (__et_cxa, (void (*)(void)) fn, arg, d);
}

__register_exitproc

We've seen two uses of __register_exitproc, the common routine that handles all atexit-like functionality. __register_exitproc is called when the program exits or when a shared library is unloaded.

The __register_exitproc function is implemented in newlib/libc/stdlib/__atexit.c. This function must support a variety of configurations and behaviors: _LITE_EXIT vs standard exit, single-threaded vs multi-threaded, atexit vs __cxa_atexit. I've stripped out some of the #ifdef blocks to make the code more readable.

The function starts by acquiring a lock if threading is enabled:

int __register_exitproc (int type, void (*fn) (void), void *arg, void *d)
{
  struct _on_exit_args * args;
  register struct _atexit *p;

#ifndef __SINGLE_THREAD__
  __lock_acquire_recursive(__atexit_recursive_mutex);
#endif

And we grab our _GLOBAL_ATEXIT list of functions. If this has not been initialized yet, we assign it to the initial list value.

p = _GLOBAL_ATEXIT;
if (p == NULL)
{
    _GLOBAL_ATEXIT = p = _GLOBAL_ATEXIT0;
}

By default, atexit requires the C runtime to support registering at least 32 functions (_ATEXIT_SIZE). Newlib handles this by allocating 32-chunk blocks of memory. Once the current block is full, a new block will be allocated and added to the list

If there is no malloc implementation for the system, or if dynamic allocations for atexit are not allowed, the function will fail and return an error code instead of allocating a new block.

if (p->_ind >= _ATEXIT_SIZE)
    {
#if !defined (_ATEXIT_DYNAMIC_ALLOC) || !defined (MALLOC_PROVIDED)
#ifndef __SINGLE_THREAD__
      __lock_release_recursive(__atexit_recursive_mutex);
#endif
      return -1;
#else
      p = (struct _atexit *) malloc (sizeof *p);
      if (p == NULL)
    {
#ifndef __SINGLE_THREAD__
      __lock_release_recursive(__atexit_recursive_mutex);
#endif
      return -1;
    }
      p->_ind = 0;
      p->_next = _GLOBAL_ATEXIT;
      _GLOBAL_ATEXIT = p;
      p->_on_exit_args_ptr = NULL;
#endif
    }

We observed two different type values for this call: __et_atexit and __et_cxa. If __cxa_atexit was called, additional arguments were provided and need to be stored for future retrieval. Arguments and function pointers are stored in the current index, and then it is incremented.

if (type != __et_atexit)
{
    args = &p->_on_exit_args;
    args->_fnargs[p->_ind] = arg;
    args->_fntypes |= (1 << p->_ind);
    args->_dso_handle[p->_ind] = d;
    if (type == __et_cxa)
        args->_is_cxa |= (1 << p->_ind);
}
p->_fns[p->_ind++] = fn;

Once we are done, we can unlock and exit the function:

#ifndef __SINGLE_THREAD__
  __lock_release_recursive(__atexit_recursive_mutex);
#endif
  return 0;
}

Automatic Registration of Destructors

One interesting note is that Newlib provides features for registering global destructors (in .fini or .fini_array) within the C library, rather than in startup code. This automatic registration code is provided in newlib/libc/stdlib/__call_atexit.c.

A __libc_fini symbol is weakly defined. You can define __libc_fini to _fini or _fini_array in your linker script, and the C library will handle the registration so that your startup code does not need to call atexit.

extern char __libc_fini __attribute__((weak));

A registration function is defined and marked as a high-priority constructor, which places it into the .init or .init_array section. Since destructors are stored in LIFO order, and the .fini and .fini_array functions should run last, the constructor is attempting to be the first to register with atexit.

static void register_fini(void) __attribute__((constructor (0)));

The register function checks for a valid __libc_fini symbol and registers the destructors if its defined.

static void 
register_fini(void)
{
  if (&__libc_fini) {
#ifdef HAVE_INITFINI_ARRAY
    extern void __libc_fini_array (void);
    atexit (__libc_fini_array);
#else
    extern void _fini (void);
    atexit (_fini);
#endif
  }
}

exit Family

To complete our analysis of _start and crt0.s, we'll look at the exit family of functions:

exit

The exit function is implemented in newlib/libc/stdlib/exit.c.

The Newlib exit function is a wrapper. exit calls all registered exit-time functions via __call_exitprocs. If the _LITE_EXIT configuration is used, this function may not be defined.

Following the invocation of exit-time destructors, a _GLOBAL_REEINT->__cleanup function is called. This function flushes stdio buffers, if necessary.

Once all destruction and cleanup activities are complete, control proceeds to _exit.

void exit (int code)
{
#ifdef _LITE_EXIT
  /* Refer to comments in __atexit.c for more details of lite exit.  */
  void __call_exitprocs (int, void *) __attribute__((weak));
  if (__call_exitprocs)
#endif
    __call_exitprocs (code, NULL);

  if (_GLOBAL_REENT->__cleanup)
    (*_GLOBAL_REENT->__cleanup) (_GLOBAL_REENT);
  _exit (code);
}

__call_exitprocs

The __call_exitprocs function is responsible for calling exit-time destructor routines that were registered with the atexit famil of functions. __call_exitprocs is implemented in newlib/libc/stdlib/__call_atexit.c. I've stripped out some of the #ifdef blocks to make the code more readable.

The function starts by acquiring a lock if threading is enabled:

void  __call_exitprocs (int code, void *d)
{
  register struct _atexit *p;
  struct _atexit **lastp;
  register struct _on_exit_args * args;
  register int n;
  int i;
  void (*fn) (void);

#ifndef __SINGLE_THREAD__
  __lock_acquire_recursive(__atexit_recursive_mutex);
#endif

Next the linked-list of exit-time functions is accessed. Note the restart label, as it will be referenced later.

restart:
  p = _GLOBAL_ATEXIT;
  lastp = &_GLOBAL_ATEXIT;

For each entry in the list, the following actions are performed:

  • Arguments are loaded
  • The function is removed from the list
  • The index is decremented
  • If unloading a shared library, check that the _dso_handle matches the unloaded library
    • Skip to the next entry if there is a mismatch
  • Check if the function has been called
    • Skip to the next entry if it has already been called
  • Call the function

The loop also checks the index after calling the destructor. If that function registered new exit-time functions, the loop jumps back to restart to ensure to preserve the destructor LIFO order.

while (p)
    {
      args = &p->_on_exit_args;
      for (n = p->_ind - 1; n >= 0; n--)
    {
      int ind;

      i = 1 << n;

      /* Skip functions not from this dso.  */
      if (d && (!args || args->_dso_handle[n] != d))
        continue;

      /* Remove the function now to protect against the
         function calling exit recursively.  */
      fn = p->_fns[n];
      if (n == p->_ind - 1)
        p->_ind--;
      else
        p->_fns[n] = NULL;

      /* Skip functions that have already been called.  */
      if (!fn)
        continue;

      ind = p->_ind;

      /* Call the function.  */
      if (!args || (args->_fntypes & i) == 0)
        fn ();
      else if ((args->_is_cxa & i) == 0)
        (*((void (*)(int, void *)) fn))(code, args->_fnargs[n]);
      else
        (*((void (*)(void *)) fn))(args->_fnargs[n]);

      /* The function we called call atexit and registered another
         function (or functions).  Call these new functions before
         continuing with the already registered functions.  */
      if (ind != p->_ind || *lastp != p)
        goto restart;
    } // end of for - while still in effect

At the end of each block of exit-functions, the now-empty block is removed from the list and the memory is freed. If malloc is not provided or dynamic allocations in atexit are disallowed, the function ends after the first block.

// while still in effect
#if !defined (_ATEXIT_DYNAMIC_ALLOC) || !defined (MALLOC_PROVIDED)
      break;
#else
      /* Move to the next block.  Free empty blocks except the last one,
     which is part of _GLOBAL_REENT.  */
      if (p->_ind == 0 && p->_next)
    {
      /* Remove empty block from the list.  */
      *lastp = p->_next;
      free (p);
      p = *lastp;
    }
      else
    {
      lastp = &p->_next;
      p = p->_next;
    }
#endif
    } // end of while

The lock is released, and the function exits.

#ifndef __SINGLE_THREAD__
  __lock_release_recursive(__atexit_recursive_mutex);
#endif
}

_exit

The _exit function is found at libgloss/arm/_exit.c. This function is simply a wrapper around _kill_shared.

void _exit (int status)
{
  /* The same SWI is used for both _exit and _kill.
     For _exit, call the SWI with "reason" set to 
    ADP_Stopped_ApplicationExit to mark a standard exit.
     Note: The RDI implementation of _kill_shared throws away all its
     arguments and all implementations ignore the first argument.  */
  _kill_shared (-1, status, ADP_Stopped_ApplicationExit);
}

_kill

The _kill_shared function is implemented in libgloss/arm/_kill.c.

When we remove the Semihosting / debug montior suport, this function does nothing:

int _kill_shared (int pid, int sig, int reason)
{
  (void) pid; (void) sig;

  __builtin_unreachable();
}

When debug monitor support is included, the __builtin_unreachable() call makes sense, because the debug monitor will trap the code in an SWI handler. If we have compiled without debug monitor support, this function will return up the call stack to crt0.s, and we will invoke the SWI handler anyway:

swi    SWI_Exit

Visual Summary


Startup Activity Checklist

In the first article of this series, we reviewed a broad range of startup activities that occur before main is called.

Here is a checklist of actions that were observed in the Newlib ARM program startup procedures:

  • [x] Early low-level initialization of the processor/hardware
  • [x] Stack initialization
  • [x] Frame pointer initialization
  • [x] C/C++ runtime setup
    • [x] Handle relocations (some sections are copied from flash to RAM)
    • [x] Initialize .bss
    • [x] Call global constructors
    • [x] Prepare argc, argv (set to 0)
    • [ ] Prepare environment variables
    • [x] Heap initialization
    • [ ] stdio initialization
    • [ ] Initialize exception support
    • [x] Register destructors and other exit-time functionality
  • [ ] System scaffolding setup
    • [ ] Threading support
    • [ ] Thread local storage
    • [ ] Buffer overrun detection
    • [ ] Run-time error checks
    • [ ] Locale settings
    • [ ] Math error handling
    • [ ] Math precision
  • [x] Jump to main
  • [x] Exit after main

Related Articles

A General Overview of What Happens Before main()

For most programmers, a C or C++ program's life begins at the main function. They are blissfully unaware of the hidden steps that happen between invoking a program and executing main. Depending on the program and the compiler, there are all kinds of interesting functions that get run before main, automatically inserted by the compiler and linker and invisible to casual observers.

Unfortunately for programmers who are curious about the program startup process, the literature on what happens before main is quite sparse.

Embedded Artistry has been hard at working creating a C++ embedded framework. The final piece of the puzzle was implementing program startup code. To aid in the design of our framework's boot process, I performed an exploratory survey of existing program startup implementations. My goal is to identify a general program startup model. I also want to provide a more comprehensive look into how our programs get to main.

In this six-part series, we will be investigating what it takes to get to main:

  1. A General Overview of What Happens Before main()
  2. Exploring Startup Implementations: Newlib (ARM)
  3. Exploring Startup Implementations: OS X
  4. Exploring Startup Implementations: Custom Embedded System with ThreadX
  5. Abstracting a Generic Flow for Getting to main
  6. Implementing our Generic Startup Flow

To begin our investigation, we will provide a summary of what happens in a program before main. The steps and responsibilities we describe are generalized so that they apply to most systems. We will supplement the general theory in the following articles with an analysis of real-world implementations.

Table of Contents:

  1. Getting to Main: A General Overviewy
    1. The _start Function
    2. Runtime Setup
    3. Other Scaffolding
    4. Jumping to main
    5. Returning from main
  2. How Do We Get to _start?
    1. Baremetal: reset vector
    2. Bootloader launches application
    3. OS Calls an exec function
  3. Exploring On Your Own
  4. Further Reading

Getting to Main: A General Overview

Before we dive into our exploration of how existing systems get to main, we should develop a hypothesis about what generally happens. Since others have already explored program startup, we can start with a clear idea of what happens before main.

The _start Function

For most C and C++ programs, the true entry point is not main, it's the _start function. This function initializes the program runtime and invokes the program's main function.

The use of _start is merely a general convention. The entry function can vary depending on the system, compiler, and standard libraries. For example, OS X only has dynamically linked applications; the loader takes care of setup, and the entry point to the program is actually main.

The linker controls the program's entry point. The default entry point can be overridden by clang and GCC linkers using the -e flag, although this is rarely done for most programs.

The implementation of the _start function is usually supplied by libc. The _start function is often written in assembly. Many implementations store the _start function in a file called crt0.s. Compilers typically ship with pre-compiled crt0.o object files for each supported architecture.

Program startup code behavior is not specified by the C and C++ standards. Instead, the standards describe the conditions that must be true when the main function is called. However, there are many steps that are commonly performed across the majority of _start implementations.

At a high level, the _start function handles:

  1. Early low-level initialization, such as:
    1. Configuring processor registers
    2. Initializing external memory
    3. Enabling caches
    4. Configuring the MMU
  2. Stack initialization, making sure that the stack is properly aligned per the ABI requirements
  3. Frame pointer initialization
  4. C/C++ runtime setup
  5. Initializing other scaffolding required by the system
  6. Jumping to main
  7. Exiting the program with the return code from main

While the _start routine typically encompasses these activities, the specific order and implementation varies from system to system. For example, early low-level initialization code is commonly found with bare-metal embedded systems, but rarely on host machines with an OS. Your Linux or OS X program startup code will have multiple scaffolding functions which you will not find in embedded startup code.

Let's take a look at a simple implementation of an x86_64 _start function taken from the OS Dev wiki. This example provides us with a preview of the basic skeleton for program startup. The implementations we will review later in this series are much more complex.

The startup code below assumes that the program loader put:

  • *argv and *envp variables on the stack
  • argc in register %rdi
  • argv in register %rsi
  • envc in register %rdx
  • envp in register %rcx

Here's the implementation of _start:

.section .text

.global _start
_start:
    # Set up end of the stack frame linked list
    movq $0, %rbp
    pushq %rbp # rip=0
    pushq %rbp # rbp=0
    movq %rsp, %rbp

    # Save argc and argv on the stack
    # We need those in a moment when we call main
    pushq %rsi
    pushq %rdi

    # Prepare signals, memory allocation, stdio, etc.
    call initialize_standard_library

    # Run the global constructors.
    call _init

    # Restore argc and argv before calling main
    popq %rdi
    popq %rsi

    # Run main
    call main

    # Terminate the process with the exit code 
    # that was returned from main
    movl %eax, %edi
    call exit

Let's dive in and see what happens during the runtime setup process (initialize_standard_library above).

Runtime Setup

C/C++ runtime setup is a universal requirement for program startup. At a high level, our runtime setup must accomplish the following:

  1. Relocate any relocatable sections (if not handled by the loader or linker)
  2. Initializing global and static memory
  3. Prepare the argc and argv variables for invoking main (even if it's just setting these to 0/NULL)

Initializing global and static memory is broken down into two distinct steps that deserve additional details.

First, the runtime initializes a subset of uninitialized memory (no = in the declaration) to 0. This includes global and static variables, but not stack variables. All uninitialized data that needs to be set to 0 is placed into the .bss section of the compiled program image by the linker. The location of the .bss section is identified during initialization, and the memory is typically set to 0 with memset.

Second, C++ global objects must be properly constructed before calling main. The linker places these constructors into the .init, .init_array, or .ctors section of the image. Some compilers also allow C and C++ functions to be marked as a constructor using a compiler attribute (e.g., __attribute__((constuctor))). The constructors are stored in a list by the linker. The runtime initialization process iterates through the list and calls each constructor.

These additional runtime initialization steps are run for most programs (but not all):

  1. Heap initialization
  2. Initialize stdio (i.e., stdin,stdout,stderr`)
  3. Initialize exception support
  4. Register destructors and other cleanup functions that will run when exiting the program (using atexit and __cxa_atexit)
  5. Prepare environment variables

In practice, the line between the responsibilities of _start and the C runtime initialization can be fuzzy. Some implementations of _start handle pieces of the runtime setup directly, such as setting the .bss section contents to 0 and calling global constructors. Other implementations implement those tasks in the runtime setup routines.

Assembly files commonly found during this portion of the startup process are crtbegin.s, crtend.s, crti.s, and crtn.s. Compilers often ship pre-compiled object files for supported architectures. These files are related to calling global constructors and destructors. When the files are not used, equivalent functionality is often implemented in C and invoked during runtime initialization.

Other Scaffolding

The setup process may invoke other functions to set up program scaffolding that the system requires. Program scaffolding setup before main might include:

  1. Threading support and thread local storage
  2. Buffer overrun detection
  3. Stack logging
  4. Run-time error checks
  5. Locale settings
  6. Math error handling
  7. Default math library precision

The specific scaffolding functions invoked vary across standard library implementations and operating systems.

Jumping to main

Once we have a fully initialized system, we can safely jump to main and execute the programmer's portion of the application.

The most important aspect: once the program reaches main, it must be in a standards-conforming state. Otherwise, the program's assumptions will be invalidated.

Returning From main

While we were primarily interested in how we get to main, we should finish our explanation of the _start function's responsibilities.

Because _start invokes main, it also handles its return. When control returns from main to _start, the next function to run is exit. The exit function calls all functions registered with atexit and __cxa_atexit during the startup process. Then exit calls the global destructors (those placed in the .fini, .fini_array, or .dtors sections). Finally, exit terminates the program with the return value provided by main.

The exit function is primarily used for hosted programs. Bare metal programs rarely have use for the exit function or global destructors.

How Do We Get to _start?

Now that we know how our program gets to main by way of the _start function, you may wonder how the program gets to _start.

There are three common paths:

  1. Baremetal: reset vector
  2. Bootloader launches application
  3. OS Calls an exec function

Baremetal: Reset Vector

A baremetal embedded application represents the simplest path to _start.

Consider a baremetal platform with a binary stored in flash memory. When power is applied to the processor, the processor will copy the program from flash and store it in RAM1. Once the program is loaded into memory, the processor jumps to the reset interrupt vector address.

The embedded program's reset interrupt handler initializes the system after power-on or reset. The reset handler typically performs an initial configuration of the processor registers and critical hardware components (such as external RAM, caches, or MMU). Once this initial configuration is complete, the reset handler jumps to _start.

Some systems do not utilize the C standard library, and in that case _start will not be called. Instead, the reset handler will invoke other setup functions or will directly execute necessary program setup steps.

1: If the chip supports execute-in-place (XIP), the processor will skip the copy step and run the program directly from flash memory.

Bootloader Launches Application

Many embedded applications are composed of multiple distinct images which run sequentially during the boot process.

Many systems use a bootloader or hypervisor, which runs before loading and executing the main application. Bootloaders perform a wide range of activities, including initializing system hardware, decryption, decompression, checking that a firmware image is valid before loading it, selecting a firmware image to boot, or determining whether to enter firmware update mode. Bootloader complexity depends on the system's requirements; not of the listed tasks tasks will be performed.

Other systems require an incremental boot process, especially when the main application is larger than the processor's internal RAM capacity. The first boot stage is typically a small image which fits into the processor's internal memory. This image will initialize external RAM and load the main application from flash into the external RAM. The first stage boot may perform additional steps, such as processor vector remapping or MMU configuration. Once the main application is loaded, the first stage boot invokes the reset vector of the main application.

Multi-stage boot scenarios complicate the program startup model. Each boot stage is technically a standalone program. However, not every stage will run through the full program startup process. Simple boot stages may only need to clear the .bss section to perform their duties, while complex bootloaders need a fully initialized C/C++ runtime. Program startup activities may be distributed across the boot process, with each stage handling specific tasks.

OS Calls an exec function

The most complex scenario is running a program on a host machine with a fully-fledged OS.

When you launch a program, your shell or GUI invokes a program loader. The loader is responsible for copying the application image from the hard drive into memory and configuring the environment that the program will run in. On Linux or OS X, the loader is a function in the exec() family typically execve() or execvp(). For Windows, the loader is the LdrInitializeThunk function in ntdll.dll.

Loaders will often perform the following actions:

  • Check permissions
  • Allocate space for the program's stack
  • Allocate space for the program's heap
  • Initialize registers (e.g., stack pointer)
  • Push argc, argv, and envp onto the program stack
  • Map virtual address spaces
  • Dynamic linking
  • Relocations
  • Call pre-initialization functions

Once the loader has configured the program environment, it calls the program's _start function.

Exploring On Your Own

In the next three articles, we will review a selection of startup procedures which differ greatly in terms of process and style:

  1. Newlib (ARM)
  2. OS X
  3. Custom Embedded System with ThreadX

We won't be reviewing Linux program startup, because there are already high-quality articles on that topic. For detailed descriptions about how Linux programs start, we recommend these articles:

  1. How Programs Get Run
  2. How Programs Get Run: ELF Binaries
  3. Linux x86 Program Start Up or - How the heck do we get to main()?

The startup code that your system runs is supplied by your libc implementation and system libraries, and the implementations will also vary depending on the target architecture. Don't be surprised if you find a different startup process than those described in this series and in other articles around the web.

You can explore your own program's startup behavior using objdump or a debugger (I.e. gdb, lldb). You can use debugging tools to tackle the problem from a variety of directions:

  1. Set a breakpoint at main() and run a backtrace to see the function call stack
  2. Set a breakpoint at _start() (or whatever entry point your backtrace shows) and step through the execution
  3. Dump the assembly output for the program using objdump

As Daniel Näslund pointed out in the comments, your debugger may be configured to suppress backtraces that go past the main function. For gdb, you can run the following command:

(gdb) set backtrace past-main on

Further Reading

Related Articles

What can software organizations learn from the Boeing 737 MAX saga?

One of the largest news stories over the past month was the grounding of Boeing 737 MAX-8 and MAX-9 aircraft after an Ethiopian Airlines crash resulted in the deaths of everyone on board. This is the second deadly crash of involving a Boeing 737 MAX. A Lion Air Boeing 737 MAX-8 crashed in October 2018, also killing everyone on board. As a result of these two crashes, Boeing 737 MAX airplanes have been temporarily grounded in over 41 countries, including China, the US, and Canada. Boeing also paused delivery of these planes, although they are continuing to produce them.

I have been following the Boeing 737 MAX story closely. It serves as an interesting case study on software and systems engineering, human factors, corporate behavior, and customer service.

Note: Both the Lion Air and Ethiopian Airlines crashes are still under investigation. Ultimately, everything you are reading about these crashes and that I discuss here is still in the realm of speculation. However, the situation is serious enough and well-enough understood that Boeing is addressing the problem immediately.

Table of Contents:

Brief Background on the 737 MAX

Before diving into the suspected problem with the 737 MAX, I need to set the stage with some background information about the aircraft.

The Boeing 737 is the best-selling aircraft in the world, with over 15,000 planes sold. After Airbus announced an upgrade to the A320 that provided 14% better fuel economy per seat, Boeing responded with the 737 MAX. Boeing sold the 737 MAX as an "upgrade" to the famed 737 design, using larger engines for improved fuel efficiency (also by 14%). Boeing claimed that the 737 MAX operated and flew in the same way as the 737 NG, so pilots licensed to fly the 737 NG did not need additional training and simulator time for the 737 MAX.

Because Boeing increased the engine size to improve fuel efficiency, the engines needed to be positioned higher up on the plane's wings and slightly forward of the old position. Higher nose landing gear was also added to provide the same ground clearance as the 737NG.

The larger engines and new positions destabilized the aircraft, but not under all conditions. The engine housings were designed so they do not generate lift in normal flight. However, if the airplane is in a steep pitch (e.g., takeoff or a hard turn), the engine housings generate more lift than on previous 737 models. Depending on the angle, the airplane's inertia can cause the plane to over-swing into a stall.

To address the increased stall risk, Boeing developed a software solution: the Maneuvering Characteristics Augmentation System (MCAS). No other commercial plane uses a system like the MCAS, though Boeing uses a similar MCAS system on the KC-46 Pegasus military aircraft.

The MCAS is part of the flight management computer software. The pilot and co-pilot each have their own flight computer, but only one has control at a time. The MCAS takes readings from the angle of attack (AoA) sensor to determine how the plane's nose is pointed relative to the oncoming air. The MCAS monitors airspeed, altitude, and AoA. When the MCAS determines that the angle of attack is too great, it automatically performs two actions to prevent a stall:

  1. Command the aircraft's trim system to adjust the rear stabilizer and lower the nose
  2. Push the pilot's yoke in the down direction

The movement of the rear stabilizer varies with the speed of the plane. The stabilizer moves more at slower speeds and less at higher speeds.

By default, the MCAS is active when:

  • AoA is high (ascent, steep turn)
  • Autopilot is off
  • Flaps are up

The MCAS will deactivate once:

  • The AoA measurement is below the target threshold
  • The pilot overrides the system with a manual trim setting
  • The pilot engages the CUTOUT switch, which disables automatic control of the stabilizer trim

If the pilot overrides the MCAS with trim controls, it will activate again within five seconds after the trim switches are released if the sensors still detect an AoA over the threshold. The only way to completely disable the system is to use the CUTOUT switch and take manual trim control.

Note this important point: Boeing designed the MCAS to not turn off in response to a pilot manually pulling the yoke. Doing so would defeat the original purpose of the MCAS, which is to prevent the pilot from inadvertently entering a stall angle.

I highlight this point because a natural reaction to a plane that is pitching downward is to pull on the yoke. You are applying a counter-force to correct for the unexpected motion. For normal autopilot trim or runaway manual trim, pulling on the yoke does what you expect and triggers trim hold sensors.

We are under the impression that the column, yoke, steering wheel, gas pedal, and brakes fully control the response of the mechanical system. This is an illusion. Modern aircraft, like most modern cars, are "fly-by-wire". Gone are the days of direct mechanical connections involving cables and hydraulic lines. Instead, most of the connections are purely electrical and typically mediated by a computer. In many ways we are being continually "guarded" by the computers that mediate these connections. It can be a terrible shock when the machine fights against you.

The Suspected Problem

The MCAS is suspected to have played a significant role in both crashes.

During Lion Air flight JT610, MCAS repeatedly forced the plane's nose down, even when the plane was not stalling. The pilots tried to correct by pointing the nose higher, but the system kept pushing it down again. This up-and-down oscillation happened 21 times before the crash occurred. The Ethiopian Airlines crash shows a similar pattern. The Ethiopian Airlines CEO said that they believed that the MCAS was active during the Ethiopian Airlines crash.

Image from the Lion Air crash preliminary report. Notice how the Automatic Trim (yellow line) was forcing the aircraft down, and the pilots countered by pointing it back up (light blue line above Automatic Trim).

If the plane wasn't actually stalling, or even close to a stall angle, why was MCAS engaged?

AoA sensors can be unreliable, which is a suggested factor in the Lion Air crash, where there was a 20-degree discrepancy in AoA sensor readings. The MCAS only reads the AoA sensor on its corresponding side of the plane. The MCAS reacts to the reading faithfully and does not cross-check the other sensor to confirm the reading. If a sensor goes haywire, the MCAS has no way of knowing.

If the MCAS was enabled erroneously, why did the pilots not disable the system?

This is where the situation becomes muddled. The likeliest explanation for the Lion Air pilots is that they had no idea that the MCAS existed, that it was active, or how they could disable it.

Remember, the MCAS is a unique piece of software among commercial airplanes; it only runs on the 737 MAX. Boeing sold and certified the 737 MAX as a minor upgrade to the 737 body, which would not require pilots to re-certify or spend time training in simulators. As a result, it seems that the existence of the MCAS was largely kept quiet.

“We do not like the fact that a new system was put on the aircraft and wasn’t disclosed to anyone or put in the manuals."

  • Jon Weaks, president of Southwest Airlines Pilots Association

"This is the first description you, as 737 pilots, have seen. It is not in the AA 737 Flight Manual Part 2, nor is there a description in the Boeing FCOM (flight crew operations manual). It will be soon."

  • Message to APA from Capt. Mike Michaelis

After the Lion Air crash, Boeing released a bulletin providing details on how the system worked and how to counter-act it in case of malfunction. Boeing announced that the MCAS could move the stabilizer by 2.5 degrees. This movement limit applies separately for each time the MCAS is activated. Boeing confirmed that MCAS can move the stabilizer to its full downward position if the pilot did not counteract it with manual trimming or completely cutting out the system. With a limit of 2.5 degrees, two cycles of MCAS without pilot correction is enough to reach full downward position.

Boeing also said that emergency procedures that applied to earlier 737 models would have corrected the problems observed in the Lion Air crash.

The Lion Air pilots likely fought against an automated system that was working against them. The system is most likely to activate at low altitudes, such as during takeoff, leaving the pilots little time to react. Their search through the technical manuals proved unsuccessful.

The Ethiopian Airlines pilots had heard about MCAS thanks to the bulletin, although one pilot commented, "we know more about the MCAS system from the media than from Boeing". Ethiopian Airlines installed one of the first simulators for the 737 MAX, but the pilot of the doomed flight had not yet received training in the simulator. All we know at this time is that the pilot reported "flight control problems" and wanted to return to the airport and that the Ethiopian Airlines crash resembles the Lion Air crash. We must wait for the preliminary report for more details.

Compounding Factors

Based on our current knowledge, the first-level analysis leads us to believe that the MCAS system was poorly designed and caused two plane crashes.

It's not quite that simple. This is a complex situation, involving many people and organizations. Other pilots have struggled against the MCAS system and safely guided their passengers to their destination.

The following contributing factors play out time-and-again in other systems.

Poor Documentation

As I mentioned, after the Lion Air crash, pilots complained that they were not told about the MCAS or trained in how to respond when the system engages unexpectedly. This lack of documentation or training is especially dangerous when you are fighting against an automated system and your previous training does not fully apply (recall that pulling on the yoke to hold against the trim does not work against the MCAS). Even worse, Lion Air pilots attempted to find answers in their manuals before they crashed.

Pilots take their documentation extremely seriously. Below are three reports from the Aviation Safety Reporting System (ASRS), which is run by NASA to provide pilots and crews with a way to report safety issues confidentially.

The reports highlighted below focus on the insufficiency of Boeing 737 MAX documentation. I've bolded some sentences for emphasis.

ACN 1593017

Synopsis:

B737MAX Captain expressed concern that some systems such as the MCAS are not fully described in the aircraft Flight Manual.

Highlights from the narrative:

This description is not currently in the 737 Flight Manual Part 2, nor the Boeing FCOM, though it will be added to them soon. This communication highlights that an entire system is not described in our Flight Manual. This system is now the subject of an AD.

I think it is unconscionable that a manufacturer, the FAA, and the airlines would have pilots flying an airplane without adequately training, or even providing available resources and sufficient documentation to understand the highly complex systems that differentiate this aircraft from prior models. The fact that this airplane requires such jury rigging to fly is a red flag. Now we know the systems employed are error prone--even if the pilots aren't sure what those systems are, what redundancies are in place, and failure modes.

I am left to wonder: what else don't I know? The Flight Manual is inadequate and almost criminally insufficient. All airlines that operate the MAX must insist that Boeing incorporate ALL systems in their manuals.

ACN 1593021

Synopsis:

B737MAX Captain reported confusion regarding switch function and display annunciations related to "poor training and even poorer documentation".

Highlights from narrative:

This is very poorly explained. I have no idea what switch the preflight is talking about, nor do I understand even now what this switch does.

I think this entire setup needs to be thoroughly explained to pilots. How can a Captain not know what switch is meant during a preflight setup? Poor training and even poorer documentation, that is how.

It is not reassuring when a light cannot be explained or understood by the pilots, even after referencing their flight manuals. It is especially concerning when every other MAINT annunciation means something bad. I envision some delayed departures as conscientious pilots try to resolve the meaning of the MAINT annunciation and which switches are referred to in the setup.

ACN 1555013

Synopsis:

B737 MAX First Officer reported feeling unprepared for first flight in the MAX, citing inadequate training.

Highlights from narrative:

I had my first flight on the Max [to] ZZZ1. We found out we were scheduled to fly the aircraft on the way to the airport in the limo. We had a little time [to] review the essentials in the car. Otherwise we would have walked onto the plane cold.

My post flight evaluation is that we lacked the knowledge to operate the aircraft in all weather and aircraft states safely. The instrumentation is completely different - My scan was degraded, slow and labored having had no experience w/ the new ND (Navigation Display) and ADI (Attitude Director Indicator) presentations/format or functions (manipulation between the screens and systems pages were not provided in training materials. If they were, I had no recollection of that material).

We were unable to navigate to systems pages and lacked the knowledge of what systems information was available to us in the different phases of flight. Our weather radar competency was inadequate to safely navigate significant weather on that dark and stormy night. These are just a few issues that were not addressed in our training.

Even worse, it appears that the FAA's System Safety Analysis document was also incorrect:

The original Boeing document provided to the FAA included a description specifying a limit to how much the system could move the horizontal tail — a limit of 0.6 degrees, out of a physical maximum of just less than 5 degrees of nose-down movement. [...] That limit was later increased after flight tests showed that a more powerful movement of the tail was required to avert a high-speed stall, when the plane is in danger of losing lift and spiraling down.

After the Lion Air Flight 610 crash, Boeing for the first time provided to airlines details about MCAS. Boeing’s bulletin to the airlines stated that the limit of MCAS’s command was 2.5 degrees. That number was new to FAA engineers who had seen 0.6 degrees in the safety assessment.

“The FAA believed the airplane was designed to the 0.6 limit, and that’s what the foreign regulatory authorities thought, too,” said an FAA engineer. “It makes a difference in your assessment of the hazard involved.”

I understand the pilots' concern, given that the MCAS could move the tail 4x farther than stated in the official safety analysis. What else is undocumented or documented incorrectly?

Rushed Release

I would bet that all engineers are familiar with rushed releases. We cut corners, make concessions, and ignore or mask problems - all so we can release a product by a specific date. Any problems are downplayed, and those that are observed by the customer can be fixed later in a patch.

Apparently, the 737 MAX was subject to the same treatment. Here are some key highlights from the article:

  • The FAA delegates some certification and technical assessments to airplane manufacturers, citing lack of funding and resources to carry out all operations internally
    • FAA managers have final authority on what gets delegated to the manufacturer
  • Boeing was under time pressure, because development of the 737 MAX was nine months behind the new A320neo
  • FAA technical experts said in interviews that managers prodded them to speed up the process
  • FAA safety engineer who was involved with certifying the 737 MAX was quoted saying that halfway through the certification process:
    • “We were asked by management to re-evaluate what would be delegated. Management thought we had retained too much at the FAA.”
    • “There was constant pressure to re-evaluate our initial decisions. And even after we had reassessed it […] there was continued discussion by management about delegating even more items down to the Boeing Company.”
    • “There wasn’t a complete and proper review of the documents. Review was rushed to reach certain certification dates.”
  • If there wasn't time for FAA staff to complete a review, FAA manages either signed off on the documents themselves or delegated the review to Boeing
  • As a result of this rushed process, a major change slipped through the process:
    • The System Safety Analysis on MCAS claims that the horizontal tail movement is limited to 0.6 degrees
    • This number was found to be insufficient for preventing a stall in worst-case scenarios
    • The number was increased 4x to 2.5 degrees
    • The FAA was never told about this changed, and FAA engineers did not learn about it until Boeing released the MCAS bulletin following the Lion Air crash

The New York Times corroborates this rushed released:

  • "The pace of the work on the 737 Max was frenetic, according to current and former employees who spoke with The New York Times."
    • “The timeline was extremely compressed,” the engineer said. “It was go, go, go.”
  • "One former designer on the team working on flight controls for the Max said the group had at times produced 16 technical drawings a week, double the normal rate."
  • "Facing tight deadlines and strict budgets, managers quickly pulled workers from other departments when someone left the Max project."
  • "Roughly six months after the project’s launch, engineers were already documenting the differences between the Max and its predecessor, meaning they already had preliminary designs for the Max — a fast turnaround, according to an engineer who worked on the project."
  • "A technician who assembles wiring on the Max said that in the first months of development, rushed designers were delivering sloppy blueprints to him. He was told that the instructions for the wiring would be cleaned up later in the process, he said."
    • "His internal assembly designs for the Max, he said, still include omissions today, like not specifying which tools to use to install a certain wire, a situation that could lead to a faulty connection. Normally such blueprints include intricate instructions."
  • "Despite the intense atmosphere, current and former employees said, they felt during the project that Boeing’s internal quality checks ensured the aircraft was safe"
  • “This program was a much more intense pressure cooker than I’ve ever been in,” he added. “The company was trying to avoid costs and trying to contain the level of change. They wanted the minimum change to simplify the training differences, minimum change to reduce costs, and to get it done quickly.”

I've worked on many fast-paced engineering projects. I've observed and personally made compromises to meet deadlines, and there are many that I disagreed with. All of these points are familiar and hit home. I was quite surprised to find that the culture that builds aircraft would be so similar to the culture that builds consumer electronics.

Delayed Software Updates

Weeks after the Lion Air crash, Boeing officials told the Southwest Airlines and American Airlines pilot's unions that they planned to have software updates available around the end of 2018.

“Boeing was going to have a software fix in the next five to six weeks,” said Michael Michaelis, the top safety official at the American Airlines pilots union and a Boeing 737 captain. “We told them, ‘Yeah, it can’t drag out.’ And well, here we are.”

The FAA told The Wall Street Journal that FAA work on the new MCAS software was delayed for five weeks by the government shutdown. However, the "enhancement" was submitted to the FAA for certification on 21 January, only four days before the shutdown ended.

The official software update was announced four months later than the initial estimate. It will still take many more months to approve and deploy.

We are all conditioned to waiting for fixes and updates. Teams are prone to giving idealistic estimates. Problems take longer than expected to diagnose, correct, and validate. Schedules are repeatedly overrun.

However, it's not going to comfort the families of those who lost their lives on Ethiopian Airlines Flight 302 that Boeing released a software fix for certification seven weeks before the fatal crash. There is a real cost to the delay of software updates, and that cost increases significantly with the impact of the issue. It is always better to take the necessary time to implement a robust design in order to avoid needing a patch at all.

Humans Were Out of the Loop

One uncomfortable computing fact remains true: humans are superior at dynamically receiving and synthesizing data.

Computers can only perform actions they were already programmed to do. A computer cannot take in additional data which it wasn't already programmed to read. The MCAS was designed to use a single data point, that of the AoA sensor on the corresponding side of the plane. The initial NTSC report on the Lion Air crash tells us that a single faulty AoA sensor triggered the MCAS.

If a pilot or co-pilot noticed a strange AoA reading (such as a 20-degree difference between the left and right AoA sensors), he or she could perform a "cross check" by glancing at the reading on the other side of the plane. Additional sensors and gauges can be read to corroborate or disprove a strange AoA reading. Hell, a pilot could even look out the window to get a sense of the plane's angle. The pilots could have a discussion and collectively determine which sensor they trusted. Our brains can take in any combination of this information and confirm/disprove a sensor reading.

What is even more troubling is that the system's behavior was opaque to the pilots. According to Boeing, the MCAS is (counter-intuitively) only active in manual flight mode, and is disabled when under autopilot. MCAS controls the trim without notifying the pilots that it is doing so.

Boeing did provide two optional features that would provide more insight into the situation:

  • An AoA indicator, which displays the sensor readings
  • An AoA disagree light, which lights up if the two AoA sensors disagree

But because these were optional, many carriers did not elect to buy them.

In a fight between an unaware human pilot and the MCAS, the MCAS has a fair chance of winning. Even if the pilot disables MCAS by setting a manual trim, MCAS would automatically kick back in if the high AoA reading was still detected. Combined with the fact that the MCAS could move the stabilizer 2.5 degrees per activation, it could continue to push the aircraft nose down until the stabilizer's force could no longer be overcome by the pilot's input.

Because of our superiority at dynamic information synthesis, humans must maintain the ability to override or overpower an automated process. At present, nothing in the world is as skilled at dealing with complexity and chaos as the human mind.

Boeing's Response

We've pointed a lot of fingers at Boeing, let’s take a moment to review what they are doing in response.

An MCAS software update has been announced:

Boeing has developed an MCAS software update to provide additional layers of protection if the AOA sensors provide erroneous data. The software was put through hundreds of hours of analysis, laboratory testing, verification in a simulator and two test flights, including an in-flight certification test with Federal Aviation Administration (FAA) representatives on board as observers.

The following changes will be made:

  • Flight control system will now compare inputs from both AOA sensors
  • If the sensors disagree by 5.5 degrees or more with the flaps retracted, MCAS will not activate
  • An indicator on the flight deck display will alert the pilots to AoA Disagree condition
    • This was previously a paid upgrade, but now will now ship as a standard feature
  • MCAS will also be disabled and if the AoA Disagree displayed with the AoA differs more than 10° for over 10 seconds during flight
  • If MCAS is activated in non-normal conditions, it will only provide one input for each elevated AOA event
    • There are no known or envisioned failure conditions where MCAS will provide multiple inputs.
  • MCAS can never command more stabilizer input than can be counteracted by the flight crew pulling back on the yoke.
    • The pilots will continue to always have the ability to override MCAS and manually control the airplane

In addition to the software changes, there are extensive training changes. Pilots will have to complete 21+ days of instructor-led academics and simulator training. Computer-based training will be made available to all 737 MAX pilots, which includes the MCAS functionality, associated crew procedures, and related software changes. Pilots will also be required to review the new documents:

  • Flight Crew Operations Manual Bulletin
  • Updated Speed Trim Fail Non-Normal Checklist
  • Revised Quick Reference Handbook

Boeing and the FAA participated in an evaluation of the software and 12 March test flight. Boeing will now work on getting the update approved for installation by the various airworthiness authorities around the world. I expect this to be a long road to approval after Boeing and the FAA destroyed their store of trust.

All of these actions seem correct to me as an engineer and systems builder. But I am crestfallen that they weren't included in the initial release.

Is This the Result of Bad Software?

It's very tempting to label the 737 MAX crashes as "caused by software." At some level, this is true. However, the MCAS appears to be a software patch applied to a larger systems problem (and a hastily assembled patch at that).

Let's walk through the chain that appears to have led us here:

  1. Fuel is expensive, and we want more efficient engines to reduce that burden
  2. Airbus was improving their aircraft, which placed pressure on Boeing to respond with their own improved platform
    1. The timeline was largely dictated by Airbus, not the time Boeing engineers needed to complete the project
  3. Boeing wanted to stick to the 737 platform for a variety of reasons:
    1. Faster time to market
    2. Lower cost for producing and certifying a new plane
    3. Pilot familiarity, leading to reduced training requirements for airlines
  4. Boeing sold the 737 MAX to airlines on the ideals of increased fuel efficiency, platform familiarity, and lower upgrade costs
  5. Bigger engines did not fit on the existing 737 platform, so modifications were needed:
    1. Move the engines forward
    2. Mount the engines higher
    3. Increase the height of the front landing gear
  6. These modifications changed the aerodynamics of the airplane, which should have changed certification requirements and required more training
  7. Instead Boeing created the MCAS to address the aerodynamic impact of the new design
  8. Boeing downplayed the MCAS system, which resulted in:
    1. Improper/insufficient certification
    2. Insufficient documentation
    3. Pilots received no training for handling the new 737 MAX

This is a systems engineering problem created by the company's design goals. Boeing's guiding light was to reuse the 737 platform so they could keep up with Airbus and minimize training requirements. Redesigning the airplane was entirely out of the question because it would give Airbus a significant time advantage and necessitate expensive training. To meet the design goals and avoid an expensive hardware change, Boeing created the MCAS as a software band-aid.

This scenario is quite familiar to me. As a firmware engineer, applying software workarounds for silicon or hardware design flaws is a major part of my work. Fixing hardware is "expensive" in terms of both time and money. At some point it's too late to change the hardware (or so I've been repeatedly told). The schedule drives the decision to move forward with known hardware design flaws.

The next line is predictable: "The problem will just have to be fixed in software." But software fixes do not always work. When the software workaround fails, we seem to forget that we were already attempting to hide a problem.

I am not alone in the view that this is not a "software problem". Trevor Sumner had an excellent Twitter Thread where he summarized the thoughts of Dave Kammeyer. Trevor's take extends beyond the Boeing analysis and even includes non-software factors leading to the Lion Air crashes (re-formatted for easier reading):

On both ill-fated flights, there was a:

  • Sensor problem. The AoA vane on the 737MAX appears to not be very reliable and gave wildly wrong readings. On #LionAir, this was compounded by a:
  • Maintenance practices problem. The previous crew had experienced the same problem and didn't record the problem in the maintenance logbook. This was compounded by a:
  • Pilot training problem. On LionAir, pilots were never even told about the MCAS, and by the time of the Ethiopian flight, there was an emergency AD issued, but no one had done sim training on this failure. This was compounded by an:
  • Economic problem. Boeing sells an option package that includes an extra AoA vane, and an AoA disagree light, which lets pilots know that this problem was happening. Both 737MAXes that crashed were delivered without this option. No 737MAX with this option has ever crashed. All of this was compounded by a:
  • Pilot expertise problem. If the pilots had correctly and quickly identified the problem and run the stab trim runaway checklist, they would not have crashed.

His closing point is savage (emphasis mine):

Nowhere in here is there a software problem. The computers & software performed their jobs according to spec without error. The specification was just shitty. Now the quickest way for Boeing to solve this mess is to call up the software guys to come up with another band-aid.

I've watched the "fix it in software" cycle play out repeatedly when developing iPhones. Should we be surprised that the same happens for an airplane too? What would prevent it, the idea of a safety culture? Can you ever be truly safe when you are optimizing for time-to-market and reduced costs.

After the resulting deaths, loss in market cap, and destruction of trust, one must wonder if Boeing will ever realize the cost savings they hoped the software fix would provide.

Note: We should leave open the possibility that there is a compounding software issue at play, since there are ASRS reports which indicate problems that occurred with autopilot on, a scenario where MCAS is supposed to be inactive.

Lessons We Can Apply to Our Systems

A complex system operated in an unexpected manner, and 347 people are dead as a result. We cannot restore their lives, but we must learn as much as possible to prevent such deaths in the future.

These are the lessons that I've learned from this investigation so far:

You Cannot Bend Complex Systems To Your Will

Boeing took an existing complex system and tried to change that system to force a specific outcome. Systems thinkers everywhere are cringing at this, because all changes to complex systems have unintended consequences.

Donna Meadows said in "Dancing with Systems":

But self-organizing, nonlinear, feedback systems are inherently unpredictable. They are not controllable. They are understandable only in the most general way. The goal of foreseeing the future exactly and preparing for it perfectly is unrealizable. The idea of making a complex system do just what you want it to do can be achieved only temporarily, at best. We can never fully understand our world, not in the way our reductionistic science has led us to expect. Our science itself, from quantum theory to the mathematics of chaos, leads us into irreducible uncertainty. For any objective other than the most trivial, we can’t optimize; we don’t even know what to optimize. We can’t keep track of everything. We can’t find a proper, sustainable relationship to nature, each other, or the institutions we create, if we try to do it from the role of omniscient conqueror.

Donna continues:

Systems can’t be controlled, but they can be designed and redesigned. We can’t surge forward with certainty into a world of no surprises, but we can expect surprises and learn from them and even profit from them. We can’t impose our will upon a system. We can listen to what the system tells us, and discover how its properties and our values can work together to bring forth something much better than could ever be produced by our will alone.

These thoughts are echoed by Dr. Russ Ackoff in a short talk titled "Beyond Continual Improvement". The points he makes in that brief fifteen minutes repeatedly echoed in my head while writing this essay.

A system is not the sum of the behavior of its parts, it is a product of their interactions. The performance of a system depends on how the parts fit, not how they act taken separately.

Boeing changed a few individual parts of the plane and expected the overall performance to be improved. But the effect on the overall system was more complex than the changes led them to expect.

When you get rid of something you don’t want (remove a defect), you are not guaranteed to have it replaced with what you do what.

We are all familiar with the experience of fixing a bug, only to have a new bug (or several) appear as a result of our fix.

Finding and removing defects is not a way to improve the overall quality or performance of a system.

The larger engines on the 737 airframe resulted in undesirable flight characteristics (excessive upward pitch at steep AoA). Boeing responded by attempting to address this defect with the MCAS. It's clear that the MCAS does not unilaterally improve the overall quality or performance of the aircraft.

What aspects of your system are you trying to force? Perhaps you can broaden your perspective and look at different approaches. The answer will reveal itself if you listen, though you might have to head in a different direction than you orginally intended.

Where You are Aiming is the Most Important Thing

There is an idea that I've been holding in the forefront of my mind: nothing has more of an impact on where you will eventually end up as where you are aiming. Setting the right aim is the most important thing.

It seems to me that Boeing's aim was to keep up with Airbus, leading to an aggressive time-to-market. They also wanted to minimize changes to ease certification and ensure that pilots did not need to receive new training. Those are the principles that appear to have guided their actions. Safety was still a concern, but that is not what the organization, system, or schedule focused on.

Dr. Ackoff echoes this idea in "Beyond Continual Improvement"

Basic principle: an improvement program must be directed at what you want, not at what you don’t want

At one level, we can say that Boeing wanted a new aircraft with improved fuel efficiency to compete with Airbus.

At another level, what Boeing wanted was to design a new aircraft with improved fuel efficiency, but in such a way as to not require a new airframe design, to not require a timeline that delayed them significantly with regards to the Airbus launch, and to not require pilots to receive training on the new airplane.

Boeing seems to have focused heavily on the things they did not want out of the improved design.

If you stick to the base level of desire (wanting a new aircraft with improved fuel efficiency), it seems that the system needed to be largely redesigned with a new airframe to support larger engines.

Your company’s aim is a truly powerful force. Your organization is headed in only that direction.

Ask yourselves often: is it the proper aim?

Treat Documentation as a First Class Citizen

If other people will use your product, you need to treat documentation as a first class citizen. Useful and comprehensive documentation and training is extremely important to your users and the engineers and managers that come after you.

Pilots are fanatical about their documentation, as well they should be. There is clear and documented outrage that details were kept from them.

In this case, improved documentation would have led to better understanding of the system forces at work. Improved documentation alone could have potentially saved hundreds of lives.

We try to hold back because we think our users don't need (or can't handle) the details:

One high-ranking Boeing official said the company had decided against disclosing more details to cockpit crews due to concerns about inundating average pilots with too much information - and significantly more technical data - than they needed or could digest.

Software teams often take this view of their users. Perhaps it is simply a rationalization for not wanting to put the effort into creating and maintaining documentation. How can we predict what information people need to know? What is too technical, and what is enough information? Won't the details change as the system evolves? How will we keep it maintained?

When we leave out documentation or fudge the explanations of how things work, we hinder our users. What could your users accomplish with your system if they had a full understanding of how it worked? I guarantee they can handle and achieve much more than you expect.

Software teams also hinder themselves when they neglect documentation. When we document, we are acting as explorers, mapping uncharted territory. New team members can learn how the system is designed. Ideas for simplification will jump out at you. You'll start thinking about novel ways to use your software and the edge cases that will be encountered. Poorly understood system aspects are suddenly obvious - "here be dragons".

It's a popular adage: if you can't explain something in simple terms, you don't understand it. And if you don't explain something, nobody else has a chance of understanding it.

Keep Humans in the Loop

I stated earlier that humans must maintain the ability to override or overpower an automated process. Because of our superiority at dynamic information collection and synthesis, we can improvise and make novel decisions in response to new situations. A computer, which has been preprogrammed to read from a limited amount of information and perform a set of specific responses, is not (yet) capable of improvising.

“What we have here is a ‘failure of the intended function,’ going back to your recent piece [on SOTIF — Safety of the Intended Functionality]. Barnden said, “A plane shouldn’t fight the pilot and fly into the ground. This is happening after decades of R&D into aviation automation, cockpit design and human factors research in planes

System designers and programmers are not all-knowing. Make sure that humans are kept in the loop - let them override your automated processes. Perhaps they know better after all.

Testing Doesn't Mean You Are Safe

Phil Koopman recently wrote about a concept he calls The Insufficient Testing Pitfall:

Testing less than the target failure rate doesn't prove you are safe. In fact you probably need to test for about 10x the target failure rate to be reasonably sure you've met it. For life critical systems this means too much testing to be feasible.

No doubt about it: the airplane and software were tested. Probably significantly. Certainly in simulators and in test flights. But it seems that Boeing did not test the system enough to encounter these problems. And even if they did - what other problems would still be missed?

We need a plan for proving that our software works safely. Testing is not enough.

Could This Happen in Your Organization?

It's easy for us to read about the Boeing 737 MAX saga, or other similar human-caused disasters, and think that we would never have walked down the path that led there. I implore you to have sympathy and understanding. Humans committed those actions. You are also human. You (and the organizations you are a part of) are capable of the same actions, for the same reasons. Keep the possibility of catastrophe in mind when you are tempted to let standards slide.

All of this is familiar to me as an engineer. I've worked on many fast-paced engineering projects. I've observed and personally made compromises to meet deadlines: some I proposed myself, and others that I disagreed with. I've seen these compromises work out, and I've seen them fail spectacularly. I got lucky. I don't work on safety critical software, and I have never watched people die at the hands of my systems. I have deep sympathy for the engineers who will be forever plagued by their creation.

After the Lion Air crash, Boeing offered trauma counseling to engineers who had worked on the plane. “People in my group are devastated by this,” said Mr. Renzelmann, the former Boeing technical engineer. “It’s a heavy burden.”

We must also remember that nobody at Boeing wanted to trade human lives for increased profits. All human organizations - families, companies, industries, governments - are complex systems and have a life of their own. The organization can make and execute a decision which none of the participants truly want, such as shipping a compromised product or prioritizing profits over safety.

What I see with Boeing is an organization that made the same kind of decisions that I regularly see made at every organization I've been a part of. Like at all of these other organizations, they did not escape the consequences of their decisions. The difference for Boeing is that they were playing for bigger stakes, and the result of their misplaced bet is more painful.

There was not villainous a CEO who forced his minions to compromise the product. There was not an entire organization whose individuals decided to collectively disregard safety. The organization rallied around the goals of time-to-market and minimizing required pilot training. Momentum and inertia kept the company marching toward their aim, even if individuals disagreed. And perhaps nobody explicitly noticed that safety was de-prioritized as a result.

I want to repeat this: Boeing made the same decisions that are being made everywhere else.

We all have a duty to aim higher.

Further Reading

For more on the Boeing 737 MAX Saga:

Commentary on the situation:

Thoughts on Autonomy and Safety:

Acknowledgments

Our creations are never the result of a single mind.

I want to thank Rozi Harris and Stephen Smith for reviewing early drafts of this essay. Their feedback, conversation, and exploration of the topics at hand has been extremely helpful. Many of their discussion points were incorporated into the essay.

Thank you to the hard-working journalists and aviation fanatics who have published brilliant coverage and analysis for the 737 MAX saga. I know only a fraction of what others know about the problems discussed herein.

I also want to thank all of my colleagues who stood beside me over the years. It takes a monumental effort to build something new, and it rarely works out. We should all be amazed at our combined human triumph.

The lessons I present are hard-won, collectively generated, and the result of long debates. I hope the next generation of creators can use them to move beyond our current capabilities.