A General Overview of What Happens Before main()

Updated: 20190909

For most programmers, a C or C++ program's life begins at the main function. They are blissfully unaware of the hidden steps that happen between invoking a program and executing main. Depending on the program and the compiler, there are all kinds of interesting functions that get run before main, automatically inserted by the compiler and linker and invisible to casual observers.

Unfortunately for programmers who are curious about the program startup process, the literature on what happens before main is quite sparse.

Embedded Artistry has been hard at work creating a C++ embedded framework. The final piece of the puzzle was implementing program startup code. To aid in the design of our framework's boot process, I performed an exploratory survey of existing program startup implementations. My goal was to identify a general program startup model and to provide a more comprehensive look into how our programs get to main.

In this six-part series, we will be investigating what it takes to get to main:

  1. What Happens Before main()?
  2. Exploring Startup Implementations: Newlib (ARM)
  3. Exploring Startup Implementations: OS X
  4. Exploring Startup Implementations: Custom Embedded System with ThreadX
  5. Abstracting a Generic Flow for Getting to main
  6. Implementing our Generic Startup Flow

To begin our investigation, we will provide a summary of what happens in a program before main. The steps and responsibilities we describe are generalized so that they apply to most systems. We will supplement the general theory in the following articles with an analysis of real-world implementations.

Table of Contents:

  1. Getting to Main: A General Overview
    1. The _start Function
    2. Runtime Setup
    3. Other Scaffolding
    4. Jumping to main
    5. Returning from main
  2. How Do We Get to _start?
    1. Baremetal: reset vector
    2. Bootloader launches application
    3. OS Calls an exec function
  3. Exploring On Your Own
  4. Further Reading

Getting to Main: A General Overview

Before we dive into our exploration of how existing systems get to main, we should develop a hypothesis about what generally happens. Since others have already explored program startup, we can start with a clear idea of what happens before main.

The _start Function

For most C and C++ programs, the true entry point is not main, it's the _start function. This function initializes the program runtime and invokes the program's main function.

The use of _start is merely a general convention. The entry function can vary depending on the system, compiler, and standard libraries. For example, OS X only has dynamically linked applications; the loader takes care of setup, and the entry point to the program is actually main.

The linker controls the program's entry point. With GCC and clang toolchains, the default entry point can be overridden by passing the -e flag to the linker, although this is rarely necessary.

The implementation of the _start function is usually supplied by libc. The _start function is often written in assembly. Many implementations store the _start function in a file called crt0.s. Compilers typically ship with pre-compiled crt0.o object files for each supported architecture.

Program startup code behavior is not specified by the C and C++ standards. Instead, the standards describe the conditions that must be true when the main function is called. However, there are many steps that are commonly performed across the majority of _start implementations.

At a high level, the _start function handles:

  1. Early low-level initialization, such as:
    1. Configuring processor registers
    2. Initializing external memory
    3. Enabling caches
    4. Configuring the MMU
  2. Stack initialization, making sure that the stack is properly aligned per the ABI requirements
  3. Frame pointer initialization
  4. C/C++ runtime setup
  5. Initializing other scaffolding required by the system
  6. Jumping to main
  7. Exiting the program with the return code from main
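
The stack alignment step in item 2 usually comes down to rounding the initial stack pointer down to an ABI-defined boundary. Below is a minimal sketch in C, assuming a 16-byte requirement (as in the x86_64 System V ABI). The helper name align_down is hypothetical, and real startup code typically does this in assembly:

```c
#include <stdint.h>

/* Hypothetical helper: round addr down to a multiple of alignment.
 * alignment must be a power of two. Stacks grow downward on most
 * architectures, so rounding down keeps the pointer inside the
 * region reserved for the stack. */
static uintptr_t align_down(uintptr_t addr, uintptr_t alignment)
{
    return addr & ~(alignment - 1u);
}
```

For example, align_down(0x20001007, 16) yields 0x20001000, while an address that is already aligned is returned unchanged.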

While the _start routine typically encompasses these activities, the specific order and implementation varies from system to system. For example, early low-level initialization code is commonly found with bare-metal embedded systems, but rarely on host machines with an OS. Your Linux or OS X program startup code will have multiple scaffolding functions which you will not find in embedded startup code.

Let's take a look at a simple implementation of an x86_64 _start function taken from the OS Dev wiki. This example provides us with a preview of the basic skeleton for program startup. The implementations we will review later in this series are much more complex.

The startup code below assumes that the program loader put:

  • *argv and *envp variables on the stack
  • argc in register %rdi
  • argv in register %rsi
  • envc in register %rdx
  • envp in register %rcx

Here's the implementation of _start:

.section .text

.global _start
_start:
    # Set up end of the stack frame linked list
    movq $0, %rbp
    pushq %rbp # rip=0
    pushq %rbp # rbp=0
    movq %rsp, %rbp

    # Save argc and argv on the stack
    # We need those in a moment when we call main
    pushq %rsi
    pushq %rdi

    # Prepare signals, memory allocation, stdio, etc.
    call initialize_standard_library

    # Run the global constructors.
    call _init

    # Restore argc and argv before calling main
    popq %rdi
    popq %rsi

    # Run main
    call main

    # Terminate the process with the exit code 
    # that was returned from main
    movl %eax, %edi
    call exit
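
For readers who are less comfortable with assembly, the control flow above can be paraphrased in C. This is only a sketch: start_sketch and the demo_ hooks are hypothetical names, and the hooks are passed in as parameters so the example stays self-contained. A real _start calls fixed symbols and hands the return value to exit.

```c
typedef void (*init_fn)(void);
typedef int (*main_fn)(int argc, char **argv);

/* Trace buffer so we can observe the call order. */
static char trace[8];
static int trace_len;

static void demo_init_library(void) { trace[trace_len++] = 'L'; }
static void demo_constructors(void) { trace[trace_len++] = 'C'; }

static int demo_main(int argc, char **argv)
{
    (void)argv;
    trace[trace_len++] = 'M';
    return argc;
}

/* Hypothetical C rendering of the assembly _start shown above. */
static int start_sketch(int argc, char **argv,
                        init_fn init_library,
                        init_fn run_constructors,
                        main_fn program_main)
{
    init_library();      /* signals, memory allocation, stdio, etc. */
    run_constructors();  /* global constructors (_init in the assembly) */

    /* A real _start passes this value to exit() and never returns. */
    return program_main(argc, argv);
}
```

Calling start_sketch with the three demo hooks records the order library setup, constructors, main, which mirrors the order of the call instructions in the assembly.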

Let's dive in and see what happens during the runtime setup process (initialize_standard_library above).

Runtime Setup

C/C++ runtime setup is a universal requirement for program startup. At a high level, our runtime setup must accomplish the following:

  1. Relocate any relocatable sections (if not handled by the loader or linker)
  2. Initialize global and static memory
  3. Prepare the argc and argv variables for invoking main (even if it's just setting these to 0/NULL)

Initializing global and static memory is broken down into two distinct steps that deserve additional details.

First, the runtime initializes a subset of uninitialized memory (no = in the declaration) to 0. This includes global and static variables, but not stack variables. All uninitialized data that needs to be set to 0 is placed into the .bss section of the compiled program image by the linker. The location of the .bss section is identified during initialization, and the memory is typically set to 0 with memset.
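
The zeroing step is small enough to sketch in C. The function name zero_bss is hypothetical. In real startup code the bounds come from symbols defined by the linker script (names such as __bss_start__ and __bss_end__ are common but toolchain-specific), so here they are plain parameters:

```c
#include <stddef.h>
#include <string.h>

/* Zero every byte in the half-open range [start, end).
 * On a real target, start and end would be the addresses of
 * linker-defined symbols that mark the .bss section. */
static void zero_bss(unsigned char *start, unsigned char *end)
{
    memset(start, 0, (size_t)(end - start));
}
```

Some startup files implement the same loop by hand in assembly, because memset may not be usable that early in the boot process.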

Second, C++ global objects must be properly constructed before calling main. The linker places these constructors into the .init, .init_array, or .ctors section of the image. Some compilers also allow C and C++ functions to be marked as a constructor using a compiler attribute (e.g., __attribute__((constructor))). The constructors are stored in a list by the linker. The runtime initialization process iterates through the list and calls each constructor.
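
Iterating through the constructor list can be sketched as a walk over an array of function pointers. In the sketch below, demo_init_array stands in for the table the linker would build, and all names are hypothetical. GNU linker scripts commonly expose the real table through symbols like __init_array_start and __init_array_end, but that is an assumption about your toolchain:

```c
#include <stddef.h>

typedef void (*ctor_fn)(void);

/* Records the order in which the constructors ran. */
static int order[4];
static int order_len;

static void ctor_a(void) { order[order_len++] = 1; }
static void ctor_b(void) { order[order_len++] = 2; }

/* Stand-in for the table the linker builds in .init_array. */
static const ctor_fn demo_init_array[] = { ctor_a, ctor_b };

/* Walk the table front to back, calling each entry once. */
static void run_constructors(const ctor_fn *begin, const ctor_fn *end)
{
    for (const ctor_fn *fn = begin; fn != end; ++fn) {
        (*fn)();
    }
}
```

Calling run_constructors(demo_init_array, demo_init_array + 2) runs ctor_a and then ctor_b, in table order.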

These additional runtime initialization steps are run for most programs (but not all):

  1. Heap initialization
  2. Initialize stdio (i.e., stdin, stdout, stderr)
  3. Initialize exception support
  4. Register destructors and other cleanup functions that will run when exiting the program (using atexit and __cxa_atexit)
  5. Prepare environment variables

In practice, the line between the responsibilities of _start and the C runtime initialization can be fuzzy. Some implementations of _start handle pieces of the runtime setup directly, such as setting the .bss section contents to 0 and calling global constructors. Other implementations perform those tasks in the runtime setup routines.

Assembly files commonly found during this portion of the startup process are crtbegin.s, crtend.s, crti.s, and crtn.s. Compilers often ship pre-compiled object files for supported architectures. These files are related to calling global constructors and destructors. When the files are not used, equivalent functionality is often implemented in C and invoked during runtime initialization.

Other Scaffolding

The setup process may invoke other functions to set up program scaffolding that the system requires. Program scaffolding setup before main might include:

  1. Threading support and thread local storage
  2. Buffer overrun detection
  3. Stack logging
  4. Run-time error checks
  5. Locale settings
  6. Math error handling
  7. Default math library precision

The specific scaffolding functions invoked vary across standard library implementations and operating systems.

Jumping to main

Once we have a fully initialized system, we can safely jump to main and execute the programmer's portion of the application.

The most important aspect: once the program reaches main, it must be in a standards-conforming state. Otherwise, the program's assumptions will be invalidated.

Returning From main

While we were primarily interested in how we get to main, we should finish our explanation of the _start function's responsibilities.

Because _start invokes main, it also handles its return. When control returns from main to _start, the next function to run is exit. The exit function calls all functions registered with atexit and __cxa_atexit during the startup process. Then exit calls the global destructors (those placed in the .fini, .fini_array, or .dtors sections). Finally, exit terminates the program with the return value provided by main.
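
The C standard requires exit to call functions registered with atexit in reverse order of registration. The sketch below imitates that behavior with a small hypothetical registry (register_exit_handler and run_exit_handlers are illustrative names, not libc functions):

```c
#include <stddef.h>

/* The C standard requires atexit to support at least 32 registrations. */
#define MAX_EXIT_HANDLERS 32

typedef void (*exit_handler)(void);

static exit_handler handlers[MAX_EXIT_HANDLERS];
static size_t handler_count;

/* Hypothetical stand-in for atexit(): 0 on success, -1 if the table is full. */
static int register_exit_handler(exit_handler fn)
{
    if (handler_count == MAX_EXIT_HANDLERS) {
        return -1;
    }
    handlers[handler_count++] = fn;
    return 0;
}

/* Call the handlers last-registered-first, as exit() does. */
static void run_exit_handlers(void)
{
    while (handler_count > 0) {
        handlers[--handler_count]();
    }
}

/* Demo handlers that record the order in which they ran. */
static char exit_trace[4];
static size_t exit_trace_len;
static void handler_a(void) { exit_trace[exit_trace_len++] = 'A'; }
static void handler_b(void) { exit_trace[exit_trace_len++] = 'B'; }
```

Registering handler_a and then handler_b, followed by run_exit_handlers(), runs handler_b first. The reverse ordering means that cleanup code runs before the cleanup of whatever it depends on.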

The exit function is primarily used for hosted programs. Bare metal programs rarely have use for the exit function or global destructors.

How Do We Get to _start?

Now that we know how our program gets to main by way of the _start function, you may wonder how the program gets to _start.

There are three common paths:

  1. Baremetal: reset vector
  2. Bootloader launches application
  3. OS Calls an exec function

Baremetal: Reset Vector

A baremetal embedded application represents the simplest path to _start.

Consider a baremetal platform with a binary stored in flash memory. When power is applied to the processor, the processor will copy the program from flash and store it in RAM [1]. Once the program is loaded into memory, the processor jumps to the reset interrupt vector address.

The embedded program's reset interrupt handler initializes the system after power-on or reset. The reset handler typically performs an initial configuration of the processor registers and critical hardware components (such as external RAM, caches, or MMU). Once this initial configuration is complete, the reset handler jumps to _start.
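
To make the reset path concrete, here is a host-side sketch that assumes an ARM Cortex-M style vector table, where the first entry holds the initial stack pointer and the second holds the address of the reset handler. The struct, the demo functions, and the stack address are all hypothetical, and simulate_reset stands in for what the processor does in hardware:

```c
#include <stdint.h>

typedef void (*handler_fn)(void);

/* First two entries of a Cortex-M style vector table. */
struct vector_table_head {
    uintptr_t  initial_stack_pointer;
    handler_fn reset;
};

static int hardware_ready;
static int start_called;

/* Placeholder for _start: records whether hardware setup came first. */
static void demo_start(void) { start_called = hardware_ready; }

static void reset_handler(void)
{
    hardware_ready = 1;  /* placeholder for register, RAM, cache, MMU setup */
    demo_start();        /* real code jumps to _start and never returns */
}

/* The stack address is an arbitrary example value. */
static const struct vector_table_head demo_vectors = {
    0x20002000u, reset_handler
};

/* Stand-in for the hardware: fetch the reset vector and jump to it. */
static void simulate_reset(const struct vector_table_head *vt)
{
    vt->reset();
}
```

On real hardware the table is placed at a fixed address by the linker script, and nothing in software calls the reset handler.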

Some systems do not utilize the C standard library, and in that case _start will not be called. Instead, the reset handler will invoke other setup functions or will directly execute necessary program setup steps.

[1] If the chip supports execute-in-place (XIP), the processor will skip the copy step and run the program directly from flash memory.

Bootloader Launches Application

Many embedded applications are composed of multiple distinct images which run sequentially during the boot process.

Many systems use a bootloader or hypervisor, which runs before loading and executing the main application. Bootloaders perform a wide range of activities, including initializing system hardware, decryption, decompression, checking that a firmware image is valid before loading it, selecting a firmware image to boot, or determining whether to enter firmware update mode. Bootloader complexity depends on the system's requirements; not all of the listed tasks will be performed.
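
As one illustration of checking that a firmware image is valid, the sketch below verifies a magic number, a length, and a simple additive checksum. The header layout, the magic value, and the function names are all invented for this example. Production bootloaders generally use a CRC or a cryptographic signature instead of a byte sum.

```c
#include <stddef.h>
#include <stdint.h>

#define IMAGE_MAGIC 0x46574D47u  /* arbitrary value chosen for this sketch */

/* Hypothetical image header; real formats vary widely. */
struct image_header {
    uint32_t magic;
    uint32_t payload_len;
    uint32_t checksum;  /* sum of payload bytes, modulo 2^32 */
};

static uint32_t byte_sum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; ++i) {
        sum += data[i];
    }
    return sum;
}

/* Returns 1 if the image passes all checks, 0 otherwise.
 * available is the number of payload bytes actually present. */
static int image_is_valid(const struct image_header *hdr,
                          const uint8_t *payload, size_t available)
{
    if (hdr->magic != IMAGE_MAGIC) {
        return 0;
    }
    if (hdr->payload_len > available) {
        return 0;
    }
    return byte_sum(payload, hdr->payload_len) == hdr->checksum;
}
```

A bootloader built along these lines would only jump to the application when image_is_valid returns 1, and would otherwise fall back to another image or to firmware update mode.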

Other systems require an incremental boot process, especially when the main application is larger than the processor's internal RAM capacity. The first boot stage is typically a small image which fits into the processor's internal memory. This image will initialize external RAM and load the main application from flash into the external RAM. The first stage boot may perform additional steps, such as processor vector remapping or MMU configuration. Once the main application is loaded, the first stage boot invokes the reset vector of the main application.

Multi-stage boot scenarios complicate the program startup model. Each boot stage is technically a standalone program. However, not every stage will run through the full program startup process. Simple boot stages may only need to clear the .bss section to perform their duties, while complex bootloaders need a fully initialized C/C++ runtime. Program startup activities may be distributed across the boot process, with each stage handling specific tasks.

OS Calls an exec function

The most complex scenario is running a program on a host machine with a fully-fledged OS.

When you launch a program, your shell or GUI invokes a program loader. The loader is responsible for copying the application image from the hard drive into memory and configuring the environment that the program will run in. On Linux or OS X, the loader is reached through a function in the exec() family, typically execve() or execvp(). For Windows, the loader is the LdrInitializeThunk function in ntdll.dll.

Loaders will often perform the following actions:

  • Check permissions
  • Allocate space for the program's stack
  • Allocate space for the program's heap
  • Initialize registers (e.g., stack pointer)
  • Push argc, argv, and envp onto the program stack
  • Map virtual address spaces
  • Dynamic linking
  • Relocations
  • Call pre-initialization functions

Once the loader has configured the program environment, it calls the program's _start function.
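
From the launching side, the whole sequence is triggered by a few POSIX calls. The sketch below shows roughly what a shell does: fork a child, replace the child's image with execve, and wait for the exit status. The function name run_program is hypothetical, and the example assumes a POSIX system.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Launch the program at path with the given argv and envp.
 * Returns the program's exit status, or -1 on failure. */
static int run_program(const char *path, char *const argv[],
                       char *const envp[])
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: on success execve does not return. The loader
         * takes over and eventually reaches the new program's
         * entry point. */
        execve(path, argv, envp);
        _exit(127);  /* reached only if execve failed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The argv and envp arrays passed here are the same ones the loader places on the new program's stack, and the status returned is the value the program handed to exit.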

Exploring On Your Own

In the next three articles, we will review a selection of startup procedures which differ greatly in terms of process and style:

  1. Newlib (ARM)
  2. OS X
  3. Custom Embedded System with ThreadX

We won't be reviewing Linux program startup, because there are already high-quality articles on that topic. For detailed descriptions about how Linux programs start, we recommend these articles:

  1. How Programs Get Run
  2. How Programs Get Run: ELF Binaries
  3. Linux x86 Program Start Up or - How the heck do we get to main()?

The startup code that your system runs is supplied by your libc implementation and system libraries, and the implementations will also vary depending on the target architecture. Don't be surprised if you find a different startup process than those described in this series and in other articles around the web.

You can explore your own program's startup behavior using objdump or a debugger (e.g., gdb or lldb). You can use debugging tools to tackle the problem from a variety of directions:

  1. Set a breakpoint at main() and run a backtrace to see the function call stack
  2. Set a breakpoint at _start() (or whatever entry point your backtrace shows) and step through the execution
  3. Dump the assembly output for the program using objdump

As Daniel Näslund pointed out in the comments, your debugger may be configured to suppress backtraces that go past the main function. For gdb, you can run the following command:

(gdb) set backtrace past-main on

Further Reading

Change Log

  • 20190909:
    • Added links to a great Matt Godbolt talk
    • Added links to Memfault's "Zero to Main()" series

Related Articles

EMB2: A C/C++ Framework for Multi-core and Multi-chip Embedded Systems

EMB2 is a C/C++ framework developed by Siemens and the University of Houston. EMB2 provides generic building blocks for building multi-core or multi-chip embedded applications, including basic parallel algorithms, concurrent data structures, and application skeletons. Since EMB2 is targeted for embedded applications, it provides soft-real-time support, predictable memory consumption (no dynamic memory allocations after startup), support for task priorities and affinities, and non-blocking APIs.

The framework utilizes the Multicore Association's Task Management abstraction layer, MTAPI, enabling EMB2 programs to be easily ported to new operating systems and processor architectures. By utilizing MTAPI, heterogeneous and distributed embedded programming is simplified, and developers can easily distribute work across processor cores, hardware accelerators, GPUs, DSPs, FPGAs, or networked devices.

The EMB2 base library is implemented as a C API with C++ wrappers, while the parallel algorithms, dataflow patterns, and concurrent containers are implemented in C++. C99 and C++03 are used as the implementation standard to provide maximum usability in the embedded world, though C11 and C++11 are also supported.

If you are building a product which uses a multi-core processor, multiple processors, or hardware accelerators, EMB2 provides a solid and portable foundation that will enable your team to take full advantage of your system's hardware resources.


Simple Fixed-Point Conversion in C

Operating on fixed-point numbers is a common embedded systems task. Our microcontrollers may not have floating-point support, our sensors may provide data in fixed-point formats, or we may want to use fixed-point mathematics to control a value's range and precision.

There are numerous fixed-point mathematics libraries around the internet, such as fixed_point or the Compositional Numeric Library for C++. If you are looking for a reliable solution to use long-term, spend some time reviewing these libraries to identify candidates for integration.

However, we don't always have the time required to select a library. Perhaps you just need to convert a fixed-point number for prototyping purposes, or you need to do a quick implementation for Friday's demo.

Below is a quick-and-dirty approach for converting between fixed-point and floating-point numbers. If you need to handle mathematical operations on fixed-point numbers, look for a library to integrate.

Lossy Conversion of Fixed-Point Numbers

First, we need to select our fixed-point type. For this example, we'll be using 16-bit fixed point numbers, in an 11.5 format (11 integral bits, 5 fractional bits):

#include <stdint.h>

/// Fixed-point Format: 11.5 (16-bit)
typedef uint16_t fixed_point_t;

We'll make a quick macro for the number of fractional bits:

#define FIXED_POINT_FRACTIONAL_BITS 5

Then we'll define two conversion functions:

/// Converts 11.5 format -> double
double fixed_to_float(fixed_point_t input);

/// Converts double to 11.5 format
fixed_point_t float_to_fixed(double input);

Now that we've gotten the groundwork out of the way, we'll write our fixed-point to floating-point conversion function. Converting from fixed-point to floating-point is straightforward. We take the input value and divide it by 2^(fractional_bits), putting the result into a double:

inline double fixed_to_float(fixed_point_t input)
{
    return ((double)input / (double)(1 << FIXED_POINT_FRACTIONAL_BITS));
}

To convert from floating-point to fixed-point, we follow this algorithm:

  1. Calculate x = floating_input * 2^(fractional_bits)
  2. Round x to the nearest whole number (e.g. round(x))
  3. Store the rounded x in an integer container

Using the algorithm above, we would implement our float-to-fixed conversion as follows:

inline fixed_point_t float_to_fixed(double input)
{
    return (fixed_point_t)(round(input * (1 << FIXED_POINT_FRACTIONAL_BITS)));
}

However, not all of our embedded systems utilize the standard library, and perhaps round() is not supplied. You can also just rely on truncation when converting to an integer. There will be some precision loss, but for a quick-and-dirty solution that may be acceptable:

inline fixed_point_t float_to_fixed(double input)
{
    return (fixed_point_t)(input * (1 << FIXED_POINT_FRACTIONAL_BITS));
}
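
To see how much the truncating version loses, the sketch below compares it with a common workaround when round() is unavailable: add half of the smallest step before truncating. This add-half technique is our own addition rather than part of the approach above, it only works for non-negative inputs, and the function names are hypothetical:

```c
#include <stdint.h>

#define FRACTIONAL_BITS 5  /* 11.5 format: one step is 1/32 */

/* Truncating conversion: discards anything below one step. */
static uint16_t to_fixed_truncate(double input)
{
    return (uint16_t)(input * (1 << FRACTIONAL_BITS));
}

/* Round to nearest without round(): add half a step, then truncate.
 * Valid for non-negative inputs only, which matches the unsigned type. */
static uint16_t to_fixed_nearest(double input)
{
    return (uint16_t)(input * (1 << FRACTIONAL_BITS) + 0.5);
}

/* Convert back so the two results can be compared as real numbers. */
static double to_double(uint16_t input)
{
    return (double)input / (double)(1 << FRACTIONAL_BITS);
}
```

A value that is exactly representable, such as 3.25, converts to 104 either way. For 0.03, the scaled value is 0.96: truncation stores 0, while the add-half version stores 1, which converts back to 0.03125.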

If you need to support multiple fixed-point styles, you can provide interfaces for various integer widths and add the fractional bit count as an input argument:

// Convert 16-bit fixed-point to double
double fixed16_to_double(uint16_t input, uint8_t fractional_bits)
{
    return ((double)input / (double)(1 << fractional_bits));
}

// Equivalent of our 11.5 conversion function above
double r = fixed16_to_double(input, 5);

There you have it: quick-and-dirty fixed-point conversion methods.
