Exploring Startup Implementations: OS X

For most programmers, a C or C++ program's life begins at the main function. They are blissfully unaware of the hidden steps that happen between invoking a program and executing main. Depending on the program and the compiler, there are all kinds of interesting functions that get run before main, automatically inserted by the compiler and linker and invisible to casual observers.

Unfortunately for programmers who are curious about the program startup process, the literature on what happens before main is quite sparse.

Embedded Artistry has been hard at work creating a C++ embedded framework. The final piece of the puzzle was implementing program startup code. To aid in the design of our framework's boot process, I performed an exploratory survey of existing program startup implementations. My goal is to identify a general program startup model. I also want to provide a more comprehensive look into how our programs get to main.

In this six-part series, we will be investigating what it takes to get to main:

  1. A General Overview of What Happens Before main()
  2. Exploring Startup Implementations: Newlib (ARM)
  3. Exploring Startup Implementations: OS X
  4. Exploring Startup Implementations: Custom Embedded System with ThreadX
  5. Abstracting a Generic Flow for Getting to main
  6. Implementing our Generic Startup Flow

Now that we have a high-level understanding of how our programs get to main, we can explore real-world implementations of program startup code.

Today's analysis focuses on OS X program startup code. OS X may seem like a strange choice for an embedded blog. I chose OS X for these reasons:

  1. OS X provides a different program startup model than the other systems that we will explore
  2. OS X seems unique in that all applications are dynamically linked
  3. Developers in general seem to be more familiar with ELF than Mach-O
  4. Dynamic loading is outside of my comfort zone, and I will have an opportunity to push my own limits

If you want to explore OS X program startup behavior on your own, you can download the dyld source or browse the source code online.

The boot flow is quite complicated, and it's easy to get lost. You can refer to the Visual Summary throughout the article for a visual representation of the startup procedure and call stack. Additionally, dyld is a large and complicated program. To keep this article from becoming unnecessarily dense, we will stick to a high-level analysis and gloss over some implementation details.

Table of Contents:

  1. Mach-O Format
  2. OS X: No Static Applications
  3. x86_64 Assembly Overview
  4. System Configuration
  5. Initial Exploration
    1. Backtrace
    2. Disassembly
  6. OS X Program Startup
    1. Launching a Program
    2. The Dynamic Linker
    3. dyld Source Code Analysis
    4. libSystem
  7. Visual Summary
  8. Startup Activity Checklist
  9. Further Reading

Mach-O Format

Mach-O is a file format used by Apple for macOS and iOS. On OS X, all native applications use the Mach-O format. You can identify Mach-O dynamic libraries by the suffix .dylib. We only need a basic understanding of the file format for this article, so I will be discussing high-level details.

A Mach-O file has three regions:

  1. Mach-O header, with general information about the binary
    1. Byte order
    2. CPU Type
    3. Number of load commands
  2. Load commands, which describe segments, symbol tables, entry points, and more
    • There are a variety of load commands, and each command has its own associated metadata
    • You will probably see 15+ load commands for a binary
  3. Program data, which includes things like:
    1. Symbol tables
    2. Dynamic symbol tables
    3. Code (__TEXT segment)
    4. Data (__DATA segment)

You can view the Mach-O header and load commands for a Mach-O application using otool:

$ otool -l buildresults/test/libmemory_freelist_test

This will display the Mach-O header and a long list of load commands. In my case, there are 17 load commands.

Here's example header output:

libmemory_freelist_test:
Mach header
      magic cputype cpusubtype  caps    filetype ncmds sizeofcmds      flags
 0xfeedfacf 16777223          3  0x80           2    17       1560 0x00218085
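The numeric fields in that header are easier to read once they are decoded. The following is a minimal sketch, not part of otool or dyld: the constants mirror values defined in Apple's <mach-o/loader.h> and <mach/machine.h> but are redefined here so the example compiles on any platform, and the HeaderSummary struct and describe function are hypothetical names used only for illustration.

```cpp
#include <cstdint>
#include <string>

// Constants mirroring Apple's headers, redefined so this sketch is portable.
constexpr std::uint32_t kMagic64      = 0xfeedfacf; // MH_MAGIC_64
constexpr std::uint32_t kAbi64        = 0x01000000; // CPU_ARCH_ABI64
constexpr std::uint32_t kCpuTypeX86   = 7;          // CPU_TYPE_X86
constexpr std::uint32_t kFileTypeExec = 2;          // MH_EXECUTE
constexpr std::uint32_t kFlagDyldLink = 0x4;        // MH_DYLDLINK
constexpr std::uint32_t kFlagPie      = 0x200000;   // MH_PIE

// Hypothetical holder for the otool columns we care about.
struct HeaderSummary {
    std::uint32_t magic;
    std::uint32_t cputype;
    std::uint32_t filetype;
    std::uint32_t ncmds;
    std::uint32_t flags;
};

// Turn the raw numbers into a short human-readable description.
std::string describe(const HeaderSummary& h) {
    std::string out;
    out += (h.magic == kMagic64) ? "64-bit Mach-O" : "unknown magic";
    out += (h.cputype == (kAbi64 | kCpuTypeX86)) ? ", x86_64" : ", other CPU";
    out += (h.filetype == kFileTypeExec) ? ", executable" : ", other file type";
    if (h.flags & kFlagDyldLink) out += ", input for dyld";
    if (h.flags & kFlagPie) out += ", PIE";
    return out;
}
```

Feeding in the values from the output above (magic 0xfeedfacf, cputype 16777223, filetype 2, ncmds 17, flags 0x00218085) describes a 64-bit x86_64 executable that is position-independent and intended for the dynamic linker.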

Most of the load commands describe segments:

Load command 0
      cmd LC_SEGMENT_64
  cmdsize 72
  segname __PAGEZERO
   vmaddr 0x0000000000000000
   vmsize 0x0000000100000000
  fileoff 0
 filesize 0
  maxprot 0x00000000
 initprot 0x00000000
   nsects 0
    flags 0x0

The path to the dynamic linker is always included in the Mach-O files:

Load command 7
          cmd LC_LOAD_DYLINKER
      cmdsize 32
         name /usr/lib/dyld (offset 12)

As well as the entry point for the program:

Load command 11
       cmd LC_MAIN
   cmdsize 24
  entryoff 4064
 stacksize 0

The load commands describe dynamic libraries required by the application, with one load command per library:

Load command 12
          cmd LC_LOAD_DYLIB
      cmdsize 56
         name /usr/lib/libSystem.B.dylib (offset 24)
   time stamp 2 Wed Dec 31 16:00:02 1969
      current version 1252.200.5
compatibility version 1.0.0
Load command 13
          cmd LC_LOAD_DYLIB
      cmdsize 72
         name /usr/local/opt/cmocka/lib/libcmocka.0.dylib (offset 24)
   time stamp 2 Wed Dec 31 16:00:02 1969
      current version 0.5.1
compatibility version 0.0.0

You will see other load command types as well; I've highlighted the more important ones that we will see in our analysis.

OS X: No Static Applications

When compiling for OS X, you cannot [easily] produce statically linked applications. The reason for this is that libSystem, which provides C runtime and general system functionality, is only provided as a dynamic library (libSystem.dylib). You can technically create a statically linked application if you don't need to link with libSystem, but this is not feasible for most programs. As a consequence, our program startup exploration will involve a dynamic linker.

This restriction applies primarily to the OS X system libraries. You can still create static libraries on OS X, and they can be statically linked into the final application.

x86_64 Assembly Overview

We'll look at some x86_64 assembly, and I think it's always good to have a high-level overview so the code doesn't look like Greek.

x86_64 assembly provides 16 registers which we will generally encounter:

  1. rax: register a extended
  2. rbx: register b extended
  3. rcx: register c extended
  4. rdx: register d extended
  5. rbp: base pointer (start of stack/frame)
  6. rsp: stack pointer
  7. rsi: register source index (source for data copies)
  8. rdi: register destination index (destination for data copies)
  9. r8: register 8
  10. r9: register 9
  11. r10: register 10
  12. r11: register 11
  13. r12: register 12
  14. r13: register 13
  15. r14: register 14
  16. r15: register 15

The r prefix indicates a 64-bit register. 32-bit registers use the e prefix (eax) or d suffix (r9d).

Register names are prefixed by a % (e.g., %rsi). Immediate values are prefixed by a $. Indirect memory accesses are indicated with (parentheses).

Common commands we'll encounter are:

  • mov S, D: move from source to destination
  • push S: push source onto stack
  • pop D: pop top of stack into destination
  • call Label: pushes the return address and jumps to the label

There are a variety of suffixes used with many x86 commands to indicate size:

  • q = quadword, or 8-byte value
  • l = double-word, or four-byte value
  • w = word, or two-byte value
  • b = byte

For example, movq is move a quad-word.

During a function call, the following rules apply for the System V ABI (which is used by macOS and Linux):

  • The first six integer or pointer function arguments are stored in rdi, rsi, rdx, rcx, r8, and r9
  • Additional arguments are stored on the stack
  • The return value is stored in rax
  • The called routine must preserve rsp, rbp, rbx, r12, r13, r14, and r15.
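As a small illustration of the argument-passing rule, the sketch below maps an integer or pointer argument's position to the place the System V x86_64 convention puts it. The function name integer_argument_location is hypothetical, and the sketch ignores floating-point arguments, which use a separate set of registers.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Returns where the Nth (zero-based) integer or pointer argument is passed
// under the System V x86_64 calling convention.
std::string integer_argument_location(std::size_t index) {
    static const std::array<const char*, 6> registers = {
        "rdi", "rsi", "rdx", "rcx", "r8", "r9"};
    // The seventh and later arguments are passed on the stack.
    return index < registers.size() ? registers[index] : "stack";
}
```

This is why the startup code later in the article places the Mach-O header address in rdi before calling a function that takes it as its first parameter.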

System Configuration

For this analysis, I am using a MacBook Pro from Mid-2014. The processor is an Intel Core i5 (x86_64). My computer is running macOS Mojave version 10.14.3.

The Apple clang version is:

$ gcc -v
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

I also use mainline clang on this computer:

$ clang -v
clang version 7.0.1 (tags/RELEASE_701/final)
Target: x86_64-apple-darwin18.2.0
Thread model: posix
InstalledDir: /usr/local/opt/llvm/bin

Initial Exploration

Just like the Newlib exploration, I'll begin by building a program and trying to figure out what functions are called before main.

OS X is my primary development environment, so I'll use an existing program for this analysis: the libmemory unit tests.

Backtrace

First, we'll generate a backtrace to see what functions are called. Launch lldb with the application:

06:45:38 (master) libmemory$ lldb buildresults/test/libmemory_freelist_test

Set a breakpoint at main, and run the program:

(lldb) b main
Breakpoint 1: where = libmemory_freelist_test`main, address = 0x0000000100000fe0
(lldb) run
Process 71726 launched: '/Users/pjohnston/src/ea/embedded-framework/src/stdlibs/libmemory/buildresults/test/libmemory_freelist_test' (x86_64)

When we break at main, the backtrace command shows us the call stack:

Process 71726 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x0000000100000fe0 libmemory_freelist_test`main
libmemory_freelist_test`main:
->  0x100000fe0 <+0>: pushq  %rbp
    0x100000fe1 <+1>: movq   %rsp, %rbp
    0x100000fe4 <+4>: subq   $0x10, %rsp
    0x100000fe8 <+8>: movl   $0x0, -0x4(%rbp)
Target 0: (libmemory_freelist_test) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000100000fe0 libmemory_freelist_test`main
    frame #1: 0x00007fff79c48ed9 libdyld.dylib`start + 1
    frame #2: 0x00007fff79c48ed9 libdyld.dylib`start + 1

It looks like the true start function for our program is contained in libdyld, the dynamic loader library. It's curious that there are two sequential frames with the same function address; maybe that will reveal itself when we look at the source code.

Disassembly

We can take a first look at the disassembly for the libdyld start function:

(lldb) disassemble -m -a 0x00007fff79c48ed8
libdyld.dylib`start:
0x7fff79c48ed8 <+0>: nop
0x7fff79c48ed9 <+1>: movl   %eax, %edi
0x7fff79c48edb <+3>: callq  0x28abc                   ; symbol stub for: exit
0x7fff79c48ee0 <+8>: hlt

It's much shorter than I expected. It looks like some registers are adjusted and then a stub for exit is called. We need to see the source code to understand this mystery.

OS X Program Startup

Our previous analysis of the Newlib ARM startup code used an embedded processor. That program begins execution when power is applied to the processor, and terminates when exit is called or when power is removed. Our OS X analysis will differ greatly from the Newlib analysis. We are now looking at a program run on a fully-fledged operating system, which can run multiple different programs at once.

Launching a Program

Our journey starts by invoking a program. Apple's "Executing Mach-O Files" gives us a helpful description for the initial steps:

When you launch an application from the Finder or the Dock, or when you run a program in a shell, the system ultimately calls two functions on your behalf, fork and execve. The fork function creates a process; the execve function loads and executes the program. There are several variant exec functions, such as execl, execv, and exect, each providing a slightly different way of passing arguments and environment variables to the program. In OS X, each of these other exec routines eventually calls the kernel routine execve.

We've encountered the exec function family before, in our general program startup overview. For more information on execve, take a look at this article.

On OS X, all roads lead to the execve function, which is the program loader. This function copies the application image from the hard drive into memory and configures the environment that the program will run in. The execve function also provides our program with arguments (argc and argv) and environment variables (envp).
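To make the fork and execve pairing concrete, here is a minimal sketch of what a shell does when it launches a program. It uses only standard POSIX calls; the launch function is a hypothetical helper written for this article, not an Apple API.

```cpp
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

extern char** environ; // the caller's environment, passed on as envp

// Runs the program at `path` with the given arguments and returns its exit
// status, or -1 if the process could not be created or did not exit normally.
int launch(const std::string& path, const std::vector<std::string>& args) {
    // execve expects a NULL-terminated array of C strings.
    std::vector<char*> argv;
    for (const auto& arg : args) {
        argv.push_back(const_cast<char*>(arg.c_str()));
    }
    argv.push_back(nullptr);

    const pid_t pid = fork(); // create the new process
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        // Child: replace this process image with the target program.
        execve(path.c_str(), argv.data(), environ);
        _exit(127); // only reached if execve failed
    }

    // Parent: wait for the child and report how it exited.
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, launch("/bin/sh", {"sh", "-c", "exit 7"}) returns 7. Everything discussed in the rest of this article happens inside the child between the execve call and the first instruction of main.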

When you call execve, the kernel performs the following actions:

  1. Loads the file into memory
  2. Analyzes the mach_header structure at the start of the file to confirm that it's a valid Mach-O file
  3. Interprets the load commands stored in the header to map the program into its allocated address space with the proper protection flags (e.g., the __TEXT segment is readable and executable, but not writable)
  4. Loads the dynamic linker specified by the load commands
  5. Executes the dynamic linker on the program file

Here's an example load command for the __TEXT segment. Note that the segment contains multiple sections. For each section, the load commands specify addresses, sizes, file offsets, alignment, and flags.

Load command 1
      cmd LC_SEGMENT_64
  cmdsize 472
  segname __TEXT
   vmaddr 0x0000000100000000
   vmsize 0x0000000000002000
  fileoff 0
 filesize 8192
  maxprot 0x00000007
 initprot 0x00000005
   nsects 5
    flags 0x0
Section
  sectname __text
   segname __TEXT
      addr 0x0000000100000fe0
      size 0x0000000000000e05
    offset 4064
     align 2^4 (16)
    reloff 0
    nreloc 0
     flags 0x80000400
 reserved1 0
 reserved2 0
Section
  sectname __stubs
   segname __TEXT
      addr 0x0000000100001de6
      size 0x0000000000000024
    offset 7654
     align 2^1 (2)
    reloff 0
    nreloc 0
     flags 0x80000408
 reserved1 0 (index into indirect symbol table)
 reserved2 6 (size of stubs)
Section
  sectname __stub_helper
   segname __TEXT
      addr 0x0000000100001e0c
      size 0x000000000000004c
    offset 7692
     align 2^2 (4)
    reloff 0
    nreloc 0
     flags 0x80000400
 reserved1 0
 reserved2 0
Section
  sectname __cstring
   segname __TEXT
      addr 0x0000000100001e58
      size 0x000000000000015b
    offset 7768
     align 2^0 (1)
    reloff 0
    nreloc 0
     flags 0x00000002
 reserved1 0
 reserved2 0
Section
  sectname __unwind_info
   segname __TEXT
      addr 0x0000000100001fb4
      size 0x0000000000000048
    offset 8116
     align 2^2 (4)
    reloff 0
    nreloc 0
     flags 0x00000000
 reserved1 0
 reserved2 0

Here is a load command which specifies the path to the dynamic linker:

Load command 7
          cmd LC_LOAD_DYLINKER
      cmdsize 32
         name /usr/lib/dyld (offset 12)

The Dynamic Linker

At this point, execve has loaded our program into memory and provided us with argc, argv, and envp. The path to the dynamic linker is retrieved from the LC_LOAD_DYLINKER load command, and execve invokes it.

The OS X dynamic linker is called dyld. There are actually two distinct dyld components on OS X:

  • /usr/lib/dyld, the dynamic linker application
  • /usr/lib/system/libdyld.dylib, the dynamic library which provides dynamic linking functionality to the target program during runtime

At a high level, the dynamic linker performs the following steps:

  1. Handles initial program startup behavior
  2. Loads all of the shared libraries that our program links against into the program's address space
  3. Searches the libraries and binds symbols as required to start the program (i.e., all non-lazy references)
    1. Binding symbols is a complex topic that we are glossing over; for more information see Apple's Binding Symbols overview
  4. Bound symbol addresses are placed into sections corresponding to the entries in the indirect symbol table (defined by the LC_DYSYMTAB load command)
  5. Dynamic linker functions (from libdyld.dylib) are placed into memory so that our program can interact with the dynamic linker during runtime (e.g., to load more libraries or bind additional symbols)
  6. Runtime setup occurs, including calling global constructors registered by dynamically linked libraries
  7. The dynamic linker calls the program's entry function.

Some of the required dyld information is encoded in the Mach-O header, such as arrays of symbols which must be bound:

Load command 4
            cmd LC_DYLD_INFO_ONLY
        cmdsize 48
     rebase_off 12288
    rebase_size 16
       bind_off 12304
      bind_size 24
  weak_bind_off 0
 weak_bind_size 0
  lazy_bind_off 12328
 lazy_bind_size 160
     export_off 12488
    export_size 320

The dynamic libraries which must be loaded are encoded in the Mach-O header. Our test program loads two dynamic libraries: libSystem and libcmocka.

Load command 12
          cmd LC_LOAD_DYLIB
      cmdsize 56
         name /usr/lib/libSystem.B.dylib (offset 24)
   time stamp 2 Wed Dec 31 16:00:02 1969
      current version 1252.200.5
compatibility version 1.0.0
Load command 13
          cmd LC_LOAD_DYLIB
      cmdsize 72
         name /usr/local/opt/cmocka/lib/libcmocka.0.dylib (offset 24)
   time stamp 2 Wed Dec 31 16:00:02 1969
      current version 0.5.1
compatibility version 0.0.0

The LC_DYSYMTAB command contains addresses and counts for the dynamic symbol table.

Load command 6
            cmd LC_DYSYMTAB
        cmdsize 80
      ilocalsym 0
      nlocalsym 15
     iextdefsym 15
     nextdefsym 16
      iundefsym 31
      nundefsym 7
         tocoff 0
           ntoc 0
      modtaboff 0
        nmodtab 0
   extrefsymoff 0
    nextrefsyms 0
 indirectsymoff 13456
  nindirectsyms 14
      extreloff 0
        nextrel 0
      locreloff 0
        nlocrel 0

The entry point for our program is specified by the LC_MAIN command in the Mach-O header. By default, LC_MAIN is configured to point to the main function. This can be overridden using the -e linker flag if a different entry point is desired. Prior to OS X 10.8, an LC_UNIXTHREAD command was used to indicate the entry point. Programs using LC_UNIXTHREAD link against a crt0.o object which provides startup functionality. We will largely gloss over LC_UNIXTHREAD in this analysis.

Regardless of the function used to enter our program, the entryoff value in the LC_MAIN command points to the offset in the binary where our starting function is located.

Load command 11
       cmd LC_MAIN
   cmdsize 24
  entryoff 4064
 stacksize 0

The offset value of 4064 (hex 0xFE0) corresponds to the start of the __TEXT.__text section, which is also the start of the main function for our test program.

Offset   | Data | Description
         |      | 0x100000fe0 (_main)
00000FE0 | 55   | pushq %rbp...
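As a quick check of the arithmetic, the sketch below combines the entryoff value from LC_MAIN with the vmaddr of the __TEXT segment shown earlier. It assumes, as is the case for this binary, that __TEXT maps file offset 0. The entry_address function and its slide parameter are hypothetical names for illustration; the slide stands in for any randomized load offset applied at launch.

```cpp
#include <cstdint>

// Address of the entry function once the image is mapped into memory.
// Assumes the __TEXT segment begins at file offset 0.
constexpr std::uint64_t entry_address(std::uint64_t text_vmaddr,
                                      std::uint64_t entryoff,
                                      std::uint64_t slide = 0) {
    return text_vmaddr + entryoff + slide;
}

// Values taken from the otool output in this article.
static_assert(entry_address(0x100000000ULL, 4064) == 0x100000fe0ULL,
              "entryoff 4064 should land on the address lldb reported for main");
```

The result, 0x100000fe0, matches both the addr field of the __text section and the address where lldb placed the breakpoint on main.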

If you want to play around further with dyld, I recommend this Debugging dyld article, which highlights options that can be used to see what libraries are being loaded and a trace of functions that are called.

dyld Source Code Analysis

Now that we have a general overview of dyld, let's dig into the source code. You can browse the source code online or download a tarball of the source code. The project contains sources for both dyld and libdyld.dylib.

One thing to note up front is that dyld and libdyld can run on OS X or iOS. Assembly files support four distinct variants: x86, x86_64, arm, and aarch64 (also known as arm64). The variant that is used depends on the target.

We will not include full file implementations for assembly files. Instead, we will focus on x86_64 assembly variants since we are analyzing an OS X program. We will also be ignoring iOS Simulator code.

__dyld_start

The __dyld_start function is the entry point for the dyld program. This function is defined in src/dyldStartup.s.

The function opens with a helpful preamble that shows us how the kernel sets up the stack frame for __dyld_start:

/*
 * C runtime startup for interface to the dynamic linker.
 * This is the same as the entry point in crt0.o with the addition of the
 * address of the mach header passed as the an extra first argument.
 *
 * Kernel sets up stack frame to look like:
 *
 *  | STRING AREA |
 *  +-------------+
 *  |      0      |
*   +-------------+
 *  |  apple[n]   |
 *  +-------------+
 *         :
 *  +-------------+
 *  |  apple[0]   |
 *  +-------------+
 *  |      0      |
 *  +-------------+
 *  |    env[n]   |
 *  +-------------+
 *         :
 *         :
 *  +-------------+
 *  |    env[0]   |
 *  +-------------+
 *  |      0      |
 *  +-------------+
 *  | arg[argc-1] |
 *  +-------------+
 *         :
 *         :
 *  +-------------+
 *  |    arg[0]   |
 *  +-------------+
 *  |     argc    |
 *  +-------------+
 * sp-> |      mh     | address of where the a.out's file offset 0 is in memory
 *  +-------------+
 *
 *  Where arg[i] and env[i] point into the STRING AREA
 */

We see some typical assembly preamble. There is a declaration for a static symbol which points to __dyld_start:

.data
    .align 3
__dyld_start_static:
    .quad   __dyld_start

And the preamble for the __dyld_start function itself:

.text
    .align 2,0x90
    .globl __dyld_start
__dyld_start:

The first parameter on the stack is the Mach-O header address. This is moved into the rdi register, which holds the first function input argument.

popq    %rdi        # param1 = mh of app

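Next, a zero is pushed onto the stack as an end-of-frames marker for the debugger, and the frame pointer (rbp) is initialized from the stack pointer (rsp). The stack pointer is then aligned to a 16-byte boundary per the ABI requirements, and storage is allocated for local variables.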

    pushq   $0      # push a zero for debugger end of frames marker
    movq    %rsp,%rbp   # pointer to base of kernel frame
    andq    $-16,%rsp       # force SSE alignment
    subq    $16,%rsp    # room for local variables
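The andq with -16 is a standard alignment trick: in two's complement, -16 has every bit set except the low four, so the AND clears those bits and rounds the address down to a multiple of 16. A minimal sketch of the same operation, with a hypothetical function name and a made-up example address:

```cpp
#include <cstdint>

// Round an address down to the nearest 16-byte boundary, as `andq $-16, %rsp` does.
constexpr std::uint64_t align_down_16(std::uint64_t address) {
    return address & ~std::uint64_t{15};
}

// -16 and ~15 are the same 64-bit pattern: 0xFFFFFFFFFFFFFFF0.
static_assert(static_cast<std::uint64_t>(-16) == ~std::uint64_t{15},
              "the mask used by andq $-16 clears the low four bits");
```

Because the stack grows toward lower addresses, rounding down never overwrites data that is already on the stack.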

Once we've performed our initial setup, we prepare function arguments required for the __ZN13dyldbootstrap5startEPK12macho_headeriPPKclS2_Pm function. Now, that long and strange function name is a mangled C++ name. We can find the human readable version using c++filt:

06:05:38 dyld-635.2$ c++filt __ZN13dyldbootstrap5startEPK12macho_headeriPPKclS2_Pm
dyldbootstrap::start(macho_header const*, int, char const**, long, macho_header const*, unsigned long*)

The demangled function name also shows us the argument types, which gives us more context for the function call setup. The function arguments are loaded from the stack to the argument registers per the calling convention.

# call dyldbootstrap::start(app_mh, argc, argv, slide, dyld_mh, &startGlue)
    movl    8(%rbp),%esi    # param2 = argc into %esi
    leaq    16(%rbp),%rdx   # param3 = &argv[0] into %rdx
    movq    __dyld_start_static(%rip), %r8
    leaq    __dyld_start(%rip), %rcx
    subq     %r8, %rcx  # param4 = slide into %rcx
    leaq    ___dso_handle(%rip),%r8 # param5 = dyldsMachHeader
    leaq    -8(%rbp),%r9
    call    __ZN13dyldbootstrap5startEPK12macho_headeriPPKclS2_Pm
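The slide calculation in the middle of that listing deserves a note. The quad stored at __dyld_start_static holds the address the static linker assigned to __dyld_start, while the RIP-relative leaq yields the address where __dyld_start actually resides in memory. Subtracting the first from the second gives the distance dyld was moved when it was loaded. A minimal sketch of that subtraction, with a hypothetical function name and made-up addresses in the usage note:

```cpp
#include <cstdint>

// Slide = where the symbol actually is, minus where the linker expected it to be.
constexpr std::int64_t compute_slide(std::uint64_t runtime_address,
                                     std::uint64_t link_time_address) {
    return static_cast<std::int64_t>(runtime_address - link_time_address);
}
```

For instance, if the linker placed a symbol at 0x1000 and it is found at 0x10000e000 at runtime, compute_slide returns 0x10000d000. A result of zero means the image was loaded at its preferred address.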

dyldbootstrap::start returns the address of the target program's entry function. There is some preparatory work required before launching the target program.

First, the assembly reads the stack value which represents the final argument to dyldbootstrap::start: uintptr_t* startGlue. We'll see where this is set later, but the value is 0 if LC_UNIXTHREAD is used. Otherwise, it holds the address of a start glue function in libdyld.dylib. This glue function is used to provide a false backtrace from main.

If LC_MAIN is not used (startGlue, now in rdi, is 0), the stack pointer is restored to its original unaligned value, the Mach-O header address is removed, and the frame pointer is reset to 0. These will be set up again by the crt0.o _start function.

    movq    -8(%rbp),%rdi
    cmpq    $0,%rdi
    jne Lnew

        # clean up stack and jump to "start" in main executable
    movq    %rbp,%rsp   # restore the unaligned stack pointer
    addq    $8,%rsp     # remove the mh argument, and debugger end frame marker
    movq    $0,%rbp     # restore ebp back to zero
    jmp *%rax       # jump to the entry point

For the LC_MAIN case, which applies to our analysis, different setup steps are performed:

  1. Variables local to __dyld_start are removed
  2. A false return address is loaded onto the stack, which points to libdyld's _start function instead of __dyld_start
  3. argc is loaded into the first argument register (rdi)
  4. argv is loaded into the second argument register (rsi)
  5. envp is loaded into the third argument register (rdx)
  6. The start of the apple array is located and loaded into the fourth argument register (rcx)

# LC_MAIN case, set up stack for call to main()
Lnew:   addq    $16,%rsp    # remove local variables
    pushq   %rdi        # simulate return address into _start in libdyld
    movq    8(%rbp),%rdi    # main param1 = argc into %rdi
    leaq    16(%rbp),%rsi   # main param2 = &argv[0] into %rsi
    leaq    0x8(%rsi,%rdi,8),%rdx # main param3 = &env[0] into %rdx
    movq    %rdx,%rcx
Lapple: movq    (%rcx),%r8
    add $8,%rcx
    testq   %r8,%r8     # look for NULL ending env[] array
    jne Lapple      # main param4 = apple into %rcx

Once everything is configured, the program jumps to the LC_MAIN address.

jmp *%rax       # jump to main(argc,argv,env,apple) with return address set to _start

Our next stop is dyldbootstrap::start.

dyldbootstrap::start

The function is defined in src/dyldInitialization.cpp. Everything in this file is placed under the namespace dyldbootstrap.

The start function is used to get dyld itself into a runnable state. These setup steps are normally handled for target programs by dyld, but the same setup is required for dyld itself to run.

uintptr_t start(const struct macho_header* appsMachHeader, int argc, const char* argv[], 
                intptr_t slide, const struct macho_header* dyldsMachHeader,
                uintptr_t* startGlue)

First, the function checks whether this is a position-independent executable and whether dyld needs to be relocated. We will gloss over these details.

// if kernel had to slide dyld, we need to fix up load sensitive locations
    // we have to do this before using any global variables
    slide = slideOfMainExecutable(dyldsMachHeader);
    bool shouldRebase = slide != 0;
#if __has_feature(ptrauth_calls)
    shouldRebase = true;
#endif
    if ( shouldRebase ) {
        rebaseDyld(dyldsMachHeader, slide);
    }

Next, there is some runtime initialization. The mach_init() function is contained in Apple's libc. The mach_init function initializes Mach Messaging, which provides IPC support.

// allow dyld to use mach messaging
    mach_init();

The envp and apple pointers are properly initialized:

// kernel sets up env pointer to be just past end of agv array
    const char** envp = &argv[argc+1];

    // kernel sets up apple pointer to be just past end of envp array
    const char** apple = envp;
    while(*apple != NULL) { ++apple; }
    ++apple;

And the apple pointer is used to set up a value for the stack overflow guard. Interestingly, dyld provides its own stack protector routines. The __guard_setup function is defined in src/glue.c.

// set up random value for stack canary
    __guard_setup(apple);

Once setup is complete, dyld::_main is invoked:

// now that we are done bootstrapping dyld, call dyld's main
    uintptr_t appsSlide = slideOfMainExecutable(appsMachHeader);
    return dyld::_main(appsMachHeader, appsSlide, argc, argv, envp, apple, startGlue);

dyld::_main

The dyld::_main function is implemented at src/dyld.cpp.

uintptr_t
_main(const macho_header* mainExecutableMH, uintptr_t mainExecutableSlide, 
        int argc, const char* argv[], const char* envp[], const char* apple[], 
        uintptr_t* startGlue)

This function is the functional entry point for the dyld program. This function returns the address of the LC_MAIN function in the target program. This address is used by __dyld_start to invoke that program.

There's a lot going on here, and I'm simplifying some of the logic for the purposes of this analysis. Don't be surprised when you look at dyld.cpp and see things I've left out. I will be providing a verbal summary of many helper functions rather than clutter this analysis with their details. I've also removed the following code to simplify the function:

  • Debugging code, such as:
    • kdebug trace functions
    • CRSetCrashLogMessage calls
    • Print options that are enabled by environment variable settings
  • iOS simulator ifdefs
  • arm64 ifdefs
  • __MAC_OS_X_VERSION_MIN_REQUIRED ifdefs
  • SUPPORT_ACCELERATE_TABLES ifdefs
  • SUPPORT_OLD_CRT_INITIALIZATION ifdefs
  • SUPPORT_VERSIONED_PATHS ifdefs
  • ptrauth_calls
  • gdb notify functions
  • sSkipMain logic, which is used for validating dyld itself
  • Monitoring code

First, the CDHash for the target program is read from the apple buffer. This hash is used to validate that the image is properly signed.

// Grab the cdHash of the main executable from the environment
    uint8_t mainExecutableCDHashBuffer[20];
    const uint8_t* mainExecutableCDHash = nullptr;
    if ( hexToBytes(_simple_getenv(apple, "executable_cdhash"), 40, mainExecutableCDHashBuffer) )
        mainExecutableCDHash = mainExecutableCDHashBuffer;
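The call above converts 40 hexadecimal characters into a 20-byte hash. The following is a minimal sketch of such a conversion, written for this article rather than copied from dyld: the names hex_value and hex_to_bytes_sketch are hypothetical, and the real hexToBytes helper may differ in its details. The sketch does assume the helper must tolerate a null string, since the lookup in the apple array can fail.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Value of one hexadecimal digit, or -1 if the character is not a hex digit.
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Converts exactly `hex_length` hex characters into hex_length / 2 bytes.
// Returns false for a null string, a length mismatch, or a non-hex character.
bool hex_to_bytes_sketch(const char* text, std::size_t hex_length, std::uint8_t* out) {
    if (text == nullptr || hex_length % 2 != 0 || std::strlen(text) != hex_length) {
        return false;
    }
    for (std::size_t i = 0; i < hex_length; i += 2) {
        const int high = hex_value(text[i]);
        const int low = hex_value(text[i + 1]);
        if (high < 0 || low < 0) {
            return false;
        }
        // Two hex digits make one byte: high nibble first.
        out[i / 2] = static_cast<std::uint8_t>((high << 4) | low);
    }
    return true;
}
```

With a 20-byte output buffer and a length of 40, this mirrors the shape of the call in the snippet above: on failure the caller simply leaves mainExecutableCDHash as nullptr.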

Variables are declared and initialized:

    uintptr_t result = 0;
    sMainExecutableMachHeader = mainExecutableMH;
    sMainExecutableSlide = mainExecutableSlide;

The arguments to _main are passed to the setContext function, which initializes a global ImageLoader::LinkContext structure with the appropriate values:

setContext(mainExecutableMH, argc, argv, envp, apple);

The executable_path environment variable is accessed from the apple array and made into an absolute path. A "short name", which represents the binary name without a path, is also captured.

// Pickup the pointer to the exec path.
    sExecPath = _simple_getenv(apple, "executable_path");

    // <rdar://problem/13868260> Remove interim apple[0] transition code from dyld
    if (!sExecPath) sExecPath = apple[0];

    if ( sExecPath[0] != '/' ) {
        // have relative path, use cwd to make absolute
        char cwdbuff[MAXPATHLEN];
        if ( getcwd(cwdbuff, MAXPATHLEN) != NULL ) {
            // maybe use static buffer to avoid calling malloc so early...
            char* s = new char[strlen(cwdbuff) + strlen(sExecPath) + 2];
            strcpy(s, cwdbuff);
            strcat(s, "/");
            strcat(s, sExecPath);
            sExecPath = s;
        }
    }

    // Remember short name of process for later logging
    sExecShortName = ::strrchr(sExecPath, '/');
    if ( sExecShortName != NULL )
        ++sExecShortName;
    else
        sExecShortName = sExecPath;
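The same two steps read more easily when written with std::string. This is only an illustrative sketch: dyld works with raw C strings here (the comment in its own code hints at wanting to avoid allocation this early), and the function names make_absolute and short_name are hypothetical.

```cpp
#include <string>

// Prefix a relative path with the current working directory.
std::string make_absolute(const std::string& exec_path, const std::string& cwd) {
    if (!exec_path.empty() && exec_path[0] == '/') {
        return exec_path; // already absolute
    }
    return cwd + "/" + exec_path;
}

// Everything after the last '/', or the whole string if there is no '/'.
std::string short_name(const std::string& exec_path) {
    const std::string::size_type slash = exec_path.rfind('/');
    return slash == std::string::npos ? exec_path : exec_path.substr(slash + 1);
}
```

For the test program in this article, launched as buildresults/test/libmemory_freelist_test, the short name would be libmemory_freelist_test.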

Process restrictions are applied by dyld, which updates the global ImageLoader::LinkContext structure.

configureProcessRestrictions(mainExecutableMH);

Next, dyld checks the environment variables passed to the program to see if there are any that apply to dyld (e.g., DYLD_FRAMEWORK_PATH, DYLD_IMAGE_SUFFIX). All dyld-related environment variables are captured and handled within the checkEnvironmentVariables call chain. If DYLD_FALLBACK_FRAMEWORK_PATH or DYLD_FALLBACK_LIBRARY_PATH environment variables were not passed to the application, then default values are applied by defaultUninitializedFallbackPaths.

    checkEnvironmentVariables(envp);
    defaultUninitializedFallbackPaths(envp);
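The core idea of the environment scan can be sketched as a filter over the NULL-terminated envp array that keeps only entries beginning with DYLD_. This is an assumption-laden illustration rather than dyld's implementation: the function name dyld_variables is hypothetical, and the real call chain goes on to parse and act on each variable instead of merely collecting it.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Collect every "KEY=value" entry whose key starts with "DYLD_".
std::vector<std::string> dyld_variables(const char* const envp[]) {
    std::vector<std::string> found;
    for (const char* const* entry = envp; *entry != nullptr; ++entry) {
        if (std::strncmp(*entry, "DYLD_", 5) == 0) {
            found.emplace_back(*entry);
        }
    }
    return found;
}
```

Given an environment containing HOME, DYLD_PRINT_LIBRARIES=1, PATH, and DYLD_LIBRARY_PATH=/opt/lib, the sketch returns the two DYLD_ entries in their original order.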

The host CPU type (e.g. CPU_TYPE_X86_64) and subtype (e.g., CPU_SUBTYPE_X86_64_H for Haswell) are stored by the getHostInfo function:

getHostInfo(mainExecutableMH, mainExecutableSlide);

Unless the linker context has been told not to use a shared region, the global shared cache will be initialized and its address stored in the global ImageLoader::LinkContext structure. This global cache contains all system libraries and can be used to cache dyld closure information for an app to reduce load times. In short, a closure contains all the information needed to launch an application; you can learn more here and here.

    if ( gLinkContext.sharedRegionMode != ImageLoader::kDontUseSharedRegion ) {
        mapSharedCache();
    }

We're going to skip the closure processing for the sake of brevity, but it is worth mentioning because it is a potential return point for the _main function.

Following the mapping of the shared cache, the cache is checked to see if there is a relevant closure for the target program. If one is found, dyld tries to use the closure to launch the application. We'll see the process in greater detail later, but the launch process ensures that dylib images are loaded, libdyld is notified of the program's variables, initializers are called, the startGlue variable is set to the correct libdyld start function, and the entry address is correctly set for the target program.

If the closure was successfully launched, the address of the entry function will have been stored in result and we can return from _main:

if ( mainClosure != nullptr ) {
    bool launched = launchWithClosure(mainClosure, sSharedCacheLoadInfo.loadAddress, (dyld3::MachOLoaded*)mainExecutableMH,
                                              mainExecutableSlide, argc, argv, envp, apple, &result, startGlue);

    if ( launched ) {
        return result;
    }
}

If no closure was found, or the global cache was not enabled, dyld continues with the standard launch procedure.

A variety of containers have storage pre-allocated:

// make initial allocations large enough that it is unlikely to need to be re-alloced
    sImageRoots.reserve(16);
    sAddImageCallbacks.reserve(4);
    sRemoveImageCallbacks.reserve(4);
    sAddLoadImageCallbacks.reserve(4);
    sImageFilesNeedingTermination.reserve(16);
    sImageFilesNeedingDOFUnregistration.reserve(8);
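
The benefit of reserving up front is that later insertions do not trigger a reallocation as long as the container stays within the reserved capacity. A small sketch using std::vector illustrates this; the function name fillWithoutReallocating is hypothetical and exists only for this demonstration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical demonstration: reserve room for n more elements, append them,
// and report whether the capacity stayed the same (no reallocation happened).
bool fillWithoutReallocating(std::vector<int>& values, std::size_t n) {
    const std::size_t target = values.size() + n;
    values.reserve(target);
    assert(values.capacity() >= target);
    const std::size_t capacityBefore = values.capacity();
    for (std::size_t i = 0; i < n; ++i)
        values.push_back(static_cast<int>(i));
    return values.capacity() == capacityBefore;
}
```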

We then enter a massive try/catch block.

try {
    // ... up next
}
catch(const char* message) {
    syncAllImages();
    halt(message);
}
catch(...) {
    dyld::log("dyld: launch failed\n");
}
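
The error-handling pattern here relies on throwing a plain C string and catching it by type, with a catch-all for anything else. The following is a simplified sketch of that pattern, not dyld's code: launchStep and runLaunch are hypothetical names, and the handlers return a string where dyld would halt or log.

```cpp
#include <cassert>
#include <string>

// Hypothetical launch step: mode 1 throws a C string, mode 2 throws an int.
void launchStep(int mode) {
    if (mode == 1) throw "library not loaded";
    if (mode == 2) throw 42;
}

// Mirrors the shape of the try/catch above: a typed handler for messages
// and a catch-all for every other exception type.
std::string runLaunch(int mode) {
    try {
        launchStep(mode);
        return "ok";
    }
    catch (const char* message) {
        return std::string("halt: ") + message;
    }
    catch (...) {
        return "launch failed";
    }
}
```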

Inside the try block is where the bulk of loading happens. First, dyld itself is added to a UUID list to enable symbolification of stack snapshots involving dyld.

addDyldImageToUUIDList();

Next, the executable's Mach-O header is checked for compatibility with dyld, and then ImageLoader is instantiated for the target program. The global ImageLoader::LinkContext structure is updated with the new ImageLoader handle. The link context structure also stores a bool indicating whether an LC_CODE_SIGNATURE command is found in the Mach-O header.

There is additional logic to determine whether old Mach-O binaries are supported; for our current analysis, we will assume that strict binaries are used.

sMainExecutable = instantiateFromLoadedImage(mainExecutableMH, mainExecutableSlide, sExecPath);
        gLinkContext.mainExecutable = sMainExecutable;
        gLinkContext.mainExecutableCodeSigned = hasCodeSignatureLoadCommand(mainExecutableMH);
        gLinkContext.strictMachORequired = true;

Another container has space pre-allocated:

sAllImages.reserve(INITIAL_IMAGE_COUNT);

The dyld_all_image_infos list doesn't contain dyld, so the path is determined and stored in a global process info buffer:

// get path of dyld itself
        void*  addressInDyld = (void*)&__dso_handle;

        char dyldPathBuffer[MAXPATHLEN+1];
        int len = proc_regionfilename(getpid(), (uint64_t)(long)addressInDyld, dyldPathBuffer, MAXPATHLEN);
        if ( len > 0 ) {
            dyldPathBuffer[len] = '\0'; // proc_regionfilename() does not zero terminate returned string
            if ( strcmp(dyldPathBuffer, gProcessInfo->dyldPath) != 0 )
                gProcessInfo->dyldPath = strdup(dyldPathBuffer);
        }

If the DYLD_INSERT_LIBRARIES environment variable was set, dyld will attempt to load all of the specified libraries:

// load any inserted libraries
        if  ( sEnv.DYLD_INSERT_LIBRARIES != NULL ) {
            for (const char* const* lib = sEnv.DYLD_INSERT_LIBRARIES; *lib != NULL; ++lib) 
                loadInsertedDylib(*lib);
        }
        // record count of inserted libraries so that a flat search will look at 
        // inserted libraries, then main, then others.
        sInsertedDylibCount = sAllImages.size()-1;
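
DYLD_INSERT_LIBRARIES is given as a colon-separated list of library paths, which is why the loop above walks an array of strings. As an illustration only, here is one way such a value could be split; splitColonList is a hypothetical helper and is not how dyld parses its environment.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split "a:b:c" into {"a", "b", "c"}, skipping empty items.
std::vector<std::string> splitColonList(const std::string& value) {
    std::vector<std::string> items;
    std::size_t start = 0;
    while (start <= value.size()) {
        std::size_t end = value.find(':', start);
        if (end == std::string::npos)
            end = value.size();
        if (end > start)
            items.push_back(value.substr(start, end - start));
        start = end + 1;
    }
    return items;
}
```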

Next, we link the target executable.

Multiple images may be found in a single executable, e.g. with a bundle. Each image will be added to a master image list. In addition, a mapping of each segment's start and end address will be stored. Next, all libraries referenced by each image are recursively loaded. The link function would normally bind symbols, but since the third argument (preflightOnly) is true, the link function will return once libraries are loaded.

// link main executable
gLinkContext.linkingMainExecutable = true;

link(sMainExecutable, sEnv.DYLD_BIND_AT_LAUNCH, true, ImageLoader::RPathChain(NULL, NULL), -1);

There's a lot of machinery to make library loading and symbol binding happen. For the purposes of our analysis (and the length of this article), I'm going to gloss over this process. You can find the implementation details in ImageLoader.cpp.

Additional attributes are set and checked for the target program:

sMainExecutable->setNeverUnloadRecursive();
        if ( sMainExecutable->forceFlat() ) {
            gLinkContext.bindFlat = true;
            gLinkContext.prebindUsage = ImageLoader::kUseNoPrebinding;
        }

Next, we perform the same link step for inserted libraries (those specified by the DYLD_INSERT_LIBRARIES environment variable):

// link any inserted libraries
        // do this after linking main executable so that any dylibs pulled in by inserted 
        // dylibs (e.g. libSystem) will not be in front of dylibs the program uses
        if ( sInsertedDylibCount > 0 ) {
            for(unsigned int i=0; i < sInsertedDylibCount; ++i) {
                ImageLoader* image = sAllImages[i+1];
                link(image, sEnv.DYLD_BIND_AT_LAUNCH, true, ImageLoader::RPathChain(NULL, NULL), -1);
                image->setNeverUnloadRecursive();
            }

Next, function interposing is configured and applied. Function interposing enables you to replace library functions with your own implementations, if needed.

// only INSERTED libraries can interpose
            // register interposing info after all inserted libraries are bound so chaining works
            for(unsigned int i=0; i < sInsertedDylibCount; ++i) {
                ImageLoader* image = sAllImages[i+1];
                image->registerInterposing(gLinkContext);
            }
        }

        // apply interposing to initial set of images
        for(int i=0; i < sImageRoots.size(); ++i) {
            sImageRoots[i]->applyInterposing(gLinkContext);
        }
        ImageLoader::applyInterposingToDyldCache(gLinkContext);
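
Conceptually, interposing boils down to a list of (replacement, replacee) pairs that is applied to already-bound function pointers. The sketch below models that idea with an ordinary vector of function pointers. Everything in it (InterposeTuple, applyInterposing, the two sample functions) is hypothetical and greatly simplified compared to what dyld does with real images.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Fn = int (*)(int);

// One interposing request: calls aimed at `replacee` should go to `replacement`.
struct InterposeTuple {
    Fn replacement;
    Fn replacee;
};

int originalDouble(int x) { return 2 * x; }
int tracedDouble(int x)   { return 2 * x + 1000; }  // stand-in for a wrapper

// Rewrite every bound pointer that targets a replacee; returns the patch count.
std::size_t applyInterposing(std::vector<Fn>& boundPointers,
                             const std::vector<InterposeTuple>& tuples) {
    std::size_t patched = 0;
    for (Fn& slot : boundPointers) {
        for (const InterposeTuple& tuple : tuples) {
            if (slot == tuple.replacee) {
                slot = tuple.replacement;
                ++patched;
                break;
            }
        }
    }
    return patched;
}
```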

We note that the main executable linking is complete:

gLinkContext.linkingMainExecutable = false;

At this point, we can bind symbols from our loaded libraries. By default, only normal (non-lazy) symbols will be bound at this point, although the DYLD_BIND_AT_LAUNCH environment variable can be used to override that behavior.

// Bind and notify for the main executable now that interposing has been registered
        uint64_t bindMainExecutableStartTime = mach_absolute_time();
        sMainExecutable->recursiveBindWithAccounting(gLinkContext, sEnv.DYLD_BIND_AT_LAUNCH, true);
        uint64_t bindMainExecutableEndTime = mach_absolute_time();
        ImageLoaderMachO::fgTotalBindTime += bindMainExecutableEndTime - bindMainExecutableStartTime;
        gLinkContext.notifyBatch(dyld_image_state_bound, false);

        // Bind and notify for the inserted images now interposing has been registered
        if ( sInsertedDylibCount > 0 ) {
            for(unsigned int i=0; i < sInsertedDylibCount; ++i) {
                ImageLoader* image = sAllImages[i+1];
                image->recursiveBind(gLinkContext, sEnv.DYLD_BIND_AT_LAUNCH, true);
            }
        }

        // <rdar://problem/12186933> do weak binding only after all inserted images linked
        sMainExecutable->weakBind(gLinkContext);
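
The difference between lazy and non-lazy binding can be illustrated with a function pointer that is resolved either immediately or on first use. This is a toy model under stated assumptions: lookupSymbol, BoundSymbol, and the lookup counter are hypothetical and stand in for the loader's real symbol resolution.

```cpp
#include <cassert>
#include <cstring>

using Fn = int (*)(int);

int addOne(int x) { return x + 1; }

int gLookups = 0;  // counts how often the (pretend) symbol table is consulted

// Hypothetical symbol lookup with exactly one known symbol.
Fn lookupSymbol(const char* name) {
    ++gLookups;
    return std::strcmp(name, "addOne") == 0 ? &addOne : nullptr;
}

struct BoundSymbol {
    const char* name;
    Fn target = nullptr;

    // bindNow == true models non-lazy binding (or binding everything at launch).
    BoundSymbol(const char* symbolName, bool bindNow) : name(symbolName) {
        if (bindNow)
            target = lookupSymbol(name);
    }

    // Lazy symbols are resolved on first call and cached afterwards.
    int call(int x) {
        if (target == nullptr)
            target = lookupSymbol(name);
        assert(target != nullptr);
        return target(x);
    }
};
```

The trade-off is the usual one: eager binding pays the lookup cost at launch, while lazy binding defers it to the first call of each function.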


We're in the home stretch! Our libraries are loaded and symbols are bound. Now we can safely call all of the initialization functions (e.g., those marked with __attribute__((constructor))) that were registered by our target program and the loaded libraries. We'll look at this function next.

// run all initializers
        initializeMainExecutable();
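
If you have not used initializers before, here is a small sketch of two kinds of code that run before main: a function marked with the GCC/Clang constructor attribute, and the constructor of a global C++ object. The names gInitializersRun, earlyInit, and StaticObject are made up for this example.

```cpp
#include <cassert>

// Constant-initialized, so it is valid before any initializer runs.
static int gInitializersRun = 0;

// Runs before main() when built with GCC or Clang.
__attribute__((constructor))
static void earlyInit() {
    ++gInitializersRun;
}

// The constructor of a global object also runs before main().
struct StaticObject {
    StaticObject() { ++gInitializersRun; }
};
static StaticObject gStaticObject;
```

By the time main starts, both initializers have run and gInitializersRun is 2. The example deliberately does not depend on which of the two runs first.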

Once we've called all initialization functions, we find and set the entry point for our target program.

dyld looks for an LC_MAIN command in the Mach-O header. If this command is found, the address is calculated and returned. If there is no LC_MAIN command in the Mach-O header, NULL is returned. This would indicate a program using the old LC_UNIXTHREAD model.

// find entry point for main executable
        result = (uintptr_t)sMainExecutable->getEntryFromLC_MAIN();

If LC_MAIN was found, dyld finds the relevant startGlue function for the target architecture. This function is used as the return point for our target program's entry function (and to hide the backtrace of what happens before main).

If LC_MAIN was not found, startGlue is set to 0, and the entry function is read from the LC_UNIXTHREAD command.

if ( result != 0 ) {
            // main executable uses LC_MAIN, we need to use helper in libdyld to call into main()
            if ( (gLibSystemHelpers != NULL) && (gLibSystemHelpers->version >= 9) )
                *startGlue = (uintptr_t)gLibSystemHelpers->startGlueToCallExit;
            else
                halt("libdyld.dylib support not present for LC_MAIN");
        }
        else {
            // main executable uses LC_UNIXTHREAD, dyld needs to let "start" in program set up for main()
            result = (uintptr_t)sMainExecutable->getEntryFromLC_UNIXTHREAD();
            *startGlue = 0;
        }
    }

Finally, if we made it this far, we can return the entry point result:

return result;

We'll continue our investigation with initializeMainExecutable.

initializeMainExecutable

The initializeMainExecutable function is implemented in src/dyld.cpp.

void initializeMainExecutable()

This function calls all of the initialization functions that were identified in the target program and dynamically linked libraries.

First, initializers from the inserted dynamic libraries are invoked:

// run initializers for any inserted dylibs
    ImageLoader::InitializerTimingList initializerTimes[allImagesCount()];
    initializerTimes[0].count = 0;
    const size_t rootCount = sImageRoots.size();
    if ( rootCount > 1 ) {
        for(size_t i=1; i < rootCount; ++i) {
            sImageRoots[i]->runInitializers(gLinkContext, initializerTimes[0]);
        }
    }

Next, the initializers for the target program and its libraries are invoked:

// run initializers for main executable and everything it brings up 
    sMainExecutable->runInitializers(gLinkContext, initializerTimes[0]);

Before returning, we register a function with cxa_atexit to run static termination functions when the program exits. This function iterates through each loaded image and terminates it.

// register cxa_atexit() handler to run static terminators in all loaded images when this process exits
    if ( gLibSystemHelpers != NULL ) 
        (*gLibSystemHelpers->cxa_atexit)(&runAllStaticTerminators, NULL, NULL);
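
Exit-time registration follows a simple contract: handlers are stored when registered and invoked later, and standard atexit handlers run in reverse order of registration. The sketch below models that contract with a small registry so the ordering can be observed directly. ExitHandlers is a hypothetical class and is not how libSystem implements cxa_atexit.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical registry that mimics atexit-style ordering (last in, first out).
class ExitHandlers {
public:
    void registerHandler(std::function<void()> handler) {
        handlers_.push_back(std::move(handler));
    }

    // Run and discard every handler, most recently registered first.
    void runAll() {
        while (!handlers_.empty()) {
            std::function<void()> handler = std::move(handlers_.back());
            handlers_.pop_back();
            handler();
        }
    }

private:
    std::vector<std::function<void()>> handlers_;
};
```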

The runInitializers function invoked for the target program is actually implemented in src/ImageLoader.cpp.

void ImageLoader::runInitializers(const LinkContext& context, InitializerTimingList& timingInfo)
{
    uint64_t t1 = mach_absolute_time();
    mach_port_t thisThread = mach_thread_self();
    ImageLoader::UninitedUpwards up;
    up.count = 1;
    up.images[0] = this;
    processInitializers(context, thisThread, timingInfo, up);
    context.notifyBatch(dyld_image_state_initialized, false);
    mach_port_deallocate(mach_task_self(), thisThread);
    uint64_t t2 = mach_absolute_time();
    fgTotalInitTime += (t2 - t1);
}

The ImageLoader::processInitializers function recursively initializes each image, calling any initialization functions contained in the image. There's a long call chain here, but essentially dyld looks through the Mach-O load commands to identify initialization functions. Qualified functions include those specified in LC_ROUTINES_COMMAND, or functions in a LC_SEGMENT_COMMAND which have a section corresponding to the type S_MOD_INIT_FUNC_POINTERS.

If you're interested in how dyld goes about identifying and calling initializers, review src/ImageLoaderMachO.cpp.
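
The recursive part of this process can be pictured as a depth-first walk over the image dependency graph, where an image's dependencies are initialized before the image itself. Here is a minimal sketch of that ordering. The Image struct and recursiveInitialize function are hypothetical simplifications, and the order vector stands in for actually calling initializer functions.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a loaded image and the images it depends on.
struct Image {
    explicit Image(std::string imageName) : name(std::move(imageName)) {}
    std::string name;
    std::vector<Image*> dependencies;
    bool initialized = false;
};

// Initialize dependencies first, then the image itself.
void recursiveInitialize(Image& image, std::vector<std::string>& order) {
    if (image.initialized)
        return;
    image.initialized = true;   // mark early so dependency cycles terminate
    for (Image* dependency : image.dependencies) {
        assert(dependency != nullptr);
        recursiveInitialize(*dependency, order);
    }
    order.push_back(image.name);  // this is where initializers would be called
}
```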

Start Glue

Finally, we can explain the mysterious backtrace we encountered during our initial exploration. It came from dyld3/start_glue.s.

As we saw in __dyld_start, this function is used as the return address for our LC_MAIN function. When main returns, control lands in this start stub, which passes main's return value to exit (the _exit symbol in the assembly below).

The implementation perfectly matches what we saw in the disassembly:

.align 2
    .globl _start
    .private_extern _start
_start:
    nop        # <rdar://problem/10753356> backtraces of LC_MAIN binaries don't end in "start"
Lstart:
    movl    %eax,%edi        # pass result from main() to exit() 
    call    _exit
    hlt

From what I've gathered while reviewing the dyld source, this "fake" start function is used to hide dyld functions and arguments when you're capturing a backtrace.
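
In C++ terms, the glue behaves roughly like "call main, then hand its return value to exit". The sketch below expresses that with injectable function pointers so the behavior can be observed without terminating the process. All names here (startGlueSketch, sampleMain, recordExit) are hypothetical, and unlike the real glue this version returns to its caller.

```cpp
#include <cassert>

using MainFn = int (*)(int, const char**);
using ExitFn = void (*)(int);

// Rough equivalent of the assembly: pass main()'s result to the exit function.
void startGlueSketch(MainFn mainFn, ExitFn exitFn, int argc, const char** argv) {
    assert(mainFn != nullptr && exitFn != nullptr);
    int result = mainFn(argc, argv);
    exitFn(result);   // the real glue calls exit() here and never returns
}

// Stand-ins used to observe the behavior.
static int gRecordedExitCode = -1;
int sampleMain(int, const char**) { return 7; }
void recordExit(int code) { gRecordedExitCode = code; }
```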

libSystem

Next, we'll make a brief pit stop in libSystem, which is the collection of system libraries on OS X. You can browse the source online or download a tarball.

Our primary interest in libSystem is the libSystem_initializer function. This function is defined in init.c.

Because this function is marked with a constructor attribute, it will run when dyld loads the library and calls initializers. This is how the C runtime gets initialized for our target program.

// libSystem_initializer() initializes all of libSystem.dylib
// <rdar://problem/4892197>
__attribute__((constructor))
static void
libSystem_initializer(int argc,
              const char* argv[],
              const char* envp[],
              const char* apple[],
              const struct ProgramVars* vars)

I'm not going to go into detail on each call in this function. Instead, I'll leave the entire contents here so you can get an overview of the level of initialization performed by libSystem.

{
    static const struct _libkernel_functions libkernel_funcs = {
        .version = 3,
        // V1 functions
        .dlsym = dlsym,
        .malloc = malloc,
        .free = free,
        .realloc = realloc,
        ._pthread_exit_if_canceled = _pthread_exit_if_canceled,
        // V2 functions (removed)
        // V3 functions
        .pthread_clear_qos_tsd = _pthread_clear_qos_tsd,
    };

    static const struct _libpthread_functions libpthread_funcs = {
        .version = 2,
        .exit = exit,
        .malloc = malloc,
        .free = free,
    };

    static const struct _libc_functions libc_funcs = {
        .version = 1,
        .atfork_prepare = libSystem_atfork_prepare,
        .atfork_parent = libSystem_atfork_parent,
        .atfork_child = libSystem_atfork_child,
#if defined(HAVE_SYSTEM_CORESERVICES)
        .dirhelper = _dirhelper,
#endif
    };

    __libkernel_init(&libkernel_funcs, envp, apple, vars);

    __libplatform_init(NULL, envp, apple, vars);

    __pthread_init(&libpthread_funcs, envp, apple, vars);

    _libc_initializer(&libc_funcs, envp, apple, vars);

    // TODO: Move __malloc_init before __libc_init after breaking malloc's upward link to Libc
    __malloc_init(apple);

#if TARGET_OS_OSX
    /* <rdar://problem/9664631> */
    __keymgr_initializer();
#endif

    _dyld_initializer();

    libdispatch_init();
    _libxpc_initializer();

    // must be initialized after dispatch
    _libtrace_init();

#if !(TARGET_OS_EMBEDDED || TARGET_OS_SIMULATOR)
    _libsecinit_initializer();
#endif

#if TARGET_OS_EMBEDDED
    _container_init(apple);
#endif

    __libdarwin_init();

    __stack_logging_early_finished();

#if TARGET_OS_EMBEDDED && TARGET_OS_IOS && !__LP64__
    _vminterpose_init();
#endif

#if !TARGET_OS_IPHONE
    /* <rdar://problem/22139800> - Preserve the old behavior of apple[] for
     * programs that haven't linked against newer SDK.
     */
#define APPLE0_PREFIX "executable_path="
    if (dyld_get_program_sdk_version() < DYLD_MACOSX_VERSION_10_11){
        if (strncmp(apple[0], APPLE0_PREFIX, strlen(APPLE0_PREFIX)) == 0){
            apple[0] = apple[0] + strlen(APPLE0_PREFIX);
        }
    }
#endif

    /* <rdar://problem/11588042>
     * C99 standard has the following in section 7.5(3):
     * "The value of errno is zero at program startup, but is never set
     * to zero by any library function."
     */
    errno = 0;
}
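
One pattern worth calling out in this function is the versioned table of function pointers that is handed to each lower-level library's init routine, which lets a lower layer call services such as malloc without linking upward. Here is a minimal sketch of the pattern. AllocatorFunctions, lowLevelInit, lowLevelAllocate, and lowLevelFree are hypothetical names, and the version check is an assumption about how such a table might be validated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical versioned function table supplied by a higher layer.
struct AllocatorFunctions {
    unsigned long version;
    void* (*allocate)(std::size_t);
    void  (*release)(void*);
};

static const AllocatorFunctions* gAllocator = nullptr;

// Hypothetical init routine: accept the table only if it is recent enough.
bool lowLevelInit(const AllocatorFunctions* functions) {
    if (functions == nullptr || functions->version < 1)
        return false;
    gAllocator = functions;
    return true;
}

// The lower layer calls through the table instead of linking to malloc/free.
void* lowLevelAllocate(std::size_t size) {
    assert(gAllocator != nullptr);
    return gAllocator->allocate(size);
}

void lowLevelFree(void* pointer) {
    assert(gAllocator != nullptr);
    gAllocator->release(pointer);
}
```

A caller would build the table once, for example with version 1 and pointers to std::malloc and std::free, and pass it to lowLevelInit before any allocation is attempted.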

Visual Summary

It's quite easy to become lost when traveling through program boot land. Here is a visual guide which you can use to help keep your place in the flow. This diagram has been simplified and serves as a high-level overview of the boot flow. Many function call stacks are hidden from this view.

Simplified summary of the OS X program startup control flow. execve invokes dyld, which prepares the environment, loads linked libraries, binds symbols, calls initializers, and then invokes the target program’s main() function.

Startup Activity Checklist

In the first article of this series, we reviewed a broad range of startup activities that occur before main is called.

Here is a checklist of actions that were observed in the OS X program startup procedures:

  • [ ] Early low-level initialization of the processor/hardware
  • [x] Stack initialization
  • [x] Frame pointer initialization
  • [x] C/C++ runtime setup
    • [x] Handle relocations (some sections are copied from flash to RAM)
    • [x] Initialize .bss
    • [x] Call global constructors
    • [x] Prepare argc, argv
    • [x] Prepare environment variables
    • [x] Heap initialization
    • [x] stdio initialization
    • [x] Initialize exception support
    • [x] Register destructors and other exit-time functionality
  • [x] System scaffolding setup
    • [x] Threading support
    • [x] Thread local storage (via pthread)
    • [x] Buffer overrun detection
      • Sets up a stack canary value
    • [x] Run-time error checks
    • [ ] Locale settings
    • [ ] Math error handling
    • [ ] Math precision
  • [x] Jump to main
  • [x] Exit after main

Further Reading

Missing /usr/include after updating to OSX 10.14? Try this fix

Last week, I updated OSX to 10.14. After installing the Xcode command line tools, I noticed that most of my projects were failing to compile. I did some poking around and found that /usr/include/ was missing.

It seems that Apple updated their software tools to look within the OSX SDK path for headers. Unfortunately, for most projects, this scheme doesn't work. Who knows what they were thinking - I doubt we're going to see a wave of projects converting to using xcode-select as a result.

To install headers at their old /usr/include location, simply run this package file:

/Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg

After installing the headers, you should be able to compile again.

Missing headers after an OSX Update? Try this!

I recently opened my Macbook to start working on a few personal projects, and suddenly I noticed that all of my projects were failing to compile. I was pretty surprised - I'm not one to leave my master branch in a broken state!

I finally paid attention to the problem the compiler was complaining about: all of the system headers were missing! Sure enough, I went searching the filesystem and found that the compiler wasn't crazy and that the files had been removed. After some investigation, I discovered that my Macbook had auto-updated OSX and Xcode the previous night. Aha!

If you ever end up in a situation where you suddenly discover that your system headers are missing, try reinstalling the Xcode command line tools:

xcode-select --install

Once everything has been reinstalled, your software should start building again!

(The ultimate question still remains: why the hell do they delete my headers during an update?!)