clang

Demystifying ARM Floating Point Compiler Options

When I first started bringing up new ARM platforms, I was pretty confused by the various floating point options such as -mfloat-abi=softfp or -mfpu=fpv4-sp-d16. I imagine this is confusing to other developers as well, so I'd like to share my ARM floating-point cheat sheet with the world.

An Overview of the ARM Floating-Point Architecture

Before we dive into compiler options, there are a few ARM floating-point details we should familiarize ourselves with: the ARM EABI, VFP, and NEON.

ARM EABI

An ABI is a specification which defines the rules that a generated program must follow to work with a specific platform or interface. The ARM EABI defines the rules for an ARM platform, and your compiler will build your program according to those rules.

The ARM EABI specification defines two incompatible ABIs: one which uses floating-point registers for function arguments, and another which does not. If we are not using hardware floating-point operations, we can simply build our program without using the floating-point compatible ABI.

Since ARM defines a standard floating-point instruction set, we can still utilize the floating-point ABI even if our chip does not support the actual hardware. If floating-point hardware is not present, the instructions will be trapped and executed by a floating-point emulation module instead, provided the platform supplies one. The only real difference in functionality is slower execution speed when using software emulation.

Since the ABI defines interfaces for our programs, we must compile and link all of our components and libraries using the same ABI.

Vector Floating Point (VFP)

Vector Floating Point (VFP) is the name for ARM's floating-point extension. Prior to ARMv8, VFP was implemented as a coprocessor extension. The VFP coprocessor supports both single and double-precision floating point operations according to the IEEE 754 standard. For practical purposes, VFP is not useful for vector operations and should be considered a normal scalar floating-point unit (FPU). VFP has been replaced with NEON as of ARMv8.

The VFP extensions are optional parts of the ARM architecture, though the majority of Cortex-A processors do provide a floating-point unit. Some Cortex-A8 devices may utilize a reduced VFPLite module instead of a full VFP module. This VFPLite module requires roughly a 10x increase in clock cycles per floating-point operation.

VFP Versions

Here's a high-level summary of the different VFP versions that have been released throughout the years.

  • VFPv1
    • Obsoleted by ARM
  • VFPv2
    • 16 64-bit FPU registers
    • Optional extension to the ARM instruction set in the ARMv5TE, ARMv5TEJ, ARMv6, and ARMv6K architectures
    • Optional extension to the ARM and Thumb instruction set in the ARMv6T2 architecture
    • Supports standard FPU arithmetic (add, sub, neg, mul, div), full square root
  • VFPv3
    • Backwards compatible with VFPv2, except that it cannot trap floating-point exceptions
    • Adds VCVT instructions to convert between scalar, float and double
    • Adds immediate mode to VMOV such that constants can be loaded into FPU registers.
    • VFPv3-D32
      • 32 64-bit FPU registers
      • Implemented on most Cortex-A8 and A9 ARMv7 processors
    • VFPv3-D16
      • 16 64-bit FPU registers
      • Implemented on Cortex-R4 and R5 processors and the Tegra 2 (Cortex-A9).
    • VFPv3-F16
      • Uncommon
      • Supports IEEE754-2008 half-precision (16-bit) floating point as a storage format
    • VFPv3U
      • A variant of VFPv3 that supports the trapping of floating-point exceptions to support code.
      • Can support single- or half-precision floating point
  • VFPv4
    • Built on VFPv3
    • Adds half-precision support as a storage format
    • Adds fused multiply-accumulate instructions
    • VFPv4-D32
      • 32 64-bit FPU registers
      • Implemented on the Cortex-A12 and A15 ARMv7 processors
      • Cortex-A7 optionally has VFPv4-D32 (in the case of an FPU with NEON)
    • VFPv4-D16
      • 16 64-bit FPU registers
      • Implemented on Cortex-A5 and A7 processors (in case of an FPU without NEON)
    • VFPv4U
      • A variant of VFPv4 that supports the trapping of floating-point exceptions to support code
      • Can support single- or half-precision floating point
  • VFPv5
    • Implemented on Cortex-M7 when single and double-precision floating-point core option exists

NEON

NEON, the "Advanced Single Instruction Multiple Data (SIMD) Extension", is ARM's successor to the VFP coprocessor. NEON is a VFP extension which allows for efficient matrix and vector data manipulation and is commonly used in signal-processing applications. Prior to ARMv8, the ARM architecture distinguished between VFP and NEON floating-point support. NEON was not fully IEEE 754 compliant, and there were instructions that VFP supported which NEON did not. These issues have been resolved with ARMv8.

NEON sports a combined 64- and 128-bit SIMD instruction set and shares the same floating-point registers as used in VFP. Some devices, such as the Cortex-A8 and Cortex-A9 lines, support 128-bit vectors but operate on 64 bits at a time. Newer processors such as the Cortex-A15 can operate on 128 bits at a time.

NEON remains an optional part of the ARM architecture. However, NEON is included in all Cortex-A8 devices.

SVE

The Scalable Vector Extension (SVE) is the next-generation ARM SIMD instruction set. Currently it is only targeting ARMv8-A and the aarch64 ISA.

Compiler Options

Now that we have a high level understanding of ARM floating-point technologies, let's take a look at the compiler options we can use. I will be providing information relevant to the GNU and clang toolchains. For more information on the ARM compiler options, please see this reference documentation.

Let's dive into the two major compilation options: -mfloat-abi and -mfpu.

float-abi

The -mfloat-abi=<name> option is used to select which ARM ABI is used. This option also controls whether floating-point instructions may be used.

Here are your float-abi options:

  • soft: full software floating-point support
  • softfp: Allows use of floating-point instructions but maintains compatibility with the soft-float ABI
  • hard: Uses floating-point instructions and the floating-point ABI.

Each target architecture has a default value which is used if no option is supplied.

Note well: the two ARM ABIs (hard-float and soft-float) are not link-compatible. Your entire program must be compiled using the same ABIs. If a pre-compiled library is not supplied with your target floating-point ABI, you will need to recompile it for your own purposes.
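Because an ABI mismatch usually only shows up at link time, it can help to make each source file state which ABI it was built with. The following is a minimal sketch, assuming the predefined macros that GCC and clang provide for 32-bit ARM targets (__arm__, __ARM_PCS_VFP, __SOFTFP__). The EXPECT_HARD_FLOAT macro and the float_abi_name function are hypothetical names introduced for illustration.

```c
/* Optional guard: define the hypothetical EXPECT_HARD_FLOAT macro in a
 * project that must be built hard-float, and the build stops early if a
 * file is compiled with the soft-float ABI by mistake. */
#if defined(__arm__) && defined(EXPECT_HARD_FLOAT) && !defined(__ARM_PCS_VFP)
#error "This project expects -mfloat-abi=hard"
#endif

/* Returns a readable name for the float ABI this file was compiled with.
 * On anything other than 32-bit ARM it simply says so. */
const char *float_abi_name(void)
{
#if !defined(__arm__)
    return "not a 32-bit ARM target";
#elif defined(__ARM_PCS_VFP)
    return "hard";   /* FPU instructions, floating-point ABI */
#elif defined(__SOFTFP__)
    return "soft";   /* no FPU instructions, soft-float ABI */
#else
    return "softfp"; /* FPU instructions, soft-float ABI */
#endif
}
```

Printing the result of float_abi_name() from each library during bring-up is a cheap way to confirm that everything was built with the same setting.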

soft

The soft option enables full software floating-point support. The compiler will not generate FPU instructions in soft mode. Instead, the compiler generates library calls to handle floating-point operations. The compiler also generates function prologue and epilogue code to pass floating-point arguments (float, double) in integer registers (one for float, two for double).
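To make the library-call behavior concrete, here is a small sketch. The function names are hypothetical. The helper names in the comments (__aeabi_fadd, __aeabi_dmul) are the ARM EABI runtime helpers I would expect a soft-float build to call, but the exact calls depend on your toolchain, so check the disassembly on your own target.

```c
/* With -mfloat-abi=soft the compiler cannot emit an FPU add here.
 * On ARM EABI targets it typically emits a call to the runtime helper
 * __aeabi_fadd, with a and b arriving in integer registers. */
float add_floats(float a, float b)
{
    return a + b;
}

/* A double occupies a pair of integer registers under the soft-float ABI.
 * The multiply typically becomes a call to __aeabi_dmul. */
double scale_double(double x, double k)
{
    return x * k;
}
```

The C source is identical for every float-abi setting. Only the generated code changes, which is why the option can be switched without touching application code.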

When using the soft option, the -mfpu flag is ignored.

softfp

The softfp option is a hybrid between hard and soft. The compiler is allowed to generate hardware floating-point instructions, but it still uses the soft-float ABI. As with soft, the compiler generates code to pass floating-point arguments in integer registers. Depending on the chosen FPU (-mfpu), the compiler can choose when to use emulated or hardware floating-point instructions.

Since both soft and softfp use the same soft-float ABI, code built with either option can be linked together. However, when copying data from integer to floating-point registers, a pipeline stall is incurred for every copy. This additional overhead can impact the performance of your application, since data is being copied back-and-forth from the FPU registers when using floating-point arguments.

hard

The hard option enables full hardware floating-point support. The compiler generates floating-point instructions and uses the floating-point ABI. Floating-point function arguments are passed directly into FPU registers. Since there are no function prologue or epilogue requirements, no pipeline stalls are incurred with floating-point arguments. The hard float option will provide you with the highest performance, but does limit your compiled binary to the selected FPU.

When using the hard option, you must define an FPU using -mfpu.

fpu

When using the hard or softfp float-abi, you should specify the FPU type using the -mfpu compiler flag. This flag specifies what floating-point hardware (or emulation) is available on your target architecture. When using the soft-float ABI, fpu determines the format of the floating-point values.

The -mfpu=<name> option supports the following FPU types: vfp, vfpv3, vfpv3-fp16, vfpv3-d16, vfpv3-d16-fp16, vfpv3xd, vfpv3xd-fp16, neon, neon-fp16, vfpv4, vfpv4-d16, fpv4-sp-d16, neon-vfpv4, fp-armv8, neon-fp-armv8, and crypto-neon-fp-armv8.

Each of the FPU options corresponds to the floating-point architectures described above, and some options represent supersets. If you don't care about the specific VFP type, you can select supersets (vfp, neon). You can also generalize VFP versions as supersets (vfpv3, vfpv4).
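If you want code to confirm what the selected -mfpu value actually provides, the ACLE feature macro __ARM_FP can help. This is a sketch under the assumption that your compiler defines __ARM_FP as the ACLE describes (a bitmask where 0x4 means single precision and 0x8 means double precision in hardware). FPU_WIDTH_MASK and the two functions are hypothetical names.

```c
/* __ARM_FP is undefined when no FPU instructions are available, and on
 * non-ARM targets. Treat that case as "no hardware floating point". */
#if defined(__ARM_FP)
#define FPU_WIDTH_MASK (__ARM_FP)
#else
#define FPU_WIDTH_MASK 0
#endif

/* Bit 2 (0x4): single-precision operations are done in hardware. */
int fpu_has_single(void)
{
    return (FPU_WIDTH_MASK & 0x4) != 0;
}

/* Bit 3 (0x8): double-precision operations are done in hardware. */
int fpu_has_double(void)
{
    return (FPU_WIDTH_MASK & 0x8) != 0;
}
```

With a single-precision FPU such as fpv4-sp-d16, I would expect fpu_has_single() to report 1 and fpu_has_double() to report 0, which means double arithmetic still goes through library calls.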

fp16-format

The -mfp16-format=<name> option allows you to specify the format of the half-precision floating-point type (__fp16). Valid options are none, ieee, and alternative. The default option is none, meaning __fp16 is not defined.
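Here is a minimal sketch of using __fp16 as a storage type. It assumes the compiler defines the ACLE macros __ARM_FP16_FORMAT_IEEE or __ARM_FP16_FORMAT_ALTERNATIVE when a half-precision format has been selected. The sample_t type and the average function are hypothetical, and the float fallback exists only so that the example still builds where __fp16 is unavailable.

```c
#include <stddef.h>

#if defined(__ARM_FP16_FORMAT_IEEE) || defined(__ARM_FP16_FORMAT_ALTERNATIVE)
typedef __fp16 sample_t; /* 16-bit storage */
#else
typedef float sample_t;  /* fallback for builds without __fp16 */
#endif

/* Samples are stored compactly and promoted to float for the arithmetic,
 * since __fp16 is used here purely as a storage format. */
float average(const sample_t *samples, size_t count)
{
    float sum = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        sum += (float)samples[i];
    }
    return count ? sum / (float)count : 0.0f;
}
```

The benefit is memory footprint: a buffer of sample_t takes half the space of a float buffer when __fp16 is in use, at the cost of reduced range and precision.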

For more information, see the GNU Half-Precision Floating Point documentation.

Performance Impacts

In general, applications relying on floating-point operations will benefit from using the hard-float ABI.

Debian has some notes on VFP performance improvements and cites a proof-of-concept Ubuntu build which noted significant performance improvements with floating-point heavy libraries.

Further Reading

-Werror is Not Your Friend

I have never quite understood the obsession with the -Werror compiler flag. I regularly come across projects with the flag enabled, and it's not uncommon for me to fend off rabid developers who want the flag enabled in projects I work on.

In case you have been living under a rock, -Werror is a compiler flag that causes all warnings to be treated as build errors. On the surface, the stated motivation behind enabling -Werror is benevolent. Developers who enable -Werror are making a statement: we care about our code base, and we won't accept warnings here. I also maintain a 0-warning policy for my projects, and I hate when developers ignore warnings. I understand the motivation for enabling the -Werror flag.

However, from the project maintenance perspective, -Werror is not your friend. I am always frustrated when I find a project with -Werror, because inevitably my first clean build of the project fails due to a spectacular mess of warnings. If I made no changes to the source code, why the hell is it not compiling?

-Werror creates a project dependency on a specific compiler version. Even worse, this toolchain dependency is often not recognized by the development team and is therefore not noted anywhere. I need to scour the web to find the secret dependency link, or I need to start hacking up the project to get the build to finish. Is that really the experience you want your consumers to have when they use your project?

-Werror lays the groundwork for maintenance headaches. When a new compiler version is released, new warnings are added or other risk areas are discovered. These new warnings will now cause your previously working build to fail, often for no good reason. Since many developers have the "never update" mindset, these new warnings go unnoticed until someone on the team eventually updates. These failures are often localized rather than systematic, so the team as a whole tends to overlook the effect of -Werror:

  • Your build server doesn't work since the server software was updated, causing your build guru to spend time investigating and rolling back software
  • A single developer updated and now must assume the burden of fixing new warnings before resuming the actual work
  • Your new hire can't get your software compiling, and time is wasted finding out that it's the toolchain version that matters

Furthermore, there are lots of warnings that don't need to cause build failures, such as -Wunknown-pragmas. I am in the habit of using #pragma mark in my projects to provide nicer editor interactions. If I use an older GNU toolchain then #pragma mark is unrecognized and generates a warning - but it doesn't affect my final binary at all!

To get around issues like that, now you need to start disabling individual errors that you don't want: -Wno-error=unknown-pragmas. You have to maintain these settings for all new benign warnings that get added.

I don't say all of this to support tolerating warnings in your project. In my projects, I fix all warnings and continually drive the teams I work with to get to 0 warnings. My Jenkins builds all have a warning graph so I can see the warning trend over time, when they are introduced, and who regularly introduces them.

Rather than having all warnings turned into errors, I think that warnings that lead to major problems or are often ignored should be selectively promoted into errors. You can do this by specifying -Werror=warning-name, which will cause that specific warning (e.g. unknown-pragmas) to generate an error if it is encountered.

For example, a warning that I promote to an error is -Wreturn-type. This warning seems innocuous on the surface, but you can get into a dangerous situation easily:

Missing return statement in function with return expected
aws.c:158:1: warning: control reaches end of non-void function [-Wreturn-type]

If your function should return a value but does not, your compiler is going to start picking up random garbage as the return value rather than the value you intended, leading to weird behavior and tricky bugs. Definitely worth being an error!
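As a sketch of what this looks like in code, the example below promotes -Wreturn-type to an error for one region of a file using the diagnostic pragmas that GCC and clang both understand. The sign function is a hypothetical example. The buggy shape is kept in a comment so that the file still compiles.

```c
/* Promote -Wreturn-type to an error between push and pop.
 * The command-line equivalent for a whole build is -Werror=return-type. */
#pragma GCC diagnostic push
#pragma GCC diagnostic error "-Wreturn-type"

/* Buggy shape that the promoted warning would reject:
 *
 *     int sign(int x)
 *     {
 *         if (x > 0) return 1;
 *         if (x < 0) return -1;
 *     }   // x == 0 falls off the end, so the caller reads garbage
 */

/* Fixed: every path returns a value. */
int sign(int x)
{
    if (x > 0) {
        return 1;
    }
    if (x < 0) {
        return -1;
    }
    return 0;
}

#pragma GCC diagnostic pop
```

The pragma form is handy when you cannot change the build flags of a project but still want one dangerous warning to stop the build in your own files.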

If you're still convinced that you need to use -Werror, I suggest that you wire up a way to turn this behavior off, such as a make variable. Then, to disable it, developers can simply run:

$ make all WARNINGS_AS_ERRORS=n

This allows you to keep -Werror enabled by default but also saves developers from having to hack up your project if they are using a newer/older toolchain version with different warnings.

Before you enable -Werror on your projects, make sure that you really want to sign up for the maintenance headaches that come with it. You can utilize better strategies instead:

  • Promote specific warnings to errors
  • Track and drive down warning count using build metrics and developer feedback
  • Locally/globally disable benign warnings that you don't need to worry about in your project (e.g. -Wunknown-pragmas)

If you must enable -Werror, at least provide an easy method to disable the -Werror behavior.

compiler-rt

Updated: 20190426

As we're exploring bringing up a C/C++ runtime on our system, I'd like to share a very helpful resource for those using clang/llvm: compiler-rt.

Compiler-rt is an LLVM project that provides implementations of various builtin functions for a variety of architectures. This saves us a lot of heavy lifting when bringing up a new platform, as we can link compiler-rt instead of re-implementing these functions.

While most useful as a complete library, compiler-rt is also a useful source code resource if you need to implement these builtins with a different toolchain. Simply import the required builtin source into your project.

I'll let the compiler-rt project describe the builtins they provide:

builtins - a simple library that provides an implementation of the low-level target-specific hooks required by code generation and other runtime components. For example, when compiling for a 32-bit target, converting a double to a 64-bit unsigned integer is compiling into a runtime call to the __fixunsdfdi function. The builtins library provides optimized implementations of this and other low-level routines, either in target-independent C form, or as a heavily-optimized assembly.

builtins provides full support for the libgcc interfaces on supported targets and high performance hand tuned implementations of commonly used functions like __floatundidf in assembly that are dramatically faster than the libgcc implementations. It should be very easy to bring builtins to support a new target by adding the new routines needed by that target.
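To see where these builtins come from in ordinary code, consider the two conversions below. This is a sketch with hypothetical function names. On a 32-bit target without native instructions for these conversions, I would expect the casts to be lowered to calls to __fixunsdfdi and __floatundidf (or to target-specific aliases of them), which is exactly what compiler-rt supplies.

```c
#include <stdint.h>

/* double -> 64-bit unsigned integer.
 * Typically a runtime call to __fixunsdfdi on 32-bit targets. */
uint64_t double_to_u64(double value)
{
    return (uint64_t)value;
}

/* 64-bit unsigned integer -> double.
 * Typically a runtime call to __floatundidf on 32-bit targets. */
double u64_to_double(uint64_t value)
{
    return (double)value;
}
```

If you link a program containing code like this without a builtins library, the undefined-symbol errors from the linker tell you which routines your platform still needs.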

Table of Contents:

  1. Prerequisites
  2. Getting compiler-rt
  3. Building compiler-rt
  4. Using compiler-rt in Your Project
  5. Embedded Artistry compiler-rt
  6. Further Reading
  7. Change Log

Prerequisites

You will need the llvm-config binary on your platform. This binary is provided when you install llvm.

If you're using OSX, note that Apple does not provide llvm-config with Xcode, so you will need to install mainline llvm to get this binary. See my notes on installing clang/llvm on OSX.

Getting compiler-rt

You can check out the compiler-rt source with svn:

svn co http://llvm.org/svn/llvm-project/compiler-rt/trunk compiler-rt

If you prefer git, check out the github mirror:

git clone git@github.com:llvm-mirror/compiler-rt.git

I'll leave the folder structure descriptions to the compiler-rt team:

include/ contains headers that can be included in user programs (for example, users may directly call certain function from sanitizer runtimes).
lib/ contains libraries implementations.
lib/builtins is a generic portable implementation of builtins routines.
lib/builtins/(arch) has optimized versions of some routines for the supported architectures.
test/ contains test suites for compiler-rt runtimes.

The lib/builtins/ folder contains the source for the various builtin functions. You can use these items piecemeal in your repository. This is useful if you just need to port specific functions or don't want to deal with installing clang or compiling compiler-rt.

Building compiler-rt

For those who are interested in the compiler-rt builtins library, let's continue our journey.

Once you have llvm-config on your system, you can build compiler-rt with the following commands:

$ mkdir build
$ cd build
$ cmake ../compiler-rt -DLLVM_CONFIG_PATH=/path/to/llvm-config
$ make

The build directory is important - it's where cmake will place the resulting files.

For those following with homebrew, you could use this command:

$ cmake ../compiler-rt -DLLVM_CONFIG_PATH=$(brew --prefix llvm)/bin/llvm-config
$ make

If you don't have llvm-config, you can still build the project. For cross-compiling, follow the instructions here to configure your CMake build directory.

By default, make will build everything. If you want to build a limited subset, you can run make help and pick the specific items you want to build.

If you want to install libraries, run this additional command:

$ make install

I usually do not run the make install step.

Finding the Right Library

After your build is completed, change to the lib/builtin directory in your build folder. There you will likely see a massive list of files. Here's my example output from compiling with Apple Clang on OSX:

libclang_rt.builtins_arm64_ios.a
libclang_rt.builtins_armv7_ios.a
libclang_rt.builtins_armv7k_ios.a
libclang_rt.builtins_armv7s_ios.a
libclang_rt.builtins_i386_10.4.a
libclang_rt.builtins_i386_iossim.a
libclang_rt.builtins_i386_osx.a
libclang_rt.builtins_x86_64_10.4.a
libclang_rt.builtins_x86_64_iossim.a
libclang_rt.builtins_x86_64_osx.a
libclang_rt.builtins_x86_64h_osx.a
libclang_rt.cc_kext_arm64_ios.a
libclang_rt.cc_kext_armv7_ios.a
libclang_rt.cc_kext_armv7k_ios.a
libclang_rt.cc_kext_armv7s_ios.a
libclang_rt.cc_kext_i386_osx.a
libclang_rt.cc_kext_x86_64_osx.a
libclang_rt.cc_kext_x86_64h_osx.a
libclang_rt.hard_pic_armv7_macho_embedded.a
libclang_rt.hard_pic_armv7em_macho_embedded.a
libclang_rt.hard_pic_i386_macho_embedded.a
libclang_rt.hard_pic_x86_64_macho_embedded.a
libclang_rt.hard_static_armv7_macho_embedded.a
libclang_rt.hard_static_armv7em_macho_embedded.a
libclang_rt.hard_static_i386_macho_embedded.a
libclang_rt.hard_static_x86_64_macho_embedded.a
libclang_rt.soft_pic_armv6m_macho_embedded.a
libclang_rt.soft_pic_armv7_macho_embedded.a
libclang_rt.soft_pic_armv7em_macho_embedded.a
libclang_rt.soft_pic_armv7m_macho_embedded.a
libclang_rt.soft_static_armv6m_macho_embedded.a
libclang_rt.soft_static_armv7_macho_embedded.a
libclang_rt.soft_static_armv7em_macho_embedded.a
libclang_rt.soft_static_armv7m_macho_embedded.a

You likely only need one of these for your system. Which one do you pick?

Here's a quick decoder:

  • hard vs soft: this is floating point. Is your platform configured to support hard or soft floating point operations?
  • static vs pic: is the code compiled as a static library, or with position-independent code (PIC)?
  • i386 vs armv7x: this will be dependent upon your platform's processor. You need to pick the instruction set to match.
  • The last portion of the name is the library format.

Generally, I end up picking this library for my purposes if I am compiling and linking on OSX:

libclang_rt.hard_pic_armv7_macho_embedded.a

Note the "macho embedded" format - this requires special parsing to use with your embedded system. We will investigate MACHO files further in a future article.

Using compiler-rt in your project

Since compiler-rt builtin libraries do not regularly need updates, I recommend pre-compiling compiler-rt into a library file that can be linked against in your project. It may be worth it to build compiler-rt on your build machine so you have a known source to retrieve updates from.

Once you have built compiler-rt, you can copy the desired library to your project's repository.

You will need to add the -L linker flag to get the location into the library search path. The -l linker flag can be used to include the library itself: -lcompiler_rt.

If you built compiler-rt on OSX, you ended up with a bunch of macho libraries. The macho format will require additional handling that will be described in a future article.

Embedded Artistry compiler-rt

We use Meson for our projects. We have a compiler-rt project that builds with Meson which will build for your native system. Cross-compilation for ARM is also supported using cross-files.

The Embedded Artistry compiler-rt produces static libraries only, because that's what we use on our embedded systems.

To build for your host machine, simply run make (after installing Meson):

$ make

For cross-compilation, you will need to supply a cross-compilation file when creating the build results directory. Some samples are provided in the build/cross folder, and you can create your own as needed.

meson buildresults --cross-file=build/cross/gcc/arm/nrf52840.txt

Change into the buildresults directory and build:

$ cd buildresults
$ ninja

In both cases, the static libraries will be in buildresults. If you enabled a cross-compilation build, both a native library and a cross-compiled library will be present.

Further Reading

Change Log

  • 20190426:
    • Added notes for cross-compiling on ARM
    • Added Table of Contents
    • Added notes on Embedded Artistry compiler-rt repo