Build System Rules and Algorithms

Following up on last week's article, "Recursive Make Considered Harmful", this week I present Mike Shal's "Build System Rules and Algorithms". Shal builds upon the recursive make paper, providing more insight into the limitations of make and offering guidelines for improving build systems. He classifies build systems into two categories, Alpha and Beta, and points out that our basic build system strategies have remained essentially fixed since the creation of make over 30 years ago - it's time for an improvement!

As I said last week: most of us build software hundreds of times a day. This fundamental process needs to be fast and correct - any limitation or error adds extra overhead to our development.

Read "Build System Rules and Algorithms" here.

My Highlights

Starting out with a short blurb about why we use a build system:

A build system is commonly used during software development to reduce the time spent in the Edit-Compile-Test cycle. The developer will typically work with a fully-built project, then edit a few source files out of many and want to see the result of those changes in the final program. The basic theory behind the build system is that it is much faster to update only the parts of the project that require modification (due to the developer's initial edits) than it is to re-build the entire project.

Emphasizing why improving the build process is important:

A crucial observation here is that you have to run through the loop again and again to write a program, and so it follows that the faster the Edit-Compile-Test loop, the more productive you will be, down to a natural limit of instantaneous compiles.

However, the core algorithms used by these build systems haven't really evolved since the original directed acyclic graph (DAG) processing ones introduced by make 30 years ago.

Further, RMCH and derivative build systems do not address build correctness as it pertains to file renaming or deletion, which are fairly common and certainly useful operations during software development. While a global dependency view allows for correctness in the example provided in RMCH, that correctness is quickly broken if one of the C files is renamed, or the name of the target executable is changed. The cause of this brokenness will also be examined in this paper. Although this is a seemingly disparate problem from the scalability issue, the solution is actually fundamentally intertwined.

Shal launches into some definitions in his paper. I have copied the interesting ones here for you to keep in mind as you continue:

a build system is any piece of software that provides facilities for constructing and parsing the DAG which represents the dependencies among files in a software project. This paper is not focused on the aspect of building software related to configuration options as would typically be handled by software such as autoconf or kconfig, even though some of these features have been introduced into build systems billed as replacements for make (which itself does not have such features built-in). When the build system is invoked, it will input the DAG and current filesystem state, and output a new DAG and/or files that represent the output of the build.

The global DAG is the entire dependency graph of the software system. This is the DAG used in a non-recursive make setup, or in any global "top-down" build system as is common in a make replacement. Figure 1 is an example of a global DAG, and is the same example as was used to illustrate the problem in RMCH. Note that the arrows in this graph are actually reversed from the graph in RMCH. Although we will see later in this paper that the arrows actually should go "up", in make's view the arrows go "down". Essentially RMCH displayed how the DAG should look, but was mis-representative of how the DAG is processed in make.

An incomplete DAG is any subgraph of the global DAG that does not have enough information to update a system while maintaining consistency. This is the kind of DAG used by an individual make process in a recursive make setup. The use of incomplete DAGs results in the problems described in RMCH. Figures 2 and 3 show two different incomplete DAGs, as might be seen by the two separate Makefiles in a recursive make setup. The greyed-out nodes and links are not part of the DAG - they are only shown to make it obvious what part of the global DAG is missing. An example of one of the issues that comes up from using recursive make is the fact that the main.o->parse.h link is not present in the second incomplete DAG. As described in RMCH, it is possible for make to construct parse.o from a new version of parse.h, but link in a main.o which was built from an old version of parse.h.

A partial DAG is any subgraph of the global DAG that (unlike an incomplete DAG) has enough information to update a system while maintaining consistency. Further, a minimal partial DAG is the smallest partial DAG in the set of partial DAGs that include a particular subset of root nodes. Figures 4 and 5 show the partial DAGs for the root nodes main.c and parse.y, respectively. Note that both of these examples are minimal partial DAGs. A partial DAG that includes both source files would actually consist of the entire global DAG.
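To make these definitions concrete, here is a small Python sketch (my own illustration, not code from the paper) that computes the minimal partial DAG for a set of root nodes by following the "up" arrows of the RMCH example:

    # Global DAG from the RMCH example, stored with the arrows pointing
    # "up": each file maps to the files built from it.
    global_dag = {
        "main.c":  ["main.o"],
        "parse.y": ["parse.c", "parse.h"],
        "parse.c": ["parse.o"],
        "parse.h": ["main.o", "parse.o"],
        "main.o":  ["prog"],
        "parse.o": ["prog"],
        "prog":    [],
    }

    def minimal_partial_dag(dag, roots):
        """Collect every node reachable 'upward' from the given roots."""
        seen, stack = set(), list(roots)
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(dag[node])
        return seen

    print(minimal_partial_dag(global_dag, ["main.c"]))
    # e.g. {'main.c', 'main.o', 'prog'} -- parse.y and friends are skipped

Note how the reachable set from parse.y pulls in main.o (via parse.h) and prog, matching the paper's Figure 5, while the union over both source files covers the entire global DAG.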

An Alpha build system is any build system that uses a global view of dependencies in order to update a software project. Specifically, this includes make, as well as its aforementioned derivatives like SCons, ANT, and waf, among others. Note that make is considered an Alpha build system independent of whether a non-recursive or recursive style setup is used. In a non-recursive setup, the global view of the entire project is standard. In a recursive setup, one can imagine it as a set of independent Alpha build system executions, each of which has its own "global" view of an incomplete DAG.

A Beta build system is any build system that uses a partial DAG to update a software project. In particular, a Beta build system is marked by its use of the Beta Update Algorithm, which is described later in this paper.

After the definitions, Shal lays out his "Build System Rules":

  1. Scalability: Given an already up-to-date system, applying a change to the system and building it should take time that is proportionate to the changes required to bring the system up to date.
  2. Correctness: Given an up-to-date system in state X, applying change A and updating the system may result in state Y. Undoing change A and performing an update must return the system to state X.
  3. Usability: There must not be different ways of building the system that the user must remember in order to accommodate rules 1) or 2).

Why is rule 1 necessary?

Rule 1) is necessary because updating an already built system is the most common operation done during development, and any time spent waiting for the program to update is time wasted.
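Rule 2 reads almost like an executable property, which is a useful way to internalize it. A rough sketch of such a check (entirely my own illustration; snapshot() and the change callbacks are placeholders):

    import subprocess

    def snapshot(tree):
        """Placeholder: map every file under `tree` to a content hash."""
        raise NotImplementedError

    def obeys_rule_2(tree, apply_change, undo_change, build=("make",)):
        state_x = snapshot(tree)               # up-to-date system, state X
        apply_change()                          # apply change A
        subprocess.run(build, check=True)       # update -> some state Y
        undo_change()                           # undo change A
        subprocess.run(build, check=True)       # update again
        return snapshot(tree) == state_x        # must be back in state X

The stale "program" example later in the paper is exactly a case where make fails this check.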

Some quotes I just loved:

What good is a renaming capability in version control if the build system can't handle it?

It should be enough to tell the build system, "I'm done editing -- make it work."

There needs to be an "easy answer". There should only be one command to update the system, and you should be able to run it anytime, regardless of what you changed or how you changed it. Let the build system decide if only one subdirectory needs to be updated, or which statically linked binaries need to be built, or which stale files need to be removed. It should not be your responsibility to keep track of the whole system.

An overview of the Alpha build system algorithm:

  1. Build DAG: Construct a global DAG from dependency files.
  2. Update: Walk through the DAG to see which targets are out of date and run the commands to update them.

The various build systems vary somewhat in their specific implementation. For example, make uses timestamps in step 2 while SCons uses file signatures, and ANT uses XML files instead of Makefiles in step 1. However, the basic algorithm is still the same. Both steps 1 and 2 in the algorithm presented here are linear time operations. It is actually more useful to investigate step 2 first to see how Alpha build systems determine which targets are out of date, and then use information from that to come back and re-visit step 1.
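A rough Python sketch of those two steps once the DAG is in hand (my own illustration, not code from the paper; timestamps stand in for make's check, where SCons would compare file signatures instead):

    import os

    def alpha_update(dag, command_for):
        """dag maps each target to its dependencies. Every node is visited
        on every build -- O(nodes + edges) even when nothing changed."""
        def mtime(path):
            try:
                return os.path.getmtime(path)
            except FileNotFoundError:
                return float("-inf")        # missing target: out of date

        done = set()

        def update(target):
            if target in done:
                return
            done.add(target)
            deps = dag.get(target, [])
            for dep in deps:
                update(dep)                 # depth-first over the global DAG
            if deps and mtime(target) < max(mtime(d) for d in deps):
                command_for[target]()       # run the command to rebuild it

        for node in dag:
            update(node)

The key point is the last loop: correctness comes from walking everything, which is precisely what makes the best case no faster than the worst.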

Some downsides to Alpha build systems:

Further, the timestamp() function is called for every file in the build tree. Although stat() is generally very fast, it still requires non-zero time. For thousands or millions of files it can no longer be considered an insignificant amount of time to read all of those directory entries.

The root of the problem is that an Alpha build system has no a priori information about what file modifications have taken place. The only way they can find out what has changed is by looking at each file in turn in order to compare timestamps or file signatures. Before looking into alternatives, let us consider why a file has changed in the first place. A file changes on the system because the developer performs some action -- for example, updating from source control, saving the file in an editor, creating or deleting a new file in the filesystem, etc. All of these actions could necessitate an update in the build system. However, all of these actions are 1) performed in a userspace program on a file or set of files; and 2) the file modifications end up going through the filesystem layer of the kernel. Let us assume for the time being that either the userspace program or the kernel saves a list of these changes for the build system. The list contains information like "files X and Y have been modified; file Z was removed".
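The paper leaves the source of that list open (an editor plugin, the VCS, a file monitor, the kernel). As a minimal illustration of the idea - all names here are my own invention - suppose some watcher appends one "<op> <path>" line per event to a journal that the build system consumes:

    # Hypothetical change journal, appended to by whatever watches the
    # filesystem (inotify, an editor plugin, a FUSE layer, etc.).
    def read_change_list(journal_path=".build/changes"):
        modified, removed = set(), set()
        with open(journal_path) as journal:
            for line in journal:
                op, path = line.rstrip("\n").split(" ", 1)
                if op == "remove":
                    removed.add(path)
                    modified.discard(path)
                else:                       # "modify" or "create"
                    modified.add(path)
                    removed.discard(path)
        return modified, removed

    # A journal containing "modify main.c\nremove program\n" yields
    # ({'main.c'}, {'program'}) with no scan of the build tree at all.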

Even with this a priori information, most Alpha build systems cannot make efficient use of it to perform the first step (constructing the DAG) in less than linear time. The issue here is one of dependency storage -- that is, the representation of the DAG on the disk. For example, in make this takes the form of several Makefiles along with separate dependency files that were generated from "gcc -MMD" (historically, makedepend) or something similar.
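For instance, "gcc -MMD" emits one small make fragment per compiled file (e.g. "main.o: main.c parse.h"). A sketch of why this storage is a problem - reconstructing the DAG means re-reading every fragment, even when only one file changed (my own illustration):

    import glob

    # Each fragment is a make rule like:  main.o: main.c parse.h
    def load_alpha_dag(pattern="**/*.d"):
        """Rebuild the DAG by parsing every 'target: deps...' fragment.
        All fragments are read on every build: linear in project size."""
        dag = {}
        for dep_file in glob.glob(pattern, recursive=True):
            with open(dep_file) as f:
                text = f.read().replace("\\\n", " ")  # join continuations
            for rule in text.split("\n"):
                if ":" in rule:
                    target, deps = rule.split(":", 1)
                    dag[target.strip()] = deps.split()
        return dag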

This is problematic -- the user may run some test cases which try to execute "program" and succeed because the file is still there. Pleased that the test cases pass, the developer checks in the changes. A new developer may come along and check out the project, only to find that the same test fails because "program" is never generated. This scenario is caused by the fact that the build system was supposed to generate system state B, but the actual result was state C. Ideally, the build system would remove the old target and go to state B. With the old file gone, the first developer would find the test case that tries to run "program" fails. This test could then be fixed before checking in the changes.

  • The root cause of this issue is that the user-generated dependency file (the Makefile) is the only location of the information that "program" was originally generated by the build system. When the user removes that bit of information from the Makefile, the build system therefore has no way of knowing that "program" needs to be removed in order to attain the correct system state. In short, make has no data persistence algorithm.
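What a data persistence algorithm could look like is straightforward to sketch: have the build system record every file it generates, then on the next run delete any recorded output that the DAG no longer produces. Everything below (the .build/outputs.json record in particular) is my own invention, not make's or tup's actual behavior:

    import os, json

    OUTPUTS = ".build/outputs.json"         # hypothetical persistent record

    def remove_stale_outputs(current_targets):
        """Delete previously generated files the DAG no longer produces,
        e.g. 'program' after its rule was dropped from the Makefile."""
        try:
            with open(OUTPUTS) as f:
                previous = set(json.load(f))
        except FileNotFoundError:
            previous = set()
        for stale in previous - set(current_targets):
            if os.path.exists(stale):
                os.remove(stale)            # state C -> intended state B
        os.makedirs(".build", exist_ok=True)
        with open(OUTPUTS, "w") as f:
            json.dump(sorted(current_targets), f)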

Now for the Beta build system algorithm:

  1. Input file change list: Required a priori information for Beta build systems.
  2. Build partial DAG: Build a partial DAG based on the file change list and the complete (global) disk-based DAG.
  3. Update: Execute commands necessary to bring all the nodes in the partial DAG up-to-date.

This build system algorithm may seem counter to the conclusions reached in Recursive Make Considered Harmful. Specifically, that paper warns that build errors can occur when the build system uses an incomplete DAG. The distinguishing factor in a Beta build system is that the partial DAG built in step 2 is still correct. That is, all nodes that must be visited will still be visited, even though much of the on-disk DAG can be ignored for small updates.

Topological sort algorithms are linear with respect to the size of the DAG (nodes + edges), and then the foreach loop over the file_list also executes in O(n) time. That is, the update algorithm is linear with respect to the size of the partial DAG.
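To make that concrete - the "foreach loop over the file_list" and the topological-sort pass - here is a minimal Python sketch of steps 2 and 3, assuming the on-disk DAG is available as a map from each file to the files built from it (graphlib is in the standard library):

    from graphlib import TopologicalSorter

    def beta_update(disk_dag, file_list, command_for):
        """disk_dag maps each file to the files built from it; only the
        partial DAG grown from the changed files is ever visited."""
        partial, stack = set(), list(file_list)   # step 2: grow partial DAG
        while stack:
            node = stack.pop()
            if node not in partial:
                partial.add(node)
                stack.extend(disk_dag.get(node, []))

        # step 3: update in dependency order; TopologicalSorter takes a
        # node -> predecessors map, so invert the edges within the cut.
        preds = {n: {p for p in partial if n in disk_dag.get(p, [])}
                 for n in partial}
        for node in TopologicalSorter(preds).static_order():
            if node not in file_list:             # roots are source files
                command_for[node]()               # rebuild derived file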

For the typical case of editing a few files, we can assume that the input file list is a constant O(1) size with respect to the project size. That is, regardless of how big the project is, the developer is focused on a small part of it.

One of the failings with some Alpha build systems is that there was no DAG that persisted across build invocations. For Beta build systems, it is necessary to have a specific DAG structure with a good search algorithm in order to satisfy the constraints imposed by the update procedures. This DAG structure must be separate from the "userspace" DAG elements that a developer designs in the build-system (for example, the notion that a .c file becomes a .o file is written by the developer, but the .h file that is included may be added to the DAG automatically when the file is compiled). This requirement of DAG persistence helps, but by no means solves the correctness rule.
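One natural realization of that machine-owned DAG is an indexed on-disk database, where the index is what makes lookups, additions, and deletions logarithmic rather than linear. A stripped-down sketch with an assumed schema of my own (not any real tool's):

    import os, sqlite3

    def open_dag_store(path=".build/dag.db"):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        db = sqlite3.connect(path)
        db.executescript("""
            CREATE TABLE IF NOT EXISTS node (id INTEGER PRIMARY KEY,
                                             name TEXT UNIQUE);
            CREATE TABLE IF NOT EXISTS edge (from_id INTEGER, to_id INTEGER,
                                             PRIMARY KEY (from_id, to_id));
        """)
        return db

    def outputs_of(db, name):
        """Indexed lookup: which files are built from `name`?
        Scanner-discovered dependencies (an included .h file, say) are
        just extra edge rows, kept apart from the user's build rules."""
        return [n for (n,) in db.execute(
            """SELECT n2.name FROM node AS n1
               JOIN edge ON edge.from_id = n1.id
               JOIN node AS n2 ON n2.id = edge.to_id
               WHERE n1.name = ?""", (name,))]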

Some notes on an example Beta build system, tup:

Aside from the Beta update algorithm, I do not believe tup's design choices described here are necessarily ideal -- there may be better alternatives. It does, however, use per-directory build fragments, which, as mentioned previously, allows tup to handle file additions and deletions (and, by extension, renaming) in logarithmic time. This can serve as a baseline to measure other possible implementations of Beta build systems. Tup also shows that satisfying all three rules is possible in practice.

Shal ends by stating that the following factors should be addressed by build systems:

  • Contrary to Recursive Make Considered Harmful, the update algorithm cannot rely on a global DAG and instead must use a partial DAG.
  • A file change list must be maintained and provided to the build system update operation.
  • Secondary storage for the graph must be used that is separate from the user-defined build input.
  • File/directory deletion and renaming must be treated as first-class operations.

More information about tup and its design choices can be found at http://gittup.org/tup/. There is more in-depth information about the performance comparison against the non-recursive make implementation, as well as a comparison to an oracle machine implementation. These comparisons are included in the source tree if you wish to perform your own tests.