Rules of Thumb

Embedded Rules of Thumb

Updated: 20190627

You may call them guidelines, heuristics, or rules of thumb. No matter, the purpose is the same: to provide a reasonable approximation of the truth. These rules of thumb can help guide your understanding of the systems you work on, focus you toward the right solutions, and highlight potential problem areas.

These are just the initial rules of thumb that I've collected over the past year. If you have any other useful rules or heuristics, please send me an email or leave a comment below.

Additional links to other rules of thumb are included in the Further Reading section.

The Rules of Thumb

  1. General
  2. Design
  3. Cost
  4. Scheduling
  5. Hardware
  6. Software Reuse
  7. Optimization
  8. Red Flags and Problem Areas
  9. Interrupts
  10. Function Point Rules of Thumb
  11. Further Reading

General

  • Move errors from run-time to compile time whenever possible
  • Programs without documentation have no value
  • Comments should never restate what the code obviously does.
  • Comments should aid maintenance by describing intention.
  • Everything in a header file should be used in at least two source files
  • Developer productivity is dramatically increased by eliminating distractions and interrupts
    • "Developers who live in cubicles probably aren't very productive. Check how they manage interruptions." (Jack Ganssle)
  • "Complexity grows exponentially; Robert Glass figures for every 25% increase in the problem's difficulty the code doubles in size. A many-million line program can assume a number of states whose size no human can grasp." (Jack Ganssle)
  • In nature, the optimum is almost always in the middle somewhere. Distrust assertions that the optimum is at an extreme point. (Akin's Laws)
  • Past experience is excellent for providing a reality check. Too much reality can doom an otherwise worthwhile design, though (Akin's Laws)

Design

  • Complex systems evolve out of simple systems that worked (John Gall)
    • "A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with a working simple system." (John Gall)
  • If you can't describe the behavior in plain English, you can't successfully describe it with code
  • Decompose complex problems into smaller sub-problems
    • If a problem can be decomposed into two or more independently solvable problems, then solve them independently first!
    • After you have implemented and tested the solutions, combine the parts into a larger operation
  • A function should perform only one conceptual task
  • Don't solve problems that don't exist
  • Solve the specific problem, not the general case
  • To design a spacecraft right takes an infinite amount of effort. This is why it's a good idea to design them to operate when some things are wrong. (Akin's Laws)
  • Design is an iterative process. The necessary number of iterations is one more than the number you have currently done. This is true at any point in time. (Akin's Laws)
  • There is never a single right solution. There are always multiple wrong ones, though. (Akin's Laws)
  • (Edison's Law) "Better" is the enemy of "good". (Akin's Laws)
  • (Shea's Law) The ability to improve a design occurs primarily at the interfaces. This is also the prime location for screwing it up. (Akin's Laws)

Cost

  • Software is expensive
    • "Study after study shows that commercial code, in all of the realities of its undocumented chaos, costs $15 to $30 per line. A lousy 1000 lines of code - and it's hard to do much in a thousand lines - has a very real cost of perhaps $30,000. The old saw 'it's only a software change' is equivalent to 'it's only a brick of gold bullion'." (Jack Ganssle)
    • "The answer is $15 to $40 per line of code. At the $40 end you can get relatively robust, well designed code suitable for industry applications. The $15 end tends to be code with skimpy design packages and skimpy testing. (In other words, some people spend only $15/line, but their code is of doubtful quality.)" (Phil Koopman)
    • "UPDATE, October 2015. It's probably more like $25-$50 per line of code now. Costs for projects outsourced to Asia have done up dramatically as wages and competition for scarce coders have increased." (Phil Koopman)
  • If you want to reduce software development costs, look at every requirements document and brutally strip out features. (Jack Ganssle)
    • Lots of features equals slow progress and expensive development (Jack Ganssle)
  • Non-recurring engineering (NRE) costs must be amortized over every product sold (see the worked example after this list)
    • Save NRE dollars by reducing features
    • Save NRE dollars by offloading software functionality into hardware components (increases BOM cost)
    • Save NRE dollars by delivering the product faster (Jack Ganssle)
  • It is easier and cheaper to completely rewrite the 5% of problematic functions than to fix the existing implementation
    • These functions cost four times as much as other functions (Barry Boehm)
    • "Perhaps we really blew it when first writing the code, but if we can identify these crummy routines, toss them out, and start over, we'll save big bucks." (Jack Ganssle)

Scheduling

  • There's never enough time to do it right, but somehow, there's always enough time to do it over. (Akin's Laws)
  • Estimating dates instead of hours guarantees a late project (Jack Ganssle)
    • "Scheduling disasters are inevitable when developers don't separate calendar time from engineering hours." (Jack Ganssle)
  • "If the schedule hallucinates a people-utilization factor of much over 50% the project will be behind proportionately." (Jack Ganssle)
    • "Some data suggests the average developer is only about 55% engaged on new product work. Other routine activities, from handling paperwork to talking about Survivor XVI, burn almost half the work week." (Jack Ganssle)
  • We often fail to anticipate the difficult areas of development
    • "Isn't it amazing how badly we estimate schedules for most projects? 80% of embedded systems are delivered late. Most pundits figure the average project consumes twice the development effort originally budgeted." (Jack Ganssle)
  • 5% of functions consume 80% of debugging time (Jack Ganssle)
    • "I've observed that most projects wallow in the debug cycle, which often accounts for half of the entire schedule. Clearly, if we can do something about those few functions that represent most of our troubles, the project will get out the door that much sooner." (Jack Ganssle)
  • Timelines grow much faster than firmware size - double the lines of code, and the delivery date increases by more than 2x (Barry Boehm)
  • "The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time." (Tom Cargill)
  • When porting old code to a new project, if more than about 25% gets modified there's not much of a schedule boost (Richard Selby)
  • Systems loaded to 90% of the processor capability require 2x development time over systems loaded at 70% or less. 95% loading triples development time. (Jack Ganssle)
    • "When only a few byte are left, even trivial features can take weeks as developers must rewrite massive sections of code to free up memory or CPU cycles." (Jack Ganssle)
  • The schedule you develop will seem like a complete work of fiction up until the time your customer fires you for not meeting it. (Akin's Laws)
  • Sometimes, the fastest way to get to the end is to throw everything out and start over. (Akin's Laws)
  • (Patton's Law of Program Planning) A good plan violently executed now is better than a perfect plan next week. (Akin's Laws)

Hardware

  • Adding hardware increases power requirements
  • Use of hardware accelerators to offload CPU-based algorithms can reduce power requirements
  • Every sensor is a temperature sensor. Some sensors measure other things as well. (Elecia White)
  • Break out nasty real-time hardware functions into independent CPUs (Jack Ganssle)
    • Handling 1000 interrupts per second from a device? Partition it to its own controller and offload all of the ISR overhead from the main processor
  • Add hardware whenever it can simplify the software (Jack Ganssle)
    • This will dramatically reduce NRE and software development costs, at a tradeoff for an increase in BOM costs.
    • Systems loaded to 90% of the processor capability require 2x development time over systems loaded at 70% or less. 95% loading triples development time. Add additional hardware to reduce loading. (Jack Ganssle)
  • (Atkin's Law of Demonstrations) When the hardware is working perfectly, the really important visitors don't show up. (Akin's Laws)

Software Reuse

  • Prefer to use existing, reviewed code that has already been re-used by others
    • e.g. use the STL instead of writing your own containers
  • Prefer simple, standard communication protocols over custom communication protocols
  • Follow the "Rule of Three": you are allowed to copy and paste the code once, but that when the same code is replicated three times, it should be extracted into a new procedure (Martin Fowler)
  • Before a package is truly reusable, it must have been reused at least three times (Jack Ganssle)
    • "We're not smart enough to truly understand the range of applications where a chunk of software may be used. Every domain requires its own unique features and tweaks; till we've actually used the code several times, over a wide enough range of apps, we won't have generalized it enough to have it truly reusable." (Jack Ganssle)
  • Reuse works best when done in large sections of code - think about reusing entire drivers or libraries, not functions (Jack Ganssle)
    • Richard Selby found that, when porting old code to a new project, if more than about 25% gets modified there's not much of a schedule boost

Optimization

  • Premature optimization is a waste of time
    • "More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason — including blind stupidity." (W.A. Wulf)
    • "The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet." (Michael A. Jackson)
  • Only optimize code after you have profiled it to identify the problem area
    • "Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you have proven that's where the bottleneck is." (Rob Pike)
    • "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified" (Donald Knuth)
  • The Pareto principle can be applied to resource optimization: 80% of resources are used by 20% of operations
    • Alternatively, there is the 90/10 law in software engineering: 90% of the execution time of a program is spent executing 10% of the code
  • Algorithmic optimizations have a greater impact than micro optimizations (see the sketch after this list)
    • "Real efficiency gains come from changing the order of complexity of the algorithm, such as changing from O(N^2) to O(NlogN) complexity"
  • Never sacrifice clarity for perceived efficiency, especially when the efficiency improvement has not been proven with data
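
As a toy illustration of the algorithmic point (an invented example, not taken from any of the sources above): counting elements that have already appeared with a nested loop is O(N^2), while a hash set does the same job in roughly O(N) on average. Changing the algorithm, not micro-tuning the loop body, is what changes the complexity class.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// O(N^2): compare every element against every later element.
std::size_t count_duplicates_naive(const std::vector<int>& v) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        for (std::size_t j = i + 1; j < v.size(); ++j) {
            if (v[i] == v[j]) { ++count; break; }
        }
    }
    return count;
}

// O(N) on average: a hash set remembers what has already been seen.
// Returns the same count as the naive version for any input.
std::size_t count_duplicates_fast(const std::vector<int>& v) {
    std::unordered_set<int> seen;
    std::size_t count = 0;
    for (int x : v) {
        if (!seen.insert(x).second) ++count;  // insert fails -> duplicate
    }
    return count;
}
```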

Red Flags and Problem Areas

  • When developers are afraid to change a function, it's time to rewrite that code from scratch (Jack Ganssle)
  • Duplicate code is an indication of poor design or poor programming habits. It must be eliminated.
    • "Duplication is a bad practice because it makes code harder to maintain. When the rule encoded in a replicated piece of code changes, whoever maintains the code will have to change it in all places correctly. This process is error-prone and often leads to problems. If the code exists in only one place, then it can be easily changed there." (Jack Ganssle)
    • "This rule is can even be applied to small number of lines of code, or even single lines of code. For example, if you want to call a function, and then call it again when it fails, it's OK to have two call sites; however, if you want to try it five times before giving up, there should only be one call site inside a loop rather than 5 independent calls." (Jack Ganssle)
  • Avoid shared resources wherever possible (Jack Ganssle)
  • Eliminate globals!
  • Disabling interrupts tends to be A Bad Thing (Jack Ganssle)
    • Even in the best of cases it'll increase system latency and probably decrease performance
    • Increased latency leads to missed interrupts and mismanaged devices
  • Be wary of solo Enable Interrupt (EI) commands (Jack Ganssle)
    • "An EI located outside an interrupt service routine (ISR) often suggests peril - with the exception of the initial EI in the startup code." (Jack Ganssle)
    • "When the enable is not part of a DI/EI pair (and these two instructions must be very close to each other to keep latency down and maintainability up) then the code is likely a convoluted, cryptic well; plumbing these depths will age the most eager of developers." (Jack Ganssle)
  • Be wary when code is peppered with DI/EI Pairs (Jack Ganssle)
    • Excessive use of Disable Interrupt instructions suggests poor design (see the scoped-guard sketch after this list)
    • "But these DI/EI pairs slip into code in great numbers when there's a systemic design problem that yields lots of critical regions susceptible to reentrancy problems. You know how it is: chasing a bug the intrepid developer uncovers a variable trashed by context switching. Pop in quick DI/EI pair. Then there's another. And another. It's like a heroin user taking his last hit. It never ends." (Jack Ganssle)

Interrupts

  • Leave interrupts on, for all but the briefest times and in the most compelling of needs. (Jack Ganssle)
  • If you disable interrupts in a block of code, re-enable them in the same block (Jack Ganssle)
  • Keep ISRs small
    • Be wary of ISRs longer than half a page of code (Jack Ganssle)
    • In most cases, there should be little-to-no processing inside of the handler (Phillip Johnston)
    • Set a flag, add a value to a queue, and then rely on user-space code to handle more complex tasks (a sketch of this pattern follows the lists in this section)
  • Minimize ISR latency to ensure the system does not miss interrupts (Jack Ganssle)
  • Check the design of any ISR that reenables interrupts immediately before returning (Jack Ganssle)
    • Minimize critical sections within the ISR.
    • "It's perfectly fine to allow another device to interrupt an ISR! or even to allow the same interrupt to do so, given enough stack space. That suggests we should create service routines that do all of the non-reentrant stuff (like servicing hardware) early, issue the EI, and continue with the reentrant activities. Then pop registers and return." (Jack Ganssle)

Avoid the following operations within interrupt handlers: (Phillip Johnston)

  • Don't declare any non-static variables inside the handler
  • Avoid blocking function calls
  • Avoid non-reentrant function calls
  • Avoid any processing that takes non-trivial time
  • Avoid operations with locks as you can deadlock your program in an ISR
  • Avoid operations that involve dynamic memory allocations, as the allocation may require a lock and will take an indeterminate amount of time
  • Avoid stack allocations
    • Depending on your architecture and operational model, your interrupt handler may utilize the stack of the interrupted thread or a common "interrupt stack".
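
Below is a minimal sketch of the "set a flag, queue the data, defer the work" pattern described above, written against a hypothetical UART (uart_read_data_register() is a placeholder, and ring-buffer overflow handling is omitted for brevity). The ISR only captures a byte and sets a flag; all real processing happens later in thread or main-loop context.

```cpp
#include <array>
#include <atomic>
#include <cstdint>

// Hypothetical HAL call: read one received byte from the UART data register.
extern "C" std::uint8_t uart_read_data_register();

namespace {
std::array<std::uint8_t, 64> rx_buffer;   // single-producer/single-consumer ring buffer
std::atomic<std::uint32_t> head{0};       // written only by the ISR
std::atomic<std::uint32_t> tail{0};       // written only by the consumer
std::atomic<bool> data_ready{false};      // "work is pending" flag
}

// Keep the ISR tiny: capture the byte, advance the index, set a flag.
// No locks, no heap allocation, no blocking calls, no heavy processing.
extern "C" void uart_rx_isr() {
    const std::uint32_t h = head.load(std::memory_order_relaxed);
    rx_buffer[h % rx_buffer.size()] = uart_read_data_register();
    head.store(h + 1, std::memory_order_release);
    data_ready.store(true, std::memory_order_release);
}

// Called from the main loop or a worker thread: do the real work here.
void process_pending_rx() {
    if (!data_ready.exchange(false, std::memory_order_acquire)) {
        return;  // nothing queued since the last check
    }
    while (tail.load(std::memory_order_relaxed) !=
           head.load(std::memory_order_acquire)) {
        const std::uint32_t t = tail.load(std::memory_order_relaxed);
        const std::uint8_t byte = rx_buffer[t % rx_buffer.size()];
        tail.store(t + 1, std::memory_order_release);
        (void)byte;  // ...parse the protocol, feed a state machine, etc...
    }
}
```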

Function Point Rules of Thumb

A function point is a measure of the amount of functionality in a piece of software. One C function point is about 130 lines of code, on average.

Here are Capers Jones's rules of thumb, where "FP" means function points. These were extracted from Jack Ganssle's newsletter. A small estimator sketch follows the list.

  • Approximate number of bugs injected in a project: FP^1.25
    • Manual code inspections will find about 65% of the bugs. The number is much higher for very disciplined teams.
  • Number of people on the project is about: FP/150
  • Approximate page count for paper documents associated with a project: FP^1.15
  • Each test strategy will find about 30% of the bugs that exist.
  • The schedule in months is about: FP^0.4
  • Full time number of people required to maintain a project after release: FP/750
  • Requirements grow about 2%/month from the design through coding phases.
  • Rough number of test cases that will be created: FP^1.2
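
These relationships are easy to turn into a back-of-the-envelope estimator. The sketch below simply encodes the exponents quoted above; treat the output as an order-of-magnitude sanity check, not a schedule.

```cpp
#include <cmath>
#include <cstdio>

// Rough project estimates from a function point count, using the
// Capers Jones exponents quoted above (via Jack Ganssle's newsletter).
void print_fp_estimates(double fp) {
    std::printf("Function points:    %.0f\n", fp);
    std::printf("Approx. C LOC:      %.0f\n", fp * 130.0);
    std::printf("Bugs injected:      %.0f\n", std::pow(fp, 1.25));
    std::printf("Team size:          %.1f\n", fp / 150.0);
    std::printf("Document pages:     %.0f\n", std::pow(fp, 1.15));
    std::printf("Schedule (months):  %.1f\n", std::pow(fp, 0.4));
    std::printf("Maintenance staff:  %.2f\n", fp / 750.0);
    std::printf("Test cases:         %.0f\n", std::pow(fp, 1.2));
}

int main() {
    print_fp_estimates(100.0);  // roughly a 13,000-line C project
}
```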

Further Reading

Change Log

  • 20181228:
    • Added another rule of thumb regarding complexity
  • 20181219:
    • Added additional metrics from Capers Jones
    • Added table of contents
    • Links now open in external tabs
    • Added additional links to Further Reading
  • 20190531:
    • Added laws from Akin's Laws of Spacecraft Design
  • 20190627:

Choosing the Right Container: Sequential Containers

In a previous article, I provided an overview of the C++ standard container types. We've also taken a look at general rules of thumb for selecting a container, as well as a more detailed look at rules of thumb for associative containers. Let's take a look at guidelines we can use to select the right sequential container.

Sequential Container Review

SequenceContainers should be used when you care that your memory is stored sequentially or when you want to access the data sequentially. Here's a quick summary of the sequential containers:

  • std::array - static contiguous array, providing fast access but with a fixed number of elements
  • std::vector - dynamic contiguous array, providing fast access but costly insertions/deletions
  • std::deque - double-ended queue providing efficient insertion/deletion at the front and back of a sequence
  • std::list and std::forward_list - linked list data structures, allowing for efficient insertion/deletion into the middle of a sequence.

Container Adapters

Container adapters are a special type of container class. They are not full container classes on their own, but wrappers around other container types (such as a vector, deque, or list). The container adapters encapsulate the underlying container type and limit the user interfaces accordingly.

The standard container adapters are:

  • stack - adapter providing a LIFO data structure
  • queue - adapter providing a FIFO data structure
  • priority_queue - adapter providing a priority queue, which allows for constant-time lookup of the largest element (by default)

General Rules of Thumb

There are some general rules of thumb that will guide you through most situations:

  • Use sequential containers when you need to access elements by position
  • Use std::vector as your default sequential container, especially as an alternative to built-in arrays
  • If you add or remove elements frequently at both the front and back of a container, use std::deque
  • Use a std::list (not std::deque) if you need to insert/remove elements in the middle of the sequence
  • Do not use std::list if you need random access to objects
  • Prefer std::vector over std::list if your system uses a cache
  • std::string is almost always better than a C-string
  • If you need to limit the interfaces, use a container adapter

Memory allocation may also be a factor in your decision. Here are the general rules of thumb for how the different sequential containers are storing memory:

  • std::vector, std::array, and std::string store memory contiguously and are compatible with C-style APIs
  • std::deque allocates memory in chunks
  • std::list allocates memory by node

Container Adapters

Container adapters take a container type and limit the interfaces to provide a specific data type. These containers will exhibit the characteristics of their underlying data structures, so you may not always want to utilize the default underlying container.

Generally you will want to:

  • Use std::stack when you need a LIFO data structure
  • Use std::queue when you need a FIFO data structure

std::priority_queue is a bit more complex. The container utilizes a Compare function (std::less by default) and provides constant-time lookup of the largest (by default) element in the container. This structure is useful for implementing things like an ISR handler queue, where you always want to be working through the highest priority interrupt in the queue. This feature comes with an increased cost in element insertion.
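
Here is a brief sketch of that idea. The Event type and its priority scheme are invented for illustration; the point is that a custom comparator lets std::priority_queue always surface the most urgent item first.

```cpp
#include <cstdio>
#include <queue>
#include <vector>

struct Event {
    int priority;  // higher value = more urgent (scheme chosen for this example)
    int id;
};

// std::priority_queue keeps the "largest" element on top, so compare by priority.
struct ByPriority {
    bool operator()(const Event& a, const Event& b) const {
        return a.priority < b.priority;
    }
};

int main() {
    std::priority_queue<Event, std::vector<Event>, ByPriority> pending;
    pending.push({1, 42});
    pending.push({5, 7});
    pending.push({3, 19});

    while (!pending.empty()) {
        std::printf("handling event %d (priority %d)\n",
                    pending.top().id, pending.top().priority);
        pending.pop();  // O(log N) push/pop, O(1) top()
    }
    // Output order: priority 5, then 3, then 1.
}
```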

Container Analysis

Let's take a look at the various sequential containers. Since the container adapters are simply add-ons to the primary sequential containers, we will not be evaluating those below. Select the underlying container appropriately for your use case, and use the adapter to implement the specific interface you need.

std::vector

Your default sequential container should be std::vector. Generally, std::vector will provide you with the right balance of performance and flexibility. The std::vector container is similar to a C-style array that can grow or shrink during runtime. The underlying buffer is stored contiguously and is guaranteed to be compatible with C-style arrays.

Consider using a std::vector if:

  • You need your data to be stored contiguously in memory
    • Especially useful for C-style API compatibility
  • You do not know the size at compile time
  • You need efficient random access to your elements (O(1))
  • You will be adding and removing elements from the end
  • You want to iterate over the elements in any order

Avoid using a std::vector if:

  • You will frequently add or remove elements to the front or middle of the sequence
  • The size of your buffer is constant and known in advance (prefer std::array)

Be aware of the std::vector<bool> specialization: since C++98, std::vector<bool> has been specialized so that each element occupies only a single bit. When you access an individual element, the accessors return either a proxy object or a bool constructed from the value of that bit, rather than a reference to an underlying bool.
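
A short sketch of those guidelines in practice (c_style_print here is just a stand-in for any C API that takes a pointer and a length): reserve() avoids repeated reallocation when the final size is roughly known, and data() exposes the contiguous buffer.

```cpp
#include <cstdio>
#include <vector>

// Stand-in for a C-style API that expects a pointer plus a length.
static void c_style_print(const int* buffer, unsigned count) {
    for (unsigned i = 0; i < count; ++i) std::printf("%d ", buffer[i]);
    std::printf("\n");
}

int main() {
    std::vector<int> samples;
    samples.reserve(8);                 // one allocation instead of several
    for (int i = 0; i < 8; ++i) {
        samples.push_back(i * i);       // amortized O(1) growth at the end
    }

    // Contiguous storage is guaranteed, so data()/size() interoperate with C APIs.
    c_style_print(samples.data(), static_cast<unsigned>(samples.size()));
}
```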

std::array

The std::array container is the most like a built-in array, but it offers extra features such as bounds-checked access via at(), knowledge of its own size, and value semantics. Unlike std::vector, the size of std::array is fixed and cannot change during runtime.

Consider using a std::array if:

  • You need your data to be stored contiguously in memory
    • Especially useful for C-style API compatibility
  • The size of your array is known in advance
  • You need efficient random access to your elements (O(1))
  • You want to iterate over the elements in any order

Avoid using a std::array if:

  • You need to insert or remove elements
  • You don't know the size of your array at compile time
  • You need to be able to resize your array dynamically
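
For comparison, a minimal std::array sketch: the element count is part of the type, no heap allocation is involved, at() gives bounds-checked access, and the container still works with standard algorithms.

```cpp
#include <array>
#include <cstdio>
#include <numeric>

int main() {
    std::array<int, 4> adc_counts{};   // fixed size, value-initialized to zero
    adc_counts.at(2) = 42;             // at() is bounds-checked; [] is not

    const int sum = std::accumulate(adc_counts.begin(), adc_counts.end(), 0);
    std::printf("sum=%d size=%zu first=%d\n",
                sum, adc_counts.size(), *adc_counts.data());
}
```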

std::deque

The std::deque container gets its name from a shortening of "double ended queue". The std::deque container is most efficient when appending items to the front or back of a queue. Unlike std::vector, std::deque does not provide a mechanism to reserve a buffer. The underlying buffer is also not guaranteed to be compatible with C-style array APIs.

Consider using std::deque if:

  • You need to insert new elements at both the front and back of a sequence (e.g. in a scheduler)
  • You need efficient random access to your elements (O(1))
  • You want the internal buffer to automatically shrink when elements are removed
  • You want to iterate over the elements in any order

Avoid using std::deque if:

  • You need to maintain compatibility with C-style APIs
  • You need to reserve memory ahead of time
  • You need to frequently insert or remove elements from the middle of the sequence
    • Calling insert in the middle of a std::deque invalidates all iterators and references to its elements
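
A small sketch of the double-ended behavior using a toy work queue: routine jobs are appended to the back while urgent jobs are pushed to the front, both in constant time.

```cpp
#include <cstdio>
#include <deque>
#include <string>

int main() {
    std::deque<std::string> work_queue;

    work_queue.push_back("routine: log heartbeat");  // normal jobs go to the back
    work_queue.push_back("routine: poll sensor");
    work_queue.push_front("urgent: handle fault");   // urgent jobs jump the line

    while (!work_queue.empty()) {
        std::printf("%s\n", work_queue.front().c_str());
        work_queue.pop_front();                      // O(1) at either end
    }
}
```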

std::list

The std::list and std::forward_list containers implement linked list data structures. std::list provides a doubly-linked list, while std::forward_list is singly linked: each node only holds a pointer to the next element. Unlike the other sequential containers, the list types do not provide efficient random access to elements. Elements must be traversed in order.

Consider using std::list if:

  • You need to store many items but the number is unknown
  • You need to insert or remove new elements from any position in the sequence
  • You do not need efficient access to random elements
  • You want the ability to move elements or sets of elements within the container or between different containers
  • You want to implement a node-wise memory allocation scheme

Avoid using std::list if:

  • You need to maintain compatibility with C-style APIs
  • You need efficient access to random elements
  • Your system utilizes a cache (prefer std::vector for reduced cache misses)
  • The size of your data is known in advance and can be managed by a std::vector
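
Finally, a brief sketch of the "move elements between containers" point: std::list::splice relinks nodes from one list into another without copying elements or invalidating their iterators.

```cpp
#include <cstdio>
#include <iterator>
#include <list>

int main() {
    std::list<int> active{1, 2, 3, 4, 5};
    std::list<int> retired;

    // Move every even value from 'active' into 'retired' by relinking nodes;
    // no element is copied or reallocated.
    for (auto it = active.begin(); it != active.end(); ) {
        auto next = std::next(it);
        if (*it % 2 == 0) {
            retired.splice(retired.end(), active, it);
        }
        it = next;
    }

    std::printf("active:");
    for (int v : active)  std::printf(" %d", v);
    std::printf("\nretired:");
    for (int v : retired) std::printf(" %d", v);
    std::printf("\n");   // active: 1 3 5, retired: 2 4
}
```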

Up Next

That wraps up my general overview of C++ STL containers. In future articles, I will be using these STL containers to implement embedded systems programming constructs.


Debugging: 9 Indispensable Rules

The official title of this book is quite a mouthful: Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems, by David J. Agans.

Agans distills debugging techniques down to nine essential rules and includes engineering war stories that demonstrate these principles. This book helped me crystallize my debugging strategy and provided the language I now use to describe my techniques. This book also reinforced the importance of techniques that I have only loosely employed. I'm certain many of you are similarly guilty of violating the principles of "Change One Thing at a Time" and "Keep an Audit Trail"!

Here's the full list of rules:

  • Understand the System
  • Make it Fail
  • Quit Thinking and Look
  • Divide and Conquer
  • Change One Thing at a Time
  • Keep an Audit Trail
  • Check the Plug
  • Get a Fresh View
  • If You Didn't Fix it, It Ain't Fixed

I highly recommend reading this book if you want to improve your debugging chops and engineering skills. Even veteran debuggers will likely learn a thing or two.


You can find out more at Agans's website Debugging Rules!

Buy the Book

If you are interested in purchasing this book, you can support Embedded Artistry by using our Amazon affiliate link:

My Highlights

When it took us a long time to find a bug, it was because we had neglected some essential, fundamental rule; once we applied the rule, we quickly found the problem.

People who excelled at quick debugging inherently understood and applied these rules. Those who struggled to understand or use these rules struggled to find bugs.

Debuggers who naturally use these rules are hard to find. I like to ask job applicants, “What rules of thumb do you use when debugging?” It’s amazing how many say, “It’s an art.” Great—we’re going to have Picasso debugging our image-processing algorithm. The easy way and the artistic way do not find problems quickly.

Quality process techniques are valuable, but they’re often not implemented; even when they are, they leave some bugs in the system.

Once you have bugs, you have to detect them; this takes place in your quality assurance (QA) department or, if you don’t have one of those, at your customer site. This book doesn’t deal with this stage either—test coverage analysis, test automation, and other QA techniques are well handled by other resources.

Debugging usually means figuring out why a design doesn’t work as planned. Troubleshooting usually means figuring out what’s broken in a particular copy of a product when the product’s design is known to be good—there’s a deleted file, a broken wire, or a bad part.

You need a working knowledge of what the system is supposed to do, how it’s designed, and, in some cases, why it was designed that way. If you don’t understand some part of the system, that always seems to be where the problem is. (This is not just Murphy’s Law; if you don’t understand it when you design it, you’re more likely to mess up.)

The essence of “Understand the System” is, “Read the manual.” Contrary to my dad’s comment, read it first—before all else fails. When you buy something, the manual tells you what you’re supposed to do to it, and how it’s supposed to act as a result. You need to read this from cover to cover and understand it in order to get the results you want. Sometimes you’ll find that it can’t do what you want—you bought the wrong thing.

A caution here: Don’t necessarily trust this information. Manuals (and engineers with Beemers in their eyes) can be wrong, and many a difficult bug arises from this. But you still need to know what they thought they built, even if you have to take that information with a bag of salt.

There’s a side benefit to understanding your own systems, too. When you do find the bugs, you’ll need to fix them without breaking anything else. Understanding what the system is supposed to do is the first step toward not breaking it.

the function that you assume you understand is the one that bites you. The parts of the schematic that you ignore are where the noise is coming from. That little line on the data sheet that specifies an obscure timing parameter can be the one that matters.

Reference designs and sample programs tell you one way to use a product, and sometimes this is all the documentation you get. Be careful with such designs, however; they are often created by people who know their product but don’t follow good design practices, or don’t design for real-world applications. (Lack of error recovery is the most popular shortcut.) Don’t just lift the design; you’ll find the bugs in it later if you don’t find them at first.

When there are parts of the system that are “black boxes,” meaning that you don’t know what’s inside them, knowing how they’re supposed to interact with other parts allows you to at least locate the problem as being inside the box or outside the box.

You also have to know the limitations of your tools. Stepping through source code shows logic errors but not timing or multithread problems; profiling tools can expose timing problems but not logic flaws.

Now, if Charlie were working at my company, and you asked him, “What do you do when you find a failure?” he would answer, “Try to make it fail again.” (Charlie is well trained.) There are three reasons for trying to make it fail:

  1. So you can look at it.
  2. So you can focus on the cause. Knowing under exactly what conditions it will fail helps you focus on probable causes.
  3. So you can tell if you’ve fixed it. Once you think you’ve fixed the problem, having a surefire way to make it fail gives you a surefire test of whether you fixed it.

If without the fix it fails 100 percent of the time when you do X, and with the fix it fails zero times when you do X, you know you’ve really fixed the bug. (This is not silly. Many times an engineer will change the software to fix a bug, then test the new software under different conditions from those that exposed the bug. It would have worked even if he had typed limericks into the code, but he goes home happy. And weeks later, in testing or, worse, at the customer site, it fails again. More on this later, too.)

You have enough bugs already; don’t try to create new ones. Use instrumentation to look at what’s going wrong (see Rule 3: Quit Thinking and Look), but don’t change the mechanism; that’s what’s causing the failure.

How many engineers does it take to fix a lightbulb? A: None; they all say, “I can’t reproduce it—the lightbulb in my office works fine.”

Automation can make an intermittent problem happen much more quickly, as in the TV game story. Amplification can make a subtle problem much more obvious, as in the leaky window example, where I could locate the leak better with a hose than with the occasional rainstorm. Both of these techniques help stimulate the failure, without simulating the mechanism that’s failing. Make your changes at a high enough level that they don’t affect how the system fails, just how often.

What can you do to control these other conditions? First of all, figure out what they are. In software, look for uninitialized data (tsk, tsk!), random data input, timing variations, multithread synchronization, and outside devices (like the phone network or the six thousand kids clicking on your Web site). In hardware, look for noise, vibration, temperature, timing, and parts variations (type or vendor). In my all-wheel-drive example, the problem would have seemed intermittent if I hadn’t noticed the temperature and the speed.

Once you have an idea of what conditions might be affecting the system, you simply have to try a lot of variations. Initialize those arrays and put a known pattern into the inputs of your erratic software. Try to control the timing and then vary it to see if you can get the system to fail at a particular setting. Shake, heat, chill, inject noise into, and tweak the clock speed and the power supply voltage of that unreliable circuit board until you see some change in the frequency of failure.

Sometimes you’ll find that controlling a condition makes the problem go away. You’ve discovered something—what condition, when random, is causing the failure. If this happens, of course, you want to try every possible value of that condition until you hit the one that causes the system to fail. Try every possible input data pattern if a random one fails occasionally and a controlled one doesn’t.

One caution: Watch out that the amplified condition isn’t just causing a new error. If a board has a temperature-sensitive error and you decide to vibrate it until all the chips come loose, you’ll get more errors, but they won’t have anything to do with the original problem.

You have to be able to look at the failure. If it doesn’t happen every time, you have to look at it each time it fails, while ignoring the many times it doesn’t fail. The key is to capture information on every run so you can look at it after you know that it’s failed. Do this by having the system output as much information as possible while it’s running and recording this information in a “debug log” file.

By looking at captured information, you can easily compare a bad run to a good one (see Rule 5: Change One Thing at a Time). If you capture the right information, you will be able to see some difference between a good case and a failure case. Note carefully the things that happen only in the failure cases. This is what you look at when you actually start to debug.

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” —SHERLOCK HOLMES, A SCANDAL IN BOHEMIA

We lost several months chasing the wrong thing because we guessed at the failure instead of looking at it.

Actually seeing the low-level failure is crucial. If you guess at how something is failing, you often fix something that isn’t the bug. Not only does the fix not work, but it takes time and money and may even break something else. Don’t do it.

“Quit thinking and look.” I make this statement to engineers more often than any other piece of debugging advice. We sometimes tease engineers who come up with an idea that seems pretty good on the surface, but on further examination isn’t very good at all, by saying, “Well, he’s a thinker.” All engineers are thinkers. Engineers like to think.

So why do we imagine we can find the problem by thinking about it? Because we’re engineers, and thinking is easier than looking.

What we see when we note the bug is the result of the failure: I turned on the switch and the light didn’t come on. But what was the actual failure? Was it that the electricity couldn’t get through the broken switch, or that it couldn’t get through the broken bulb filament? (Or did I flip the wrong switch?) You have to look closely to see the failure in enough detail to debug it.

Many problems are easily misinterpreted if you can’t see all the way to what’s actually happening. You end up fixing something that you’ve guessed is the problem, but in fact it was something completely different that failed.

How deep should you go before you stop looking and start thinking again? The simple answer is, “Keep looking until the failure you can see has a limited number of possible causes to examine.”

Experience helps here, as does understanding your system. As you make and chase bad guesses, you’ll get a feel for how deep you have to see in a given case. You’ll know when the failure you see implicates a small enough piece of the design. And you’ll understand that the measure of a good debugger is not how soon you come up with a guess or how good your guesses are, but how few bad guesses you actually act on.

Seeing the failure in low-level detail has another advantage in dealing with intermittent bugs, which we’ve discussed before and will discuss again: Once you have this view of the failure, when you think you’ve fixed the bug, it’s easy to prove that you did fix the bug. You don’t have to rely on statistics; you can see that the error doesn’t happen anymore. When our senior engineer fixed the noise problem on our slave microprocessor, he could see that the glitch in the write pulse was gone.

In the world of electronic hardware, this means test points, test points, and more test points. Add a test connector to allow easy access to buses and important signals. These days, with programmable gate arrays and application-specific integrated circuits, the problem is often buried inside a chunk of logic that you can’t get into with external instruments, so the more signals you can bring out of the chip, the better off you’ll be.

As mentioned in “Make It Fail,” even minor changes can affect the system enough to hide the bug completely. Instrumentation is one of those changes, so after you’ve added instrumentation to a failing system, make it fail again to prove that Heisenberg isn’t biting you.

“Quit Thinking and Look” doesn’t mean that you don’t ever make any guesses about what might be wrong. Guessing is a pretty good thing, especially if you understand the system. Your guesses may even be pretty close, but you should guess only to focus the search. You still have to confirm that your guess is correct by seeing the failure before you go about trying to fix the failure.

So don’t trust your guesses too much; often they’re way off and will lead you down the wrong path. If it turns out that careful instrumentation doesn’t confirm a particular guess, then it’s time to back up and guess again. (Or reconsult your Ouija board, or throw another dart at your “bug cause” dartboard—whatever your methodology may be. I recommend the use of Rule 4: “Divide and Conquer.”)

An exception: One reason for guessing in a particular way is that some problems are more likely than others or easier to fix than others, so you check those out first. In fact, when you make a particular guess because that problem is both very likely and easy to fix, that’s the one time you should try a fix without actually seeing the details of the failure.

You can think up thousands of possible reasons for a failure. You can see only the actual cause.

  • See the failure. The senior engineer saw the real failure and was able to find the cause. The junior guys thought they knew what the failure was and fixed something that wasn’t broken.
  • See the details. Don’t stop when you hear the pump. Go down to the basement and find out which pump.
  • Build instrumentation in. Use source code debuggers, debug logs, status messages, flashing lights, and rotten egg odors.
  • Add instrumentation on. Use analyzers, scopes, meters, metal detectors, electrocardiography machines, and soap bubbles.
  • Don’t be afraid to dive in. So it’s production software. It’s broken, and you’ll have to open it up to fix it.
  • Watch out for Heisenberg. Don’t let your instruments overwhelm your system.
  • Guess only to focus the search. Go ahead and guess that the memory timing is bad, but look at it before you build a timing fixer.

After he did this with all the pins, he plugged in the lines and proved that the terminals worked. Then he unplugged everything, reassembled the box, plugged everything back in, and reconfirmed that the terminals worked. This was to accommodate Goldberg’s Corollary to Murphy’s Law, which states that reassembling any more than is absolutely necessary before testing makes it probable that you have not fixed the problem and will have to disassemble everything again, with a probability that increases in proportion to the amount of reassembly effort involved.

But the rule that our technician demonstrated particularly well with this debugging session was “Divide and Conquer.” He narrowed the search by repeatedly splitting up the search space into a good half and a bad half, then looking further into the bad half for the problem.

Narrow the search. Home in on the problem. Find the range of the target. The common technique used in any efficient target search is called successive approximation—you want to find something within a range of possibilities, so you start at one end of the range, then go halfway to the other end and see if you’re past it or not. If you’re past it, you go to one-fourth and try again. If you’re not past it, you go to three-fourths and try again. Each try, you figure out which direction the target is from where you are and move half the distance of the previous move toward the target. After a small number of attempts, you home right in on it.

When you inject known input patterns, of course, you should be careful that you don’t change the bug by setting up new conditions. If the bug is pattern dependent, putting in an artificial pattern may hide the problem. “Make It Fail” before proceeding.

So when you do figure out one of several simultaneous problems, fix it right away, before you look for the others.

I’ve often heard someone say, “Well, that’s broken, but it couldn’t possibly affect the problem we’re trying to find.” Guess what—it can, and it often does. If you fix something that you know is wrong, you get a clean look at the other issues. Our hotel technician was able to see the direction of the bad wiring on the really slow terminal only after he fixed the high resistances in both directions in the breakout box.

A corollary to the previous rule is that certain kinds of bugs are likely to cause other bugs, so you should look for and fix them first. In hardware, noisy signals cause all kinds of hard-to-find, intermittent problems. Glitches and ringing on clocks, noise on analog signals, jittery timing, and bad voltage levels need to be taken care of before you look at other problems; the other problems are often very unpredictable and go away when you fix the noise. In software, bad multithread synchronization, accidentally reentrant routines, and uninitialized variables inject that extra shot of randomness that can make your job hell.

It’s also easy to become a perfectionist and start “fixing” every instance of bad design practice you find, in the interest of general quality. You can eliminate the GOTOs in your predecessor’s code simply because you consider them nasty, but if they aren’t actually causing problems, you’re usually better off leaving them alone.

When his change didn’t fix the problem, he should have backed it out immediately.

Change one thing at a time. You’ve heard of the shotgun approach? Forget it. Get yourself a good rifle. You’ll be a lot better at fixing bugs.

If you’re working on a mortgage calculation program that seems to be messing up on occasional loans, pin down the loan amount and the term, and vary the interest rate to see that the program does the right thing. If that works, pin down the term and interest rate, and vary the loan amount. If that works, vary just the term. You’ll either find the problem in your calculation or discover something really surprising like a math error in your Pentium processor. (“But that can’t happen!”) This kind of surprise happens with the more complex bugs; that’s what makes them complex. Isolating and controlling variables is kind of like putting known data into the system: It helps you see the surprises.

Sometimes, changing the test sequence or some operating parameter makes a problem occur more regularly; this helps you see the failure and may be a great clue to what’s going on. But you should still change only one thing at a time so that you can tell exactly which parameter had the effect. And if a change doesn’t seem to have an effect, back it out right away!

What you’re looking for is never the same as last time. And it takes a fair amount of knowledge and intelligence to sift through the irrelevant differences, differences that were caused by timing or other factors. This knowledge is beyond what a beginner has, and this intelligence is beyond what software can do. (A.I. waved good-bye, remember?) The most that software can do is help you to format and filter logs so that when you apply your superior human brain (you do have a superior one, don’t you?) to analyzing the logs, the differences (and possibly the cause of the differences) will jump right out at that brain.

Sometimes the difference between a working system and a broken one is that a design change was made. When this happens, a good system starts failing. It’s very helpful to figure out which version first caused the system to fail, even if this involves going back and testing successively older versions until the failure goes away.

The point of this story is that sometimes it’s the most insignificant-seeming thing that’s actually the key to making a bug happen. What seems insignificant to the person doing the testing (the plaid shirt) may be important to the person trying to fix the problem. And what seems obvious to the tester (the chip had to be restarted) may be completely missed by the fixer. So you have to take note of everything—on the off chance that it might be important and nonobvious.

Keep an audit trail. As you investigate a problem, write down what you did, what order you did it in, and what happened as a result. Do this every time. It’s just like instrumenting the software or the hardware—you’re instrumenting the test sequence. You have to be able to see what each step was, and what the result was, to determine which step to focus on during debugging.

Unfortunately, while the value of an audit trail is often accepted, the level of detail required is not, so a lot of important information gets left out. What kind of system was running? What was the sequence of events leading up to the failure? And sometimes even, what was the actual failure? (Duh!) Sometimes the report just says, “It’s broken.” It doesn’t say that the graphics were completely garbled or that all the red areas came up green or that the third number was wrong. It just says it failed.

Tool control is critical to the accurate re-creation of a version, and you should make sure you have it. As discussed later, unrecognized tool variations can cause some very strange effects.

Never trust your memory with a detail—write it down. If you trust your memory, a number of things will happen. You’ll forget the details that you didn’t think were important at the time, and those, of course, will prove to be the critical ones. You’ll forget the details that actually weren’t important to you, but might be important to someone else working on a different problem later.

You won’t be able to transmit information to anyone else except verbally, which wastes everybody’s time, assuming you’re still around to talk about it. And you won’t be able to remember exactly how things happened and in what order and how events related to one another, all of which is crucial information.

Write it down. It’s better to write it electronically so you can make backup copies, attach it to bug reports, distribute it to others easily, and maybe even filter it with automated analysis tools later. Write down what you did and what happened as a result. Save your debug logs and traces, and annotate them with related events and effects that they don’t inherently record themselves. Write down your theories and your fixes. Write it all down.

This is more likely to happen with what I call “overhead” or “foundation” factors. Because they’re general requirements (electricity, heat, clock), they get overlooked when you debug the details.

It’s classic to say, “Hmm, this new code works just like the old code” and then find out that, in fact, you didn’t actually load the new code. You loaded the old code, or you loaded the new code but it’s still executing the old code because you didn’t reboot your computer or you left an easier-to-find copy of the old code on your system.

Speaking of starting your car, another aspect to consider is whether the start-up conditions are correct. You may have the power plugged in, but did you hit the start button? Has the graphics driver been initialized? Has the chip been reset? Have the registers been programmed correctly? Did you push the primer button on your weed whacker three times? Did you set the choke? Did you set the on/off switch to on? (I usually notice this one after six or seven fruitless pulls.)

If you depend on memory being initialized before your program runs, but you don’t do it explicitly, it’s even worse—sometimes startup conditions will be correct. But not when you demo the program to the investors.

Your bad assumptions may not be about the product you’re building, but rather about the tools you’re using to build it, as in the consultant story.

Default settings are a common problem. Building for the wrong environment is another—if you use a Macintosh compiler, you obviously can’t run the program on an Intel PC, but what about your libraries and other common code resources?

It may not be just your assumptions about the tools that are bad—the tools may have bugs, too. (Actually, even this is just your bad assumption that the tool is bug-free. It was built by engineers; why would it be any more trustworthy than what you’re building?)

“Nothing clears up a case so much as stating it to another person.” —SHERLOCK HOLMES, SILVER BLAZE

There are at least three reasons to ask for help, not counting the desire to dump the whole problem into someone else’s lap: a fresh view, expertise, and experience. And people are usually willing to help because it gives them a chance to demonstrate how clever they are.

It’s hard to see the big picture from the bottom of a rut. We’re all human. We all have our biases about everything, including where a bug is hiding. Those biases can keep us from seeing what’s really going on. Someone who comes at the problem from an unbiased (actually, differently biased) viewpoint can give us great insights and trigger new approaches. If nothing else, that person can at least tell you it looks like you’ve got a nasty problem there and offer you a shoulder to cry on.

In fact, sometimes explaining the problem to someone else gives you a fresh view, and you solve the problem yourself. Just organizing the facts forces you out of the rut you were in. I’ve even heard of a company that has a room with a mannequin in it—you go explain your problems to the mannequin first. I imagine the mannequin is quite useful and contributes to the quick solution of a number of problems. (It’s probably more interactive than some people you work with. I bet it’s also a very forgiving listener, and none of your regrettable misconceptions will show up in next year’s salary review.)

There are occasions where a part of the system is a mystery to us; rather than go to school for a year to learn about it, we can ask an expert and learn what we need to know quickly. But be sure that your expert really is an expert on the subject—if he gives you vague, buzzword-laden theories, he’s a technical charlatan and won’t be helpful. If he tells you it’ll take thirty hours to research and prepare a report, he’s a consultant and may be helpful, but at a price.

In any case, experts “Understand the System” better than we do, so they know the road map and can give us great search hints. And when we’ve found the bug, they can help us design a proper fix that won’t mess up the rest of the system.

You may not have a whole lot of experience, but there may be people around you who have seen this situation before and, given a quick description of what’s going on, can tell you exactly what’s wrong, like the dome light short in the story. Like experts, people with experience in a specific area can be hard to find, and thus expensive. It may be worth the money:

If you’re dealing with equipment or software from a third-party vendor, e-mail or call that vendor. (If nothing else, the vendor will appreciate the bug report.) Usually, you’ll be told about some common misunderstanding you’ve got; remember, the vendor has both expertise and experience with its product.

This is a good time to reiterate “Read the Manual.” In fact, this is where the advice becomes “When All Else Fails, Read the Manual Again.” Yes, you already read it because you followed the first rule faithfully. Look at it again, with your newfound focus on your particular problem—you may see or understand something that you didn’t before.

You may be afraid to ask for help; you may think it’s a sign of incompetence. On the contrary, it’s a sign of true eagerness to get the bug fixed. If you bring in the right insight, expertise, and/or experience, you get the bug fixed faster. That doesn’t reflect poorly on you; if anything, it means you chose your help wisely.

The opposite is also true: Don’t assume that you’re an idiot and the expert is a god. Sometimes experts screw up, and you can go crazy if you assume it’s your mistake.

No matter what kind of help you bring in, when you describe the problem, keep one thing in mind: Report symptoms, not theories. The reason you went to someone else for fresh insight is that your theories aren’t getting you anywhere. If you go to somebody fresh and lay a theory on her, you drag her right down into the same rut you’re in.

I’ve seen this mistake a lot; the help is brought in and immediately poisoned with the old, nonworking theories. (If the theories were any good, there’d be no need to bring in help.) In some cases, a good helper can plow through all that garbage and still get to the facts, but more often you just end up with a big crowd down in that rut.

This rule works both ways. If you’re the helper, cut off the helpee who starts to tell you theories. Cover your ears and chant “La-La-LaLa-La.” Run away. Don’t be poisoned.

If you follow the “Make It Fail” rule, you’ll know how to prove that you’ve fixed the problem. Do it! Don’t assume that the fix works; test it. No matter how obvious the problem and the fix seem to be, you can’t be sure until you test it. You might be able to charge some young sucker seventy-five bucks, but he won’t be happy about it when it still fails.

When you think you’ve fixed an engineering design, take the fix out. Make sure it’s broken again. Put the fix back in. Make sure it’s fixed again. Until you’ve cycled from fixed to broken and back to fixed again, changing only the intended fix, you haven’t proved that you fixed it.

If you didn’t fix it, it ain’t fixed. Everyone wants to believe that the bug just went away. “We can’t seem to make it fail anymore.” “It happened that couple of times, but gee, then something happened and it stopped failing.” And of course the logical conclusion is, “Maybe it won’t happen again.” Guess what? It will.

When they report the problem, it’s better to say, “Thanks! We’ve been trying for months to capture that incredibly rare occurrence; please e-mail us the log file,” than to say, “Wow, incredible. That’s never happened here.”

While we’re on the subject of design quality, I’ve often thought of ISO-9000 as a method for keeping an audit trail of the design process. In this case, the bugs you’re trying to find are in the process (neglecting vibration) rather than in the product (leaky fitting), but the audit trail works the same way. As advertised in the introduction, the methods presented in this book are truly general purpose.

“That can’t happen” is a statement made by someone who has only thought about why something can’t happen and hasn’t looked at it actually happening.

New hardware designs with intermittent errors are often the victims of noise (stray voltages on a wire that turn a 1 into a 0 or vice versa).

By introducing additional noise into the system, I was able to make it fail more often. This was a bit lucky, though. For some circuits, touching a finger is like testing that leaky window with a fire hose—the noise will cause a perfectly good circuit to fail. Sometimes the opposite will occur; the capacitance of the finger will eliminate the noise and quiet the system down.

I’ve established a Web site at http://www.debuggingrules.com that’s dedicated to the advancement of debugging skills everywhere. You should visit, if for no other reason than to download the fancy Debugging Rules poster—you can print it yourself and adorn your office wall as recommended in the book. There are also links to various other resources that you may find useful in your debugging education. And I’ll always be interested in hearing your interesting, humorous, or instructive (preferably all three) war stories; the Web site will tell you how to send them. Check it out.

You can appeal to their sense of team communication. Several of the people who reviewed drafts of the book were team leaders who consider themselves excellent debuggers, but they found that the rules crystallized what they do in terms that they could more easily communicate to their teams. They found it easy to guide their engineers by saying, “Quit Thinking and Look,” the way I’ve been doing for twenty years.
