Book Review

Timeless Laws of Software Development

I am always seeking the wisdom and insights of those who have spent decades working in software development. The experiences of those who came before us are a rich source of wisdom, information, and techniques.

Only a few problems in our field are truly new. Most of the solutions we seek have been written about time and time again over the past 50 years. Rather than continually seeking new technology as the panacea for our problems, we should focus on applying the tried-and-tested principles of our field.

Given my point of view, it's no surprise that I was immediately drawn to a book titled Timeless Laws of Software Development.

The author, Jerry Fitzpatrick, is a software instructor and consultant who has worked in a variety of industries: biomedical, fitness, oil and gas, telecommunications, and manufacturing. Even more impressive for someone writing about the Timeless Laws of Software Development, Jerry was originally an electrical engineer. He worked with Bob Martin and James Grenning at Teradyne, where he developed the hardware for Teradyne's early voice response system.

Jerry has spent his career dealing with the same problems we are currently dealing with. It would be criminal not to steal and apply his hard-earned knowledge.

I recommend this invaluable book equally to developers, team leads, architects, and project managers.

Table of Contents:

  1. Structure of the Book
  2. The Timeless Laws
  3. What I Learned
  4. Selected Quotes
  5. Buy the Book

Structure of the Book

The book is short, weighing in at a total of 180 pages, including the appendices, glossary, and index. Do not be fooled by its small stature, for there is much wisdom packed into these pages.

Jerry opens with an introductory chapter and dedicates an entire chapter to each of his six Timeless Laws (discussed below). Each law is broken down into sub-axioms, paired with examples, and annotated with quotes and primary sources.

Aside from the always-useful glossary and index, Jerry ends the book with three appendices, each valuable in its own right:

  • "About Software Metrics", which covers metrics including lines of code, cyclomatic complexity, software size, and Jerry's own "ABC" metric
  • "Exploring Old Problems", which covers symptoms of the software crisis, the cost to develop software, project factors and struggles, software maintenance costs, superhuman developers, and software renovation.
  • "Redesigning a Procedure", where Jerry walks readers through a real-life refactoring exercise

"Exploring Old Problems" was an exemplary chapter. I highly recommended it to project managers and team leads.

My only real critique of the book is that the information is not partitioned in a way that makes it easily accessible to different roles - project managers may miss valuable lessons while glossing over programming details. Don't give in to the temptation to skip: each chapter has valuable advice no matter your role.

The Timeless Laws

Jerry proposes six Timeless Laws of software development:

  1. Plan before implementing
  2. Keep the program small
  3. Write clearly
  4. Prevent bugs
  5. Make the program robust
  6. Prevent excess coupling

At first glance, these six laws are so broadly stated that the natural reaction is, "Duh". Where the book shines is in the breakdown of these laws into sub-axioms and methods for achieving the intent of the law.

Breakdown of the Timeless Laws

  1. Plan before implementing
    1. Understand the requirements
    2. Reconcile conflicting requirements
    3. Check the feasibility of key requirements
    4. Convert assumptions to requirements
    5. Create a development plan
  2. Keep the program small
    1. Limit project features
    2. Avoid complicated designs
    3. Avoid needless concurrency
    4. Avoid repetition
    5. Avoid unnecessary code
    6. Minimize error logging
    7. Buy, don't build
    8. Strive for Reuse
  3. Write clearly
    1. Use names that denote purpose
    2. Use clear expressions
    3. Improve readability using whitespace
    4. Use suitable comments
    5. Use symmetry
    6. Postpone optimization
    7. Improve what you have written
  4. Prevent bugs
    1. Pace yourself
    2. Don't tolerate build warnings
    3. Manage Program Inputs
    4. Avoid using primitive types for physical quantities
    5. Reduce conditional logic
    6. Validity checks
    7. Context and polymorphism
    8. Compare floating point values correctly
  5. Make the program robust
    1. Don't let bugs accumulate
    2. Use assertions to expose bugs
    3. Design by contract
    4. Simplify exception handling
    5. Use automated testing
    6. Invite improvements
  6. Prevent excess coupling
    1. Discussion of coupling
    2. Flexibility
    3. Decoupling
    4. Abstractions (functional, data, OO)
    5. Use black boxes
    6. Prefer cohesive abstractions
    7. Minimize scope
    8. Create barriers to coupling
    9. Use atomic initialization
    10. Prefer immutable instances

What I Learned

I've regularly referred to this book over the past year. My hard copy is dog-eared, and many pages are covered in notes, circles, and arrows.

I've incorporated many aspects of the book into my development process. I've created checklists that I use for design reviews and code reviews, helping to ensure that I catch problems as early as possible. I've also created additional documentation for my projects, along with templates that make reuse easier.

Even experienced developers and teams can benefit from a review of this book. Some of the concepts may be familiar to you, but we all benefit from a refresher. There is also the chance that you will find one valuable gem to improve your practice, and isn't that worth the small price of a book?

The odds are high that you'll find more than one knowledge gem while reading Timeless Laws.

Here are some of the lessons I took away from the book:

  1. Create a development plan
  2. Avoid the "what if" game
  3. Logging is harmful
  4. Defensive programming is harmful
  5. Utilize symmetry in interface design

Create a Development Plan

We are all familiar with the lack of documentation for software projects. I'm repeatedly stunned by teams which provide no written guidance or setup instructions for new members. Jerry points out the importance of maintaining documentation:

Documentation is the only way to transfer knowledge without describing things in person.

One such method that I pulled from the book is the idea of the "Development Plan". The plan serves as a guide for developers working on the project, describing the development tools, the project, its goals, and its priorities.

As with all documentation, start simple and grow the development plan as new information becomes available or required. Rather than maintaining one large document, it's easy to break the plan up into smaller, standalone files, which makes it easier for developers to find the information they need. Keep the development plan in the repository so it is easy to locate and update; a sketch of one possible layout follows the list below.

Topics to cover in your development plan include:

  • List of development priorities
  • Code organization
  • How to set up the development environment
  • Minimum requirements for hardware, OS, compute power, etc.
  • Glossary of project terms
  • Uniform strategy for bug prevention, detection, and repair
  • Uniform strategy for program robustness
  • Coding style guidelines (if applicable)
  • Programming languages to be used, and where they are used
  • Tools to be used for source control, builds, integration, testing, and deployment
  • High-level organization: projects, components, file locations, and naming conventions
  • High-level logical architecture: major sub-systems and frameworks
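
As one hypothetical way to split these topics into standalone files (my own sketch, not a layout prescribed by the book), the repository might carry a docs/ directory like this:

docs/
  development-plan.md    (priorities, goals, high-level architecture)
  environment-setup.md   (toolchain, minimum hardware/OS requirements)
  code-organization.md   (projects, components, file locations, naming conventions)
  coding-guidelines.md   (style rules; which languages are used, and where)
  build-and-test.md      (source control, build, integration, testing, and deployment tools)
  glossary.md            (project terms)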

Development plans are most useful for new team members, since they can refer to the document and become productive without taking much time from other developers. However, your entire team will benefit from having a uniform set of guidelines that can be easily located and referenced.

Avoid the "What If" Game

Many of us, myself included, are guilty of participating in the "what if" game. The "what if" game is prevalent among developers, especially when new ideas are proposed. The easiest way to shoot a hole in a new idea is to ask a "what if" question: "This architecture looks ok, but what if we need to support 100,000,000 connections at once?"

The "what if" game is adversarial and can occur because:

  • Humans have a natural resistance to change
  • Some people enjoy showing off their knowledge
  • Some people enjoy being adversarial
  • The dissenter dislikes the person who proposed the idea
  • The dissenter does not want to take on additional work

"What if" questions are difficult to refute, as they are often irrational. We should always account for realistic possibilities, but objections should be considered only if the person can explain why the proposal is disruptive now or is going to be disruptive in the future.

Aside from keeping conversations focused on realistic possibilities, we can blunt "what if" objections with clear, well-defined requirements.

Logging is Harmful

I have been a long-time proponent of error logging, and I’ve written many embedded logging libraries over the past decade.

While I was initially skeptical of Fitzpatrick's attitude toward error logging, I started paying closer attention to the log files I was working with and to the use of logging in my own code. I saw exactly what Jerry highlighted: my code was cluttered, my logs were increasingly useless, and removing outdated logging statements was a constant struggle.

You can read more about my thoughts on error logging in my article: The Dark Side of Error Logging.

Defensive Programming is Harmful

Somewhere along the way in my career, the idea of defensive programming was drilled into me. Many of my old libraries and programs are layered with unnecessary conditional statements and error-code returns. These checks contribute to code bloat, since they are often repeated at multiple levels in the stack.

Jerry points out that in conventional product design, designs are based on working parts, not defective ones. As such, designing our software systems based on the assumption that all modules are potentially defective leads us down the path of over-engineering.

Trust lies at the heart of defensive programming. If no module can be trusted, then defensive programming is imperative. If all modules can be trusted, then defensive programming is irrelevant.

Like conventional products, software should be based on working parts, not defective ones. Modules should be presumed to work until proven otherwise. This is not to say that we don't do any form of checking: inputs from outside of the program need to be validated.

Assertions and contracts should be used to enforce preconditions and postconditions. Creating hard failure points helps us to catch bugs as quickly as possible. Modules inside of the system should be trusted to do their job and to enforce their own requirements.
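
To make the contrast concrete, here is a minimal C sketch of my own (the function names and speed limits are hypothetical, not taken from the book) showing a defensive routine next to a contract-based one:

#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Defensive style: every check is repeated and failures are reported as
 * error codes, so a caller's bug is silently tolerated and surfaces far
 * from its cause. */
int set_motor_speed_defensive(int32_t * speed_reg, int32_t rpm)
{
    if (speed_reg == NULL) {
        return -1;
    }
    if ((rpm < 0) || (rpm > 10000)) {
        return -2;
    }
    *speed_reg = rpm;
    return 0;
}

/* Contract style: the documented preconditions are enforced with assertions,
 * creating a hard failure point that exposes the caller's bug immediately. */
void set_motor_speed(int32_t * speed_reg, int32_t rpm)
{
    assert(speed_reg != NULL);            /* precondition */
    assert((rpm >= 0) && (rpm <= 10000)); /* precondition */
    *speed_reg = rpm;
}

External inputs such as user commands or sensor readings still get validated at the program boundary; the assertions guard the agreements between trusted modules inside the program.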

Since I've transitioned toward the design-by-contract style, my code is much smaller and easier to read.

Utilize Symmetry in Interface Design

Using symmetry in interface design is one of those points that seemed obvious on the surface. Upon further inspection, I found I regularly violated symmetry rules in my interfaces.

Symmetry helps us to manage the complexity of our programs and reduce the amount of knowledge we need to keep in mind at once. Since we have existing associations with naming pairs, we can easily predict function names without needing to look them up.

Universal naming pairs should be used in public interfaces whenever possible:

  • on/off
  • start/stop
  • enable/disable
  • up/down
  • left/right
  • get/set
  • empty/full
  • push/pop
  • create/destroy

Our APIs should also be written in a consistent manner:

  • Motor::Start() / Motor::Stop()
  • motor_start() / motor_stop()
  • StartMotor() / StopMotor()

Avoid creating (and fix!) inconsistent APIs:

  • Motor::Start() / Motor::disable()
  • startMotor / stop_motor
  • start_motor / Stop_motor

Naming symmetry may be obvious, but where I am most guilty is in parameter order symmetry. Our procedures should utilize the same parameter ordering rules whenever possible.

For example, consider the copy and concatenation functions in the C standard library's string.h, such as strcpy, strcat, and memcpy. The first parameter is the destination and the second is the source, matching normal assignment semantics (dest = src).
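
As declared in string.h, the copy functions read in the same order as an assignment:

char * strcpy ( char * destination, const char * source );             // destination = source
void * memcpy ( void * destination, const void * source, size_t num ); // destination = source (num bytes)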

The standard library isn't the holy grail of symmetry, however. The stdio.h header showcases some bad symmetry by changing the location of the FILE pointer:

int fprintf ( FILE * stream, const char * format, ... );
int fscanf ( FILE * stream, const char * format, ... );

// A better design would put FILE first, but these place it last:
int fputs ( const char * str, FILE * stream );
char * fgets ( char * str, int num, FILE * stream );
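
Pulling these rules together, here is a small hypothetical interface of my own (a ring buffer sketch, not an example from the book) that combines universal naming pairs with a consistent parameter order:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ring_buffer ring_buffer_t;

// Symmetric naming pairs: create/destroy, push/pop, empty/full.
ring_buffer_t * rb_create(size_t capacity);
void rb_destroy(ring_buffer_t * rb);

bool rb_push(ring_buffer_t * rb, uint8_t value);
bool rb_pop(ring_buffer_t * rb, uint8_t * value);

bool rb_is_empty(const ring_buffer_t * rb);
bool rb_is_full(const ring_buffer_t * rb);

// Consistent parameter order: the buffer being operated on always comes first.

Because the names pair up and the buffer is always the first parameter, a reader can predict most of this interface without opening the header.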

Keeping symmetry in mind will improve the interfaces we create.

Selected Quotes

I pulled hundreds of quotes from this book, and you will be seeing many of them pop up on our Twitter feed over the next year. A small selection of my highlights is included below.

Any quotes without attribution come directly from Jerry.

Intentionally hiding a bug is the greatest sin a developer can commit.

Failure is de rigueur in our industry. Odds are, you're working on a project that will fail right now.
-- Jeff Atwood, How to Stop Sucking and Be Awesome

Writing specs is like flossing: everybody agrees that it's a good thing, but nobody does it.
-- Joel Spolsky

Documentation is the only way to transfer knowledge without describing things in person.

Robustness must be a goal and an up-front priority.

Disorder is the natural state of all things. Software tends to get larger and more complicated unless the developers push back and make it smaller and simpler. If the developers don't push back, the battle against growth is lost by default.

YAGNI (You ain't gonna need it):
Always implement things when you actually need them, never when you just foresee that you need them. The best way to implement code quickly is to implement less of it. The best way to have fewer bugs is to implement less code.
-- Ron Jeffries

Most developers write code that reflects their immediate thoughts, but never return to make it smaller or clearer.

The answer is to clear our heads of clutter. Clear thinking becomes clear writing; one can't exist without the other.
-- William Zinsser

Plan for tomorrow but implement only for today.

Code that expresses its purpose clearly - without surprises - is easier to understand and less likely to contain bugs.

Most developers realize that excess coupling is harmful but they don't resist it aggressively enough. Believe me: if you don't manage coupling, coupling will manage you.

Few people realize how badly they write.
-- William Zinsser

To help prevent bugs, concurrency should only be used when needed. When it is needed, the design and implementation should be handled carefully.

Sometimes problems are poorly understood until a solution is implemented and found lacking. For this reason, it's often best to implement a basic solution before attempting a more complete and complicated one. Adequate solutions are usually less costly than optimal ones.

I've worked with many developers who didn't seem to grasp the incredible speed at which program instructions execute. They worried about things that would have a tiny effect on performance or efficiency. They should have been worried about bug prevention and better-written code.

Most sponsors would rather have a stable program delivered on-time than a slightly faster and more efficient program delivered late.

It's better to implement features directly and clearly, then optimize any that affect users negatively.

Efficiency and performance are only problems if the requirements haven't been met. Optimization usually reduces source code clarity, so it isn't justified for small gains in efficiency or performance. Our first priorities should be correctness, clarity, and modest flexibility.

Implementation is necessarily incremental, but a good architecture is usually holistic. It requires a thorough understanding of all requirements.

Buy the Book

If you are interested in purchasing Timeless Laws of Software Development, you can support Embedded Artistry by using our Amazon affiliate link:

Related Posts

Debugging: 9 Indispensable Rules

The official title of this book is quite a mouthful: Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems, by David J. Agans.

Agans distills debugging techniques down to nine essential rules and includes engineering war stories that demonstrate these principles. This book helped me crystallize my debugging strategy and provided the language I now use to describe my techniques. It also reinforced the importance of techniques that I have only loosely employed. I'm certain many of you are similarly guilty of violating the principles of "Change One Thing at a Time" and "Keep an Audit Trail"!

Here's the full list of rules:

  • Understand the System
  • Make it Fail
  • Quit Thinking and Look
  • Divide and Conquer
  • Change One Thing at a Time
  • Keep an Audit Trail
  • Check the Plug
  • Get a Fresh View
  • If You Didn't Fix it, It Ain't Fixed

I highly recommend reading this book if you want to improve your debugging chops and engineering skills. Even veteran debuggers will likely learn a thing or two.

Debuggers who naturally use these rules are hard to find. I like to ask job applicants, “What rules of thumb do you use when debugging?” It’s amazing how many say, “It’s an art.” Great — we’re going to have Picasso debugging our image-processing algorithm. The easy way and the artistic way do not find problems quickly.

You can find out more at Agans's website Debugging Rules!

Buy the Book

If you are interested in purchasing this book, you can support Embedded Artistry by using our Amazon affiliate link:

My Highlights

When it took us a long time to find a bug, it was because we had neglected some essential, fundamental rule; once we applied the rule, we quickly found the problem.

People who excelled at quick debugging inherently understood and applied these rules. Those who struggled to understand or use these rules struggled to find bugs.

Debuggers who naturally use these rules are hard to find. I like to ask job applicants, “What rules of thumb do you use when debugging?” It’s amazing how many say, “It’s an art.” Great—we’re going to have Picasso debugging our image-processing algorithm. The easy way and the artistic way do not find problems quickly.

Quality process techniques are valuable, but they’re often not implemented; even when they are, they leave some bugs in the system.

Once you have bugs, you have to detect them; this takes place in your quality assurance (QA) department or, if you don’t have one of those, at your customer site. This book doesn’t deal with this stage either—test coverage analysis, test automation, and other QA techniques are well handled by other resources.

Debugging usually means figuring out why a design doesn’t work as planned. Troubleshooting usually means figuring out what’s broken in a particular copy of a product when the product’s design is known to be good—there’s a deleted file, a broken wire, or a bad part.

You need a working knowledge of what the system is supposed to do, how it’s designed, and, in some cases, why it was designed that way. If you don’t understand some part of the system, that always seems to be where the problem is. (This is not just Murphy’s Law; if you don’t understand it when you design it, you’re more likely to mess up.)

The essence of “Understand the System” is, “Read the manual.” Contrary to my dad’s comment, read it first—before all else fails. When you buy something, the manual tells you what you’re supposed to do to it, and how it’s supposed to act as a result. You need to read this from cover to cover and understand it in order to get the results you want. Sometimes you’ll find that it can’t do what you want—you bought the wrong thing.

A caution here: Don’t necessarily trust this information. Manuals (and engineers with Beemers in their eyes) can be wrong, and many a difficult bug arises from this. But you still need to know what they thought they built, even if you have to take that information with a bag of salt.

There’s a side benefit to understanding your own systems, too. When you do find the bugs, you’ll need to fix them without breaking anything else. Understanding what the system is supposed to do is the first step toward not breaking it.

the function that you assume you understand is the one that bites you. The parts of the schematic that you ignore are where the noise is coming from. That little line on the data sheet that specifies an obscure timing parameter can be the one that matters.

Reference designs and sample programs tell you one way to use a product, and sometimes this is all the documentation you get. Be careful with such designs, however; they are often created by people who know their product but don’t follow good design practices, or don’t design for real-world applications. (Lack of error recovery is the most popular shortcut.) Don’t just lift the design; you’ll find the bugs in it later if you don’t find them at first.

When there are parts of the system that are “black boxes,” meaning that you don’t know what’s inside them, knowing how they’re supposed to interact with other parts allows you to at least locate the problem as being inside the box or outside the box.

You also have to know the limitations of your tools. Stepping through source code shows logic errors but not timing or multithread problems; profiling tools can expose timing problems but not logic flaws.

Now, if Charlie were working at my company, and you asked him, “What do you do when you find a failure?” he would answer, “Try to make it fail again.” (Charlie is well trained.) There are three reasons for trying to make it fail:

  1. So you can look at it.
  2. So you can focus on the cause. Knowing under exactly what conditions it will fail helps you focus on probable causes.
  3. So you can tell if you’ve fixed it. Once you think you’ve fixed the problem, having a surefire way to make it fail gives you a surefire test of whether you fixed it.

If without the fix it fails 100 percent of the time when you do X, and with the fix it fails zero times when you do X, you know you’ve really fixed the bug. (This is not silly. Many times an engineer will change the software to fix a bug, then test the new software under different conditions from those that exposed the bug. It would have worked even if he had typed limericks into the code, but he goes home happy. And weeks later, in testing or, worse, at the customer site, it fails again. More on this later, too.)

You have enough bugs already; don’t try to create new ones. Use instrumentation to look at what’s going wrong (see Rule 3: Quit Thinking and Look), but don’t change the mechanism; that’s what’s causing the failure.

How many engineers does it take to fix a lightbulb? A: None; they all say, “I can’t reproduce it—the lightbulb in my office works fine.”

Automation can make an intermittent problem happen much more quickly, as in the TV game story. Amplification can make a subtle problem much more obvious, as in the leaky window example, where I could locate the leak better with a hose than with the occasional rainstorm. Both of these techniques help stimulate the failure, without simulating the mechanism that’s failing. Make your changes at a high enough level that they don’t affect how the system fails, just how often.

What can you do to control these other conditions? First of all, figure out what they are. In software, look for uninitialized data (tsk, tsk!), random data input, timing variations, multithread synchronization, and outside devices (like the phone network or the six thousand kids clicking on your Web site). In hardware, look for noise, vibration, temperature, timing, and parts variations (type or vendor). In my all-wheel-drive example, the problem would have seemed intermittent if I hadn’t noticed the temperature and the speed.

Once you have an idea of what conditions might be affecting the system, you simply have to try a lot of variations. Initialize those arrays and put a known pattern into the inputs of your erratic software. Try to control the timing and then vary it to see if you can get the system to fail at a particular setting. Shake, heat, chill, inject noise into, and tweak the clock speed and the power supply voltage of that unreliable circuit board until you see some change in the frequency of failure.

Sometimes you’ll find that controlling a condition makes the problem go away. You’ve discovered something—what condition, when random, is causing the failure. If this happens, of course, you want to try every possible value of that condition until you hit the one that causes the system to fail. Try every possible input data pattern if a random one fails occasionally and a controlled one doesn’t.

One caution: Watch out that the amplified condition isn’t just causing a new error. If a board has a temperature-sensitive error and you decide to vibrate it until all the chips come loose, you’ll get more errors, but they won’t have anything to do with the original problem.

You have to be able to look at the failure. If it doesn’t happen every time, you have to look at it each time it fails, while ignoring the many times it doesn’t fail. The key is to capture information on every run so you can look at it after you know that it’s failed. Do this by having the system output as much information as possible while it’s running and recording this information in a “debug log” file.

By looking at captured information, you can easily compare a bad run to a good one (see Rule 5: Change One Thing at a Time). If you capture the right information, you will be able to see some difference between a good case and a failure case. Note carefully the things that happen only in the failure cases. This is what you look at when you actually start to debug.

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” —SHERLOCK HOLMES, A SCANDAL IN BOHEMIA

We lost several months chasing the wrong thing because we guessed at the failure instead of looking at it.

Actually seeing the low-level failure is crucial. If you guess at how something is failing, you often fix something that isn’t the bug. Not only does the fix not work, but it takes time and money and may even break something else. Don’t do it.

“Quit thinking and look.” I make this statement to engineers more often than any other piece of debugging advice. We sometimes tease engineers who come up with an idea that seems pretty good on the surface, but on further examination isn’t very good at all, by saying, “Well, he’s a thinker.” All engineers are thinkers. Engineers like to think.

So why do we imagine we can find the problem by thinking about it? Because we’re engineers, and thinking is easier than looking.

What we see when we note the bug is the result of the failure: I turned on the switch and the light didn’t come on. But what was the actual failure? Was it that the electricity couldn’t get through the broken switch, or that it couldn’t get through the broken bulb filament? (Or did I flip the wrong switch?) You have to look closely to see the failure in enough detail to debug it.

Many problems are easily misinterpreted if you can’t see all the way to what’s actually happening. You end up fixing something that you’ve guessed is the problem, but in fact it was something completely different that failed.

How deep should you go before you stop looking and start thinking again? The simple answer is, “Keep looking until the failure you can see has a limited number of possible causes to examine.”

Experience helps here, as does understanding your system. As you make and chase bad guesses, you’ll get a feel for how deep you have to see in a given case. You’ll know when the failure you see implicates a small enough piece of the design. And you’ll understand that the measure of a good debugger is not how soon you come up with a guess or how good your guesses are, but how few bad guesses you actually act on.

Seeing the failure in low-level detail has another advantage in dealing with intermittent bugs, which we’ve discussed before and will discuss again: Once you have this view of the failure, when you think you’ve fixed the bug, it’s easy to prove that you did fix the bug. You don’t have to rely on statistics; you can see that the error doesn’t happen anymore. When our senior engineer fixed the noise problem on our slave microprocessor, he could see that the glitch in the write pulse was gone.

In the world of electronic hardware, this means test points, test points, and more test points. Add a test connector to allow easy access to buses and important signals. These days, with programmable gate arrays and application-specific integrated circuits, the problem is often buried inside a chunk of logic that you can’t get into with external instruments, so the more signals you can bring out of the chip, the better off you’ll be.

As mentioned in “Make It Fail,” even minor changes can affect the system enough to hide the bug completely. Instrumentation is one of those changes, so after you’ve added instrumentation to a failing system, make it fail again to prove that Heisenberg isn’t biting you.

“Quit Thinking and Look” doesn’t mean that you don’t ever make any guesses about what might be wrong. Guessing is a pretty good thing, especially if you understand the system. Your guesses may even be pretty close, but you should guess only to focus the search. You still have to confirm that your guess is correct by seeing the failure before you go about trying to fix the failure.

So don’t trust your guesses too much; often they’re way off and will lead you down the wrong path. If it turns out that careful instrumentation doesn’t confirm a particular guess, then it’s time to back up and guess again. (Or reconsult your Ouija board, or throw another dart at your “bug cause” dartboard—whatever your methodology may be. I recommend the use of Rule 4: “Divide and Conquer.”)

An exception: One reason for guessing in a particular way is that some problems are more likely than others or easier to fix than others, so you check those out first. In fact, when you make a particular guess because that problem is both very likely and easy to fix, that’s the one time you should try a fix without actually seeing the details of the failure.

You can think up thousands of possible reasons for a failure. You can see only the actual cause.

  • See the failure. The senior engineer saw the real failure and was able to find the cause. The junior guys thought they knew what the failure was and fixed something that wasn’t broken.
  • See the details. Don’t stop when you hear the pump. Go down to the basement and find out which pump.
  • Build instrumentation in. Use source code debuggers, debug logs, status messages, flashing lights, and rotten egg odors.
  • Add instrumentation on. Use analyzers, scopes, meters, metal detectors, electrocardiography machines, and soap bubbles.
  • Don’t be afraid to dive in. So it’s production software. It’s broken, and you’ll have to open it up to fix it.
  • Watch out for Heisenberg. Don’t let your instruments overwhelm your system.
  • Guess only to focus the search. Go ahead and guess that the memory timing is bad, but look at it before you build a timing fixer.

After he did this with all the pins, he plugged in the lines and proved that the terminals worked. Then he unplugged everything, reassembled the box, plugged everything back in, and reconfirmed that the terminals worked. This was to accommodate Goldberg’s Corollary to Murphy’s Law, which states that reassembling any more than is absolutely necessary before testing makes it probable that you have not fixed the problem and will have to disassemble everything again, with a probability that increases in proportion to the amount of reassembly effort involved.

But the rule that our technician demonstrated particularly well with this debugging session was “Divide and Conquer.” He narrowed the search by repeatedly splitting up the search space into a good half and a bad half, then looking further into the bad half for the problem.

Narrow the search. Home in on the problem. Find the range of the target. The common technique used in any efficient target search is called successive approximation—you want to find something within a range of possibilities, so you start at one end of the range, then go halfway to the other end and see if you’re past it or not. If you’re past it, you go to one-fourth and try again. If you’re not past it, you go to three-fourths and try again. Each try, you figure out which direction the target is from where you are and move half the distance of the previous move toward the target. After a small number of attempts, you home right in on it.

When you inject known input patterns, of course, you should be careful that you don’t change the bug by setting up new conditions. If the bug is pattern dependent, putting in an artificial pattern may hide the problem. “Make It Fail” before proceeding.

So when you do figure out one of several simultaneous problems, fix it right away, before you look for the others.

I’ve often heard someone say, “Well, that’s broken, but it couldn’t possibly affect the problem we’re trying to find.” Guess what—it can, and it often does. If you fix something that you know is wrong, you get a clean look at the other issues. Our hotel technician was able to see the direction of the bad wiring on the really slow terminal only after he fixed the high resistances in both directions in the breakout box.

A corollary to the previous rule is that certain kinds of bugs are likely to cause other bugs, so you should look for and fix them first. In hardware, noisy signals cause all kinds of hard-to-find, intermittent problems. Glitches and ringing on clocks, noise on analog signals, jittery timing, and bad voltage levels need to be taken care of before you look at other problems; the other problems are often very unpredictable and go away when you fix the noise. In software, bad multithread synchronization, accidentally reentrant routines, and uninitialized variables inject that extra shot of randomness that can make your job hell.

It’s also easy to become a perfectionist and start “fixing” every instance of bad design practice you find, in the interest of general quality. You can eliminate the GOTOs in your predecessor’s code simply because you consider them nasty, but if they aren’t actually causing problems, you’re usually better off leaving them alone.

When his change didn’t fix the problem, he should have backed it out immediately.

Change one thing at a time. You’ve heard of the shotgun approach? Forget it. Get yourself a good rifle. You’ll be a lot better at fixing bugs.

If you’re working on a mortgage calculation program that seems to be messing up on occasional loans, pin down the loan amount and the term, and vary the interest rate to see that the program does the right thing. If that works, pin down the term and interest rate, and vary the loan amount. If that works, vary just the term. You’ll either find the problem in your calculation or discover something really surprising like a math error in your Pentium processor. (“But that can’t happen!”) This kind of surprise happens with the more complex bugs; that’s what makes them complex. Isolating and controlling variables is kind of like putting known data into the system: It helps you see the surprises.

Sometimes, changing the test sequence or some operating parameter makes a problem occur more regularly; this helps you see the failure and may be a great clue to what’s going on. But you should still change only one thing at a time so that you can tell exactly which parameter had the effect. And if a change doesn’t seem to have an effect, back it out right away!

What you’re looking for is never the same as last time. And it takes a fair amount of knowledge and intelligence to sift through the irrelevant differences, differences that were caused by timing or other factors. This knowledge is beyond what a beginner has, and this intelligence is beyond what software can do. (A.I. waved good-bye, remember?) The most that software can do is help you to format and filter logs so that when you apply your superior human brain (you do have a superior one, don’t you?) to analyzing the logs, the differences (and possibly the cause of the differences) will jump right out at that brain.

Sometimes the difference between a working system and a broken one is that a design change was made. When this happens, a good system starts failing. It’s very helpful to figure out which version first caused the system to fail, even if this involves going back and testing successively older versions until the failure goes away.

The point of this story is that sometimes it’s the most insignificant-seeming thing that’s actually the key to making a bug happen. What seems insignificant to the person doing the testing (the plaid shirt) may be important to the person trying to fix the problem. And what seems obvious to the tester (the chip had to be restarted) may be completely missed by the fixer. So you have to take note of everything—on the off chance that it might be important and nonobvious.

Keep an audit trail. As you investigate a problem, write down what you did, what order you did it in, and what happened as a result. Do this every time. It’s just like instrumenting the software or the hardware—you’re instrumenting the test sequence. You have to be able to see what each step was, and what the result was, to determine which step to focus on during debugging.

Unfortunately, while the value of an audit trail is often accepted, the level of detail required is not, so a lot of important information gets left out. What kind of system was running? What was the sequence of events leading up to the failure? And sometimes even, what was the actual failure? (Duh!) Sometimes the report just says, “It’s broken.” It doesn’t say that the graphics were completely garbled or that all the red areas came up green or that the third number was wrong. It just says it failed.

Tool control is critical to the accurate re-creation of a version, and you should make sure you have it. As discussed later, unrecognized tool variations can cause some very strange effects.

Never trust your memory with a detail—write it down. If you trust your memory, a number of things will happen. You’ll forget the details that you didn’t think were important at the time, and those, of course, will prove to be the critical ones. You’ll forget the details that actually weren’t important to you, but might be important to someone else working on a different problem later.

You won’t be able to transmit information to anyone else except verbally, which wastes everybody’s time, assuming you’re still around to talk about it. And you won’t be able to remember exactly how things happened and in what order and how events related to one another, all of which is crucial information.

Write it down. It’s better to write it electronically so you can make backup copies, attach it to bug reports, distribute it to others easily, and maybe even filter it with automated analysis tools later. Write down what you did and what happened as a result. Save your debug logs and traces, and annotate them with related events and effects that they don’t inherently record themselves. Write down your theories and your fixes. Write it all down.

This is more likely to happen with what I call “overhead” or “foundation” factors. Because they’re general requirements (electricity, heat, clock), they get overlooked when you debug the details.

It’s classic to say, “Hmm, this new code works just like the old code” and then find out that, in fact, you didn’t actually load the new code. You loaded the old code, or you loaded the new code but it’s still executing the old code because you didn’t reboot your computer or you left an easier-to-find copy of the old code on your system.

Speaking of starting your car, another aspect to consider is whether the start-up conditions are correct. You may have the power plugged in, but did you hit the start button? Has the graphics driver been initialized? Has the chip been reset? Have the registers been programmed correctly? Did you push the primer button on your weed whacker three times? Did you set the choke? Did you set the on/off switch to on? (I usually notice this one after six or seven fruitless pulls.)

If you depend on memory being initialized before your program runs, but you don’t do it explicitly, it’s even worse—sometimes startup conditions will be correct. But not when you demo the program to the investors.

Your bad assumptions may not be about the product you’re building, but rather about the tools you’re using to build it, as in the consultant story.

Default settings are a common problem. Building for the wrong environment is another—if you use a Macintosh compiler, you obviously can’t run the program on an Intel PC, but what about your libraries and other common code resources?

It may not be just your assumptions about the tools that are bad—the tools may have bugs, too. (Actually, even this is just your bad assumption that the tool is bug-free. It was built by engineers; why would it be any more trustworthy than what you’re building?)

“Nothing clears up a case so much as stating it to another person.” —SHERLOCK HOLMES, SILVER BLAZE

There are at least three reasons to ask for help, not counting the desire to dump the whole problem into someone else’s lap: a fresh view, expertise, and experience. And people are usually willing to help because it gives them a chance to demonstrate how clever they are.

It’s hard to see the big picture from the bottom of a rut. We’re all human. We all have our biases about everything, including where a bug is hiding. Those biases can keep us from seeing what’s really going on. Someone who comes at the problem from an unbiased (actually, differently biased) viewpoint can give us great insights and trigger new approaches. If nothing else, that person can at least tell you it looks like you’ve got a nasty problem there and offer you a shoulder to cry on.

In fact, sometimes explaining the problem to someone else gives you a fresh view, and you solve the problem yourself. Just organizing the facts forces you out of the rut you were in. I’ve even heard of a company that has a room with a mannequin in it—you go explain your problems to the mannequin first. I imagine the mannequin is quite useful and contributes to the quick solution of a number of problems. (It’s probably more interactive than some people you work with. I bet it’s also a very forgiving listener, and none of your regrettable misconceptions will show up in next year’s salary review.)

There are occasions where a part of the system is a mystery to us; rather than go to school for a year to learn about it, we can ask an expert and learn what we need to know quickly. But be sure that your expert really is an expert on the subject—if he gives you vague, buzzword-laden theories, he’s a technical charlatan and won’t be helpful. If he tells you it’ll take thirty hours to research and prepare a report, he’s a consultant and may be helpful, but at a price.

In any case, experts “Understand the System” better than we do, so they know the road map and can give us great search hints. And when we’ve found the bug, they can help us design a proper fix that won’t mess up the rest of the system.

You may not have a whole lot of experience, but there may be people around you who have seen this situation before and, given a quick description of what’s going on, can tell you exactly what’s wrong, like the dome light short in the story. Like experts, people with experience in a specific area can be hard to find, and thus expensive. It may be worth the money:

If you’re dealing with equipment or software from a third-party vendor, e-mail or call that vendor. (If nothing else, the vendor will appreciate the bug report.) Usually, you’ll be told about some common misunderstanding you’ve got; remember, the vendor has both expertise and experience with its product.

This is a good time to reiterate “Read the Manual.” In fact, this is where the advice becomes “When All Else Fails, Read the Manual Again.” Yes, you already read it because you followed the first rule faithfully. Look at it again, with your newfound focus on your particular problem—you may see or understand something that you didn’t before.

You may be afraid to ask for help; you may think it’s a sign of incompetence. On the contrary, it’s a sign of true eagerness to get the bug fixed. If you bring in the right insight, expertise, and/or experience, you get the bug fixed faster. That doesn’t reflect poorly on you; if anything, it means you chose your help wisely.

The opposite is also true: Don’t assume that you’re an idiot and the expert is a god. Sometimes experts screw up, and you can go crazy if you assume it’s your mistake.

No matter what kind of help you bring in, when you describe the problem, keep one thing in mind: Report symptoms, not theories. The reason you went to someone else for fresh insight is that your theories aren’t getting you anywhere. If you go to somebody fresh and lay a theory on her, you drag her right down into the same rut you’re in.

I’ve seen this mistake a lot; the help is brought in and immediately poisoned with the old, nonworking theories. (If the theories were any good, there’d be no need to bring in help.) In some cases, a good helper can plow through all that garbage and still get to the facts, but more often you just end up with a big crowd down in that rut.

This rule works both ways. If you’re the helper, cut off the helpee who starts to tell you theories. Cover your ears and chant “La-La-LaLa-La.” Run away. Don’t be poisoned.

If you follow the “Make It Fail” rule, you’ll know how to prove that you’ve fixed the problem. Do it! Don’t assume that the fix works; test it. No matter how obvious the problem and the fix seem to be, you can’t be sure until you test it. You might be able to charge some young sucker seventy-five bucks, but he won’t be happy about it when it still fails.

When you think you’ve fixed an engineering design, take the fix out. Make sure it’s broken again. Put the fix back in. Make sure it’s fixed again. Until you’ve cycled from fixed to broken and back to fixed again, changing only the intended fix, you haven’t proved that you fixed it.

If you didn’t fix it, it ain’t fixed. Everyone wants to believe that the bug just went away. “We can’t seem to make it fail anymore.” “It happened that couple of times, but gee, then something happened and it stopped failing.” And of course the logical conclusion is, “Maybe it won’t happen again.” Guess what? It will.

When they report the problem, it’s better to say, “Thanks! We’ve been trying for months to capture that incredibly rare occurrence; please e-mail us the log file,” than to say, “Wow, incredible. That’s never happened here.”

While we’re on the subject of design quality, I’ve often thought of ISO-9000 as a method for keeping an audit trail of the design process. In this case, the bugs you’re trying to find are in the process (neglecting vibration) rather than in the product (leaky fitting), but the audit trail works the same way. As advertised in the introduction, the methods presented in this book are truly general purpose.

“That can’t happen” is a statement made by someone who has only thought about why something can’t happen and hasn’t looked at it actually happening.

New hardware designs with intermittent errors are often the victims of noise (stray voltages on a wire that turn a 1 into a 0 or vice versa).

By introducing additional noise into the system, I was able to make it fail more often. This was a bit lucky, though. For some circuits, touching a finger is like testing that leaky window with a fire hose—the noise will cause a perfectly good circuit to fail. Sometimes the opposite will occur; the capacitance of the finger will eliminate the noise and quiet the system down.

I’ve established a Web site at http://www.debuggingrules.com that’s dedicated to the advancement of debugging skills everywhere. You should visit, if for no other reason than to download the fancy Debugging Rules poster—you can print it yourself and adorn your office wall as recommended in the book. There are also links to various other resources that you may find useful in your debugging education. And I’ll always be interested in hearing your interesting, humorous, or instructive (preferably all three) war stories; the Web site will tell you how to send them. Check it out.

You can appeal to their sense of team communication. Several of the people who reviewed drafts of the book were team leaders who consider themselves excellent debuggers, but they found that the rules crystallized what they do in terms that they could more easily communicate to their teams. They found it easy to guide their engineers by saying, “Quit Thinking and Look,” the way I’ve been doing for twenty years.
