System Design

Embedded Systems Architecture Resources

Updated: 20190717

After a decade spent building and shipping hardware products, I became convinced that many of the problems and schedule delays I experienced could have been avoided with a little bit of planning and thought. Repeatedly, we painted ourselves into corners with code that seemed to work well initially but caused problems months later when we finally started end-to-end system testing. Serious problems resulted in major software rewrites, changes to the technology stack, and delayed ship dates. Even worse, as I migrated from one product team to another, I noticed that we were repeating the same basic mistakes.

I started pondering this situation. Why were we dealing with major design problems and risk areas at the end of the project instead of the beginning? How could we describe ourselves as "agile" if we weren't able to quickly adapt our programs to change? Why did none of the teams I was on use anything resembling design activity before starting to build a system?

These questions led me to a deep immersion in the topics of software architecture, systems thinking, and system design. I've applied countless lessons to our internal projects, client projects, and business development efforts. Value exploration, visual modeling, and minimalistic architecture efforts have significantly improved our work quality and derisked many projects.

"Architecture" and "design" seem to be words that send programming teams running for the hills. However, I've had multiple embedded developers share their frustrations with me - the same that started me on my journey - and expressed their interest in learning more about software architecture but not knowing where to start. So, here are all the resources I've collected on software architecture. I hope they help guide you in your own journey.


Where to Start?

There's a lot of material here! You don't need to read all of it to get started with architecture.

For general architecture exposure, I recommend picking 1-2 books from this list:

If you are focused on embedded systems, I highly recommend Real-Time Software Design for Embedded Systems. This book provides a blueprint for modeling and architecting embedded systems. You will be introduced to UML and a variety of modeling approaches that you can use when architecting an embedded system.

The next step is to actually practice! There is no need for a long, drawn-out architecture stage. Allocate 2-4 weeks for value exploration and architecture efforts before starting any new project. Perform stakeholder interviews and explore the value you expect the system to provide. Then focus on answering core questions, like:

  • What qualities and behaviors are most important?
  • What requirements do they place on the design?
  • What are the biggest risk areas?
  • How can we reduce risk?
  • What are we unsure about that might change?
  • How can we make sure to support those changes without requiring a system redesign?
  • What parts of the system will we buy, license, outsource, and build in house?

Those questions will inform the architecture effort. Model the system and begin prototyping the riskiest areas. As you develop the system, you will explore and refine the system architecture.

General Software Architecture

Before diving into embedded systems specifics, it is helpful to have a solid foundation in general software architecture techniques.

We've broken down our reading recommendations into the following categories:

What is Architecture?

Before diving into the how of architecture, it's helpful to know what it is.

Why Should We Architect?

Perhaps you're not convinced that architecture is valuable. Or perhaps you need to prepare yourself to advocate for architecture efforts on your projects. These articles will give you some insights into why we architect.

The Architect Role

These articles discuss the architect role itself, particularly the qualities and skillsets that are valuable to an architect.

Architecting

We recommend the following architecture books:

These articles from around the web provide countless insights into the practice of software architecture:

Phil Koopman has a selection of lectures which are generally applicable to architecture and design:

Additionally, the slides and course notes from Hassan Gomaa are a useful introduction:

Here are talks which relate to the subject of architecture:

Techniques

Here are some practical technique guides related to the architecture process, ideation, brainstorming, and value exploration.

Documentation

Architecture work and documentation go hand in hand. Here are valuable resources that discuss architecture documentation:

Visual Architecture Process

These guides relate to Bredemeyer Consulting's Visual Architecture Process. They provide a practical blueprint for architecting your systems.

C4 Process

Simon Brown created the C4 architecture model, which focuses on four levels of architecture: context, containers, components, and code. This is another practical blueprint for architecting your system.

Embedded Systems Architecture

Even just a little exposure to software architecture will reveal how deep the rabbit hole goes. We're focused on embedded systems, so here are embedded-specific resources.

Our favorite books on the subject of embedded systems architecture are:

Hassan Gomaa, a professor at George Mason University, published course notes for two courses which discuss embedded systems architecture and modeling:

Phil Koopman published the following course notes which are useful for embedded systems architects:

Safety and Critical Systems

Here are lectures, course notes, and essays related to architecting for safety and for critical systems:

Security

Here are lectures, course notes, and essays related to architecting for security:

Systems Thinking

I would be remiss to talk about architecture without mentioning systems thinking. These two topics are intertwined: we must develop a habit of thinking about the system as a whole if we are to work at an architectural level.

Here are some of my favorite books and essays on systems thinking:

Design Patterns

Design patterns are extremely useful to learn and familiarize yourself with. These are non-obvious solutions to common scenarios and problems. For generally useful software architecture patterns, see:

Embedded systems often work well with event-driven architectures and/or state machines. For more information, see:
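
As a quick illustration of the state-machine style, here is a minimal sketch in C. It is my own example, not drawn from the references above; the motor-controller states and events are invented purely for illustration.

```c
#include <stdio.h>

/* Illustrative states and events for a hypothetical motor controller. */
typedef enum { STATE_IDLE, STATE_RUNNING, STATE_FAULT, STATE_COUNT } state_t;
typedef enum { EVENT_START, EVENT_STOP, EVENT_ERROR, EVENT_COUNT } event_t;

/* Transition table: next state indexed by [current state][event]. */
static const state_t transition[STATE_COUNT][EVENT_COUNT] = {
    /*                  START           STOP         ERROR       */
    [STATE_IDLE]    = { STATE_RUNNING,  STATE_IDLE,  STATE_FAULT },
    [STATE_RUNNING] = { STATE_RUNNING,  STATE_IDLE,  STATE_FAULT },
    [STATE_FAULT]   = { STATE_FAULT,    STATE_FAULT, STATE_FAULT },
};

static state_t current_state = STATE_IDLE;

/* Events are posted from ISRs or other modules; the dispatcher applies them. */
static void dispatch_event(event_t event)
{
    current_state = transition[current_state][event];
}

int main(void)
{
    dispatch_event(EVENT_START);
    dispatch_event(EVENT_ERROR);
    printf("state = %d\n", current_state); /* prints 2 (STATE_FAULT) */
    return 0;
}
```

Keeping the transitions in a table makes the allowed behavior easy to review and extend, which is one reason this pattern shows up so often in event-driven embedded code.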

Embedded systems are often under tight memory constraints. A useful reference for embedded developers is:

Layered or Hexagonal architectures are common abstractions that work well for embedded systems. Here are some links on both types of design:
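
To make the layered idea concrete, here is a minimal ports-and-adapters sketch in C: the application logic depends only on an abstract interface, and a board-specific driver is plugged in behind it. The sensor interface and all names here are hypothetical, invented for this example.

```c
#include <stdint.h>
#include <stdio.h>

/* Port: the abstract interface the application layer depends on. */
typedef struct {
    int32_t (*read_temperature_mC)(void); /* millidegrees Celsius */
} temperature_sensor_t;

/* Adapter: one concrete implementation. A real build would swap in a driver
 * that talks to actual hardware behind the same interface. */
static int32_t fake_sensor_read(void)
{
    return 23500; /* 23.5 C, hard-coded for the example */
}

static const temperature_sensor_t fake_sensor = {
    .read_temperature_mC = fake_sensor_read,
};

/* Application logic is written against the port, not the driver. */
static int overheated(const temperature_sensor_t *sensor)
{
    return sensor->read_temperature_mC() > 85000;
}

int main(void)
{
    printf("overheated: %d\n", overheated(&fake_sensor));
    return 0;
}
```

Because the application logic never touches a register or vendor API directly, it can be unit tested on a host machine and ported to new hardware by writing a new adapter.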

Here are design patterns related to safety and critical systems:

Here are anti-patterns to avoid:

Visual Modeling

UML

UML is frequently trashed by development teams (even those with no experience using it), but I find "UML-light" to be extremely useful for documenting and modeling my systems.

These books are wonderful resources for learning and applying UML:

Here are lectures related to UML:

As far as UML tools go, there are many options. We recommend three:

  • Visual Paradigm is our tool of choice due to its support of SysML and the ability to tweak the models to support our needs
  • StarUML is a UML modeling tool recommended to us by Grady Booch, who says he uses this tool on a regular basis
  • PlantUML is a great tool which generates UML diagrams from textual descriptions, enabling you to store UML diagrams under revision control and to include them in source-code comments

C4

If you prefer the C4 model, we recommend the following:

Who to Follow

You've already seen these names quite a bit throughout the article. I recommend keeping up with these folks:

Architecture on Embedded Artistry

We publish articles related to Architecture and Systems Thinking on this website.

Architecture Articles

Systems Thinking Articles

Books Mentioned Above

Documenting Software Architectures: Views and Beyond (2nd Edition)
By Paul Clements, Felix Bachmann, Len Bass, David Garlan, James Ivers, Reed Little, Paulo Merson, Robert Nord, Judith Stafford
Design Patterns: Elements of Reusable Object-Oriented Software
By Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides
Pattern-Oriented Software Architecture Volume 1: A System of Patterns
By Frank Buschmann, Regine Meunier, Hans Rohnert, Peter Sommerlad, Michael Stal


What can software organizations learn from the Boeing 737 MAX saga?

Updated: 20190524

One of the largest news stories over the past month was the grounding of Boeing 737 MAX-8 and MAX-9 aircraft after an Ethiopian Airlines crash resulted in the deaths of everyone on board. This is the second deadly crash involving a Boeing 737 MAX. A Lion Air Boeing 737 MAX-8 crashed in October 2018, also killing everyone on board. As a result of these two crashes, Boeing 737 MAX airplanes have been temporarily grounded in over 41 countries, including China, the US, and Canada. Boeing also paused delivery of these planes, although they are continuing to produce them.

I have been following the Boeing 737 MAX story closely. It serves as an interesting case study on software and systems engineering, human factors, corporate behavior, and customer service.

*Note: Both the Lion Air and Ethiopian Airlines crashes are still under investigation. Ultimately, everything you are reading about these crashes and that I discuss here is still in the realm of speculation. However, the situation is serious enough and well-enough understood that Boeing is addressing the problem immediately.*


Brief Background on the 737 MAX

Before diving into the suspected problem with the 737 MAX, I need to set the stage with some background information about the aircraft.

The Boeing 737 is the best-selling aircraft in the world, with over 15,000 planes sold. After Airbus announced an upgrade to the A320 that provided 14% better fuel economy per seat, Boeing responded with the 737 MAX. Boeing sold the 737 MAX as an "upgrade" to the famed 737 design, using larger engines for improved fuel efficiency (also by 14%). Boeing claimed that the 737 MAX operated and flew in the same way as the 737 NG, so pilots licensed to fly the 737 NG did not need additional training and simulator time for the 737 MAX.

Because Boeing increased the engine size to improve fuel efficiency, the engines needed to be positioned higher up on the plane's wings and slightly forward of the old position. Higher nose landing gear was also added to provide the same ground clearance as the 737NG.

The larger engines and new positions destabilized the aircraft, but not under all conditions. The engine housings were designed so they do not generate lift in normal flight. However, if the airplane is in a steep pitch (e.g., takeoff or a hard turn), the engine housings generate more lift than on previous 737 models. Depending on the angle, the airplane's inertia can cause the plane to over-swing into a stall.

To address the increased stall risk, Boeing developed a software solution: the Maneuvering Characteristics Augmentation System (MCAS). No other commercial plane uses a system like the MCAS, though Boeing uses a similar MCAS system on the KC-46 Pegasus military aircraft.

The MCAS is part of the flight management computer software. The pilot and co-pilot each have their own flight computer, but only one has control at a time. The MCAS takes readings from the angle of attack (AoA) sensor to determine how the plane's nose is pointed relative to the oncoming air. The MCAS monitors airspeed, altitude, and AoA. When the MCAS determines that the angle of attack is too great, it automatically performs two actions to prevent a stall:

  1. Command the aircraft's trim system to adjust the rear stabilizer and lower the nose
  2. Push the pilot's yoke in the down direction

The movement of the rear stabilizer varies with the speed of the plane. The stabilizer moves more at slower speeds and less at higher speeds.

By default, the MCAS is active when:

  • AoA is high (ascent, steep turn)
  • Autopilot is off
  • Flaps are up

The MCAS will deactivate once:

  • The AoA measurement is below the target threshold
  • The pilot overrides the system with a manual trim setting
  • The pilot engages the CUTOUT switch, which disables automatic control of the stabilizer trim

If the pilot overrides the MCAS with trim controls, it will activate again within five seconds after the trim switches are released if the sensors still detect an AoA over the threshold. The only way to completely disable the system is to use the CUTOUT switch and take manual trim control.
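
Pulling the descriptions above together, the activation logic might look roughly like the following C sketch. This is my own reconstruction for illustration only - it is not Boeing's code; the structure, names, and AoA threshold are assumptions, the degree figures approximate published reporting, and the airspeed cutoff is invented.

```c
#include <stdbool.h>

/* Hypothetical reconstruction of the described MCAS behavior. Not flight software. */
typedef struct {
    float aoa_deg;                    /* reading from the single AoA sensor on this side */
    float airspeed_kts;
    bool  autopilot_engaged;
    bool  flaps_up;
    bool  stab_trim_cutout;           /* CUTOUT switch disables automatic stabilizer trim */
    bool  pilot_trimming;             /* pilot is actively commanding manual trim */
    float seconds_since_trim_release;
} flight_state_t;

#define AOA_THRESHOLD_DEG  14.0f      /* made-up stall-protection threshold */
#define REACTIVATE_DELAY_S  5.0f      /* re-engages roughly 5 s after trim release */

static bool mcas_commands_nose_down(const flight_state_t *s)
{
    if (s->stab_trim_cutout || s->autopilot_engaged || !s->flaps_up) {
        return false;                 /* system is inactive in these conditions */
    }
    if (s->pilot_trimming) {
        return false;                 /* manual trim temporarily overrides the system... */
    }
    if (s->seconds_since_trim_release < REACTIVATE_DELAY_S) {
        return false;                 /* ...but it re-engages about 5 seconds after release */
    }
    /* Only the one AoA sensor is consulted; there is no cross-check. */
    return s->aoa_deg > AOA_THRESHOLD_DEG;
}

/* The stabilizer command scales with airspeed: larger movement at lower speeds. */
static float mcas_trim_increment_deg(const flight_state_t *s)
{
    return (s->airspeed_kts < 250.0f) ? 2.5f : 0.65f; /* illustrative numbers */
}
```

Note how every input to the decision is a single reading or switch; nothing in this logic asks whether the AoA value is plausible.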

Note this important point: Boeing designed the MCAS to not turn off in response to a pilot manually pulling the yoke. Doing so would defeat the original purpose of the MCAS, which is to prevent the pilot from inadvertently entering a stall angle.

I highlight this point because a natural reaction to a plane that is pitching downward is to pull on the yoke. You are applying a counter-force to correct for the unexpected motion. For normal autopilot trim or runaway manual trim, pulling on the yoke does what you expect and triggers trim hold sensors.

We are under the impression that the column, yoke, steering wheel, gas pedal, and brakes fully control the response of the mechanical system. This is an illusion. Modern aircraft, like most modern cars, are "fly-by-wire". Gone are the days of direct mechanical connections involving cables and hydraulic lines. Instead, most of the connections are purely electrical and typically mediated by a computer. In many ways we are being continually "guarded" by the computers that mediate these connections. It can be a terrible shock when the machine fights against you.

The Suspected Problem

The MCAS is suspected to have played a significant role in both crashes.

During Lion Air flight JT610, MCAS repeatedly forced the plane's nose down, even when the plane was not stalling. The pilots tried to correct by pointing the nose higher, but the system kept pushing it down again. This up-and-down oscillation happened 21 times before the crash occurred. The Ethiopian Airlines crash shows a similar pattern. The Ethiopian Airlines CEO said that they believed that the MCAS was active during the Ethiopian Airlines crash.

Image from the Lion Air crash preliminary report. Notice how the Automatic Trim (yellow line) was forcing the aircraft down, and the pilots countered by pointing it back up (light blue line above Automatic Trim).

If the plane wasn't actually stalling, or even close to a stall angle, why was MCAS engaged?

AoA sensors can be unreliable, which is a suggested factor in the Lion Air crash, where there was a 20-degree discrepancy in AoA sensor readings. The MCAS only reads the AoA sensor on its corresponding side of the plane. The MCAS reacts to the reading faithfully and does not cross-check the other sensor to confirm the reading. If a sensor goes haywire, the MCAS has no way of knowing.

If the MCAS was enabled erroneously, why did the pilots not disable the system?

This is where the situation becomes muddled. The likeliest explanation for the Lion Air pilots is that they had no idea that the MCAS existed, that it was active, or how they could disable it.

Remember, the MCAS is a unique piece of software among commercial airplanes; it only runs on the 737 MAX. Boeing sold and certified the 737 MAX as a minor upgrade to the 737 body, which would not require pilots to re-certify or spend time training in simulators. As a result, it seems that the existence of the MCAS was largely kept quiet.

“We do not like the fact that a new system was put on the aircraft and wasn’t disclosed to anyone or put in the manuals."

  • Jon Weaks, president of Southwest Airlines Pilots Association

"This is the first description you, as 737 pilots, have seen. It is not in the AA 737 Flight Manual Part 2, nor is there a description in the Boeing FCOM (flight crew operations manual). It will be soon."

  • Message to APA from Capt. Mike Michaelis

After the Lion Air crash, Boeing released a bulletin providing details on how the system worked and how to counteract it in case of malfunction. Boeing announced that the MCAS could move the stabilizer by 2.5 degrees, and that this limit applies separately to each MCAS activation. Boeing confirmed that the MCAS can move the stabilizer to its full downward position if the pilot does not counteract it with manual trim or completely cut out the system. With a limit of 2.5 degrees per activation, two uncorrected cycles of the MCAS are enough to reach the full downward position.

Boeing also said that emergency procedures that applied to earlier 737 models would have corrected the problems observed in the Lion Air crash.

The Lion Air pilots likely fought against an automated system that was working against them. The system is most likely to activate at low altitudes, such as during takeoff, leaving the pilots little time to react. Their search through the technical manuals proved unsuccessful.

The Ethiopian Airlines pilots had heard about MCAS thanks to the bulletin, although one pilot commented, "we know more about the MCAS system from the media than from Boeing". Ethiopian Airlines installed one of the first simulators for the 737 MAX, but the pilot of the doomed flight had not yet received training in the simulator. All we know at this time is that the pilot reported "flight control problems" and wanted to return to the airport and that the Ethiopian Airlines crash resembles the Lion Air crash. We must wait for the preliminary report for more details.

Compounding Factors

Based on our current knowledge, the first-level analysis leads us to believe that the MCAS system was poorly designed and caused two plane crashes.

It's not quite that simple. This is a complex situation, involving many people and organizations. Other pilots have struggled against the MCAS system and safely guided their passengers to their destination.

The following contributing factors play out time-and-again in other systems.

Poor Documentation

As I mentioned, after the Lion Air crash, pilots complained that they were not told about the MCAS or trained in how to respond when the system engages unexpectedly. This lack of documentation or training is especially dangerous when you are fighting against an automated system and your previous training does not fully apply (recall that pulling on the yoke to hold against the trim does not work against the MCAS). Even worse, Lion Air pilots attempted to find answers in their manuals before they crashed.

Pilots take their documentation extremely seriously. Below are three reports from the Aviation Safety Reporting System (ASRS), which is run by NASA to provide pilots and crews with a way to report safety issues confidentially.

The reports highlighted below focus on the insufficiency of Boeing 737 MAX documentation. I've bolded some sentences for emphasis.

ACN 1593017

Synopsis:

B737MAX Captain expressed concern that some systems such as the MCAS are not fully described in the aircraft Flight Manual.

Highlights from the narrative:

This description is not currently in the 737 Flight Manual Part 2, nor the Boeing FCOM, though it will be added to them soon. This communication highlights that an entire system is not described in our Flight Manual. This system is now the subject of an AD.

I think it is unconscionable that a manufacturer, the FAA, and the airlines would have pilots flying an airplane without adequately training, or even providing available resources and sufficient documentation to understand the highly complex systems that differentiate this aircraft from prior models. The fact that this airplane requires such jury rigging to fly is a red flag. Now we know the systems employed are error prone--even if the pilots aren't sure what those systems are, what redundancies are in place, and failure modes.

I am left to wonder: what else don't I know? The Flight Manual is inadequate and almost criminally insufficient. All airlines that operate the MAX must insist that Boeing incorporate ALL systems in their manuals.

ACN 1593021

Synopsis:

B737MAX Captain reported confusion regarding switch function and display annunciations related to "poor training and even poorer documentation".

Highlights from narrative:

This is very poorly explained. I have no idea what switch the preflight is talking about, nor do I understand even now what this switch does.

I think this entire setup needs to be thoroughly explained to pilots. How can a Captain not know what switch is meant during a preflight setup? Poor training and even poorer documentation, that is how.

It is not reassuring when a light cannot be explained or understood by the pilots, even after referencing their flight manuals. It is especially concerning when every other MAINT annunciation means something bad. I envision some delayed departures as conscientious pilots try to resolve the meaning of the MAINT annunciation and which switches are referred to in the setup.

ACN 1555013

Synopsis:

B737 MAX First Officer reported feeling unprepared for first flight in the MAX, citing inadequate training.

Highlights from narrative:

I had my first flight on the Max [to] ZZZ1. We found out we were scheduled to fly the aircraft on the way to the airport in the limo. We had a little time [to] review the essentials in the car. Otherwise we would have walked onto the plane cold.

My post flight evaluation is that we lacked the knowledge to operate the aircraft in all weather and aircraft states safely. The instrumentation is completely different - My scan was degraded, slow and labored having had no experience w/ the new ND (Navigation Display) and ADI (Attitude Director Indicator) presentations/format or functions (manipulation between the screens and systems pages were not provided in training materials. If they were, I had no recollection of that material).

We were unable to navigate to systems pages and lacked the knowledge of what systems information was available to us in the different phases of flight. Our weather radar competency was inadequate to safely navigate significant weather on that dark and stormy night. These are just a few issues that were not addressed in our training.

Even worse, it appears that the FAA's System Safety Analysis document was also incorrect:

The original Boeing document provided to the FAA included a description specifying a limit to how much the system could move the horizontal tail — a limit of 0.6 degrees, out of a physical maximum of just less than 5 degrees of nose-down movement. [...] That limit was later increased after flight tests showed that a more powerful movement of the tail was required to avert a high-speed stall, when the plane is in danger of losing lift and spiraling down.

After the Lion Air Flight 610 crash, Boeing for the first time provided to airlines details about MCAS. Boeing’s bulletin to the airlines stated that the limit of MCAS’s command was 2.5 degrees. That number was new to FAA engineers who had seen 0.6 degrees in the safety assessment.

“The FAA believed the airplane was designed to the 0.6 limit, and that’s what the foreign regulatory authorities thought, too,” said an FAA engineer. “It makes a difference in your assessment of the hazard involved.”

I understand the pilots' concern, given that the MCAS could move the tail 4x farther than stated in the official safety analysis. What else is undocumented or documented incorrectly?

Rushed Release

I would bet that all engineers are familiar with rushed releases. We cut corners, make concessions, and ignore or mask problems - all so we can release a product by a specific date. Any problems are downplayed, and those that are observed by the customer can be fixed later in a patch.

Apparently, the 737 MAX was subject to the same treatment. Here are some key highlights from the article:

  • The FAA delegates some certification and technical assessments to airplane manufacturers, citing lack of funding and resources to carry out all operations internally
    • FAA managers have final authority on what gets delegated to the manufacturer
  • Boeing was under time pressure, because development of the 737 MAX was nine months behind the new A320neo
  • FAA technical experts said in interviews that managers prodded them to speed up the process
  • An FAA safety engineer who was involved in certifying the 737 MAX was quoted as saying that, halfway through the certification process:
    • “We were asked by management to re-evaluate what would be delegated. Management thought we had retained too much at the FAA.”
    • “There was constant pressure to re-evaluate our initial decisions. And even after we had reassessed it […] there was continued discussion by management about delegating even more items down to the Boeing Company.”
    • “There wasn’t a complete and proper review of the documents. Review was rushed to reach certain certification dates.”
  • If there wasn't time for FAA staff to complete a review, FAA managers either signed off on the documents themselves or delegated the review to Boeing
  • As a result of this rushed process, a major change slipped through the process:
    • The System Safety Analysis on MCAS claims that the horizontal tail movement is limited to 0.6 degrees
    • This number was found to be insufficient for preventing a stall in worst-case scenarios
    • The number was increased 4x to 2.5 degrees
    • The FAA was never told about this change, and FAA engineers did not learn about it until Boeing released the MCAS bulletin following the Lion Air crash

The New York Times corroborates this rushed release:

  • "The pace of the work on the 737 Max was frenetic, according to current and former employees who spoke with The New York Times."
    • “The timeline was extremely compressed,” the engineer said. “It was go, go, go.”
  • "One former designer on the team working on flight controls for the Max said the group had at times produced 16 technical drawings a week, double the normal rate."
  • "Facing tight deadlines and strict budgets, managers quickly pulled workers from other departments when someone left the Max project."
  • "Roughly six months after the project’s launch, engineers were already documenting the differences between the Max and its predecessor, meaning they already had preliminary designs for the Max — a fast turnaround, according to an engineer who worked on the project."
  • "A technician who assembles wiring on the Max said that in the first months of development, rushed designers were delivering sloppy blueprints to him. He was told that the instructions for the wiring would be cleaned up later in the process, he said."
    • "His internal assembly designs for the Max, he said, still include omissions today, like not specifying which tools to use to install a certain wire, a situation that could lead to a faulty connection. Normally such blueprints include intricate instructions."
  • "Despite the intense atmosphere, current and former employees said, they felt during the project that Boeing’s internal quality checks ensured the aircraft was safe"
  • “This program was a much more intense pressure cooker than I’ve ever been in,” he added. “The company was trying to avoid costs and trying to contain the level of change. They wanted the minimum change to simplify the training differences, minimum change to reduce costs, and to get it done quickly.”

I've worked on many fast-paced engineering projects. I've observed and personally made compromises to meet deadlines, and there are many that I disagreed with. All of these points are familiar and hit home. I was quite surprised to find that the culture that builds aircraft would be so similar to the culture that builds consumer electronics.

Delayed Software Updates

Weeks after the Lion Air crash, Boeing officials told the Southwest Airlines and American Airlines pilots' unions that they planned to have software updates available around the end of 2018.

“Boeing was going to have a software fix in the next five to six weeks,” said Michael Michaelis, the top safety official at the American Airlines pilots union and a Boeing 737 captain. “We told them, ‘Yeah, it can’t drag out.’ And well, here we are.”

The FAA told The Wall Street Journal that FAA work on the new MCAS software was delayed for five weeks by the government shutdown. However, the "enhancement" was submitted to the FAA for certification on 21 January, only four days before the shutdown ended.

The official software update was announced four months later than the initial estimate. It will still take many more months to approve and deploy.

We are all conditioned to wait for fixes and updates. Teams are prone to giving idealistic estimates. Problems take longer than expected to diagnose, correct, and validate. Schedules are repeatedly overrun.

However, it's not going to comfort the families of those who lost their lives on Ethiopian Airlines Flight 302 that Boeing had submitted a software fix for certification seven weeks before the fatal crash. There is a real cost to the delay of software updates, and that cost increases significantly with the impact of the issue. It is always better to take the necessary time to implement a robust design in order to avoid needing a patch at all.

Humans Were Out of the Loop

One uncomfortable computing fact remains true: humans are superior to computers at dynamically receiving and synthesizing data.

Computers can only perform actions they were already programmed to do. A computer cannot take in additional data which it wasn't already programmed to read. The MCAS was designed to use a single data point, that of the AoA sensor on the corresponding side of the plane. The initial NTSC report on the Lion Air crash tells us that a single faulty AoA sensor triggered the MCAS.

If a pilot or co-pilot noticed a strange AoA reading (such as a 20-degree difference between the left and right AoA sensors), he or she could perform a "cross check" by glancing at the reading on the other side of the plane. Additional sensors and gauges can be read to corroborate or disprove a strange AoA reading. Hell, a pilot could even look out the window to get a sense of the plane's angle. The pilots could have a discussion and collectively determine which sensor they trusted. Our brains can take in any combination of this information and confirm/disprove a sensor reading.

What is even more troubling is that the system's behavior was opaque to the pilots. According to Boeing, the MCAS is (counter-intuitively) only active in manual flight mode, and is disabled when under autopilot. MCAS controls the trim without notifying the pilots that it is doing so.

Boeing did offer two optional features that would provide more insight into the situation:

  • An AoA indicator, which displays the sensor readings
  • An AoA disagree light, which lights up if the two AoA sensors disagree

But because these were optional, many carriers did not elect to buy them.

In a fight between an unaware human pilot and the MCAS, the MCAS has a fair chance of winning. Even if the pilot disables MCAS by setting a manual trim, MCAS would automatically kick back in if the high AoA reading was still detected. Combined with the fact that the MCAS could move the stabilizer 2.5 degrees per activation, it could continue to push the aircraft nose down until the stabilizer's force could no longer be overcome by the pilot's input.

Because of our superiority at dynamic information synthesis, humans must maintain the ability to override or overpower an automated process. At present, nothing in the world is as skilled at dealing with complexity and chaos as the human mind.

Boeing's Response

We've pointed a lot of fingers at Boeing, so let's take a moment to review what the company is doing in response.

An MCAS software update has been announced:

Boeing has developed an MCAS software update to provide additional layers of protection if the AOA sensors provide erroneous data. The software was put through hundreds of hours of analysis, laboratory testing, verification in a simulator and two test flights, including an in-flight certification test with Federal Aviation Administration (FAA) representatives on board as observers.

The following changes will be made:

  • Flight control system will now compare inputs from both AOA sensors
  • If the sensors disagree by 5.5 degrees or more with the flaps retracted, MCAS will not activate
  • An indicator on the flight deck display will alert the pilots to an AoA Disagree condition
    • This was previously a paid upgrade, but will now ship as a standard feature
  • MCAS will also be disabled (and the AoA Disagree alert displayed) if the two AoA readings differ by more than 10° for over 10 seconds during flight
  • If MCAS is activated in non-normal conditions, it will only provide one input for each elevated AOA event
    • There are no known or envisioned failure conditions where MCAS will provide multiple inputs.
  • MCAS can never command more stabilizer input than can be counteracted by the flight crew pulling back on the yoke.
    • The pilots will continue to always have the ability to override MCAS and manually control the airplane
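
Interpreting those bullet points as code, the new cross-check might look something like this sketch. It is not Boeing's implementation; the function names and structure are hypothetical, and only the thresholds come from the list above.

```c
#include <stdbool.h>
#include <math.h>

#define DISAGREE_INHIBIT_DEG  5.5f   /* disagreement at or above this inhibits MCAS */
#define DISAGREE_ALERT_DEG   10.0f   /* sustained disagreement drives the alert */
#define DISAGREE_ALERT_SECS  10.0f

/* Both AoA sensors are now compared; a disagreement of 5.5 degrees or more
 * with the flaps retracted prevents MCAS activation. */
static bool mcas_activation_permitted(float aoa_left_deg, float aoa_right_deg,
                                      bool flaps_up)
{
    const float disagreement = fabsf(aoa_left_deg - aoa_right_deg);

    if (flaps_up && (disagreement >= DISAGREE_INHIBIT_DEG)) {
        return false;
    }
    return true;
}

/* The AoA Disagree alert is raised when the readings differ by more than
 * 10 degrees for more than 10 seconds. */
static bool aoa_disagree_alert(float aoa_left_deg, float aoa_right_deg,
                               float seconds_in_disagreement)
{
    return (fabsf(aoa_left_deg - aoa_right_deg) > DISAGREE_ALERT_DEG) &&
           (seconds_in_disagreement > DISAGREE_ALERT_SECS);
}
```

The contrast with the earlier sketch is the point: a second, independent input and an explicit disagreement check now stand between a faulty sensor and the stabilizer.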

In addition to the software changes, there are extensive training changes. Pilots will have to complete 21+ days of instructor-led academics and simulator training. Computer-based training will be made available to all 737 MAX pilots, which includes the MCAS functionality, associated crew procedures, and related software changes. Pilots will also be required to review the new documents:

  • Flight Crew Operations Manual Bulletin
  • Updated Speed Trim Fail Non-Normal Checklist
  • Revised Quick Reference Handbook

Boeing and the FAA participated in an evaluation of the software and a 12 March test flight. Boeing will now work on getting the update approved for installation by the various airworthiness authorities around the world. I expect this to be a long road to approval after Boeing and the FAA destroyed their store of trust.

All of these actions seem correct to me as an engineer and systems builder. But I am crestfallen that they weren't included in the initial release.

Is This the Result of Bad Software?

It's very tempting to label the 737 MAX crashes as "caused by software." At some level, this is true. However, the MCAS appears to be a software patch applied to a larger systems problem (and a hastily assembled patch at that).

Let's walk through the chain that appears to have led us here:

  1. Fuel is expensive, and we want more efficient engines to reduce that burden
  2. Airbus was improving their aircraft, which placed pressure on Boeing to respond with their own improved platform
    1. The timeline was largely dictated by Airbus, not the time Boeing engineers needed to complete the project
  3. Boeing wanted to stick to the 737 platform for a variety of reasons:
    1. Faster time to market
    2. Lower cost for producing and certifying a new plane
    3. Pilot familiarity, leading to reduced training requirements for airlines
  4. Boeing sold the 737 MAX to airlines on the ideals of increased fuel efficiency, platform familiarity, and lower upgrade costs
  5. Bigger engines did not fit on the existing 737 platform, so modifications were needed:
    1. Move the engines forward
    2. Mount the engines higher
    3. Increase the height of the front landing gear
  6. These modifications changed the aerodynamics of the airplane, which should have changed certification requirements and required more training
  7. Instead Boeing created the MCAS to address the aerodynamic impact of the new design
  8. Boeing downplayed the MCAS system, which resulted in:
    1. Improper/insufficient certification
    2. Insufficient documentation
    3. No pilot training for handling the new 737 MAX

This is a systems engineering problem created by the company's design goals. Boeing's guiding light was to reuse the 737 platform so they could keep up with Airbus and minimize training requirements. Redesigning the airplane was entirely out of the question because it would give Airbus a significant time advantage and necessitate expensive training. To meet the design goals and avoid an expensive hardware change, Boeing created the MCAS as a software band-aid.

This scenario is quite familiar to me. As a firmware engineer, applying software workarounds for silicon or hardware design flaws is a major part of my work. Fixing hardware is "expensive" in terms of both time and money. At some point it's too late to change the hardware (or so I've been repeatedly told). The schedule drives the decision to move forward with known hardware design flaws.

The next line is predictable: "The problem will just have to be fixed in software." But software fixes do not always work. When the software workaround fails, we seem to forget that we were already attempting to hide a problem.

I am not alone in the view that this is not a "software problem". Trevor Sumner had an excellent Twitter thread where he summarized the thoughts of Dave Kammeyer. Trevor's take extends beyond the Boeing analysis and even includes non-software factors that led to the Lion Air crash (re-formatted for easier reading):

On both ill-fated flights, there was a:

  • Sensor problem. The AoA vane on the 737MAX appears to not be very reliable and gave wildly wrong readings. On #LionAir, this was compounded by a:
  • Maintenance practices problem. The previous crew had experienced the same problem and didn't record the problem in the maintenance logbook. This was compounded by a:
  • Pilot training problem. On LionAir, pilots were never even told about the MCAS, and by the time of the Ethiopian flight, there was an emergency AD issued, but no one had done sim training on this failure. This was compounded by an:
  • Economic problem. Boeing sells an option package that includes an extra AoA vane, and an AoA disagree light, which lets pilots know that this problem was happening. Both 737MAXes that crashed were delivered without this option. No 737MAX with this option has ever crashed. All of this was compounded by a:
  • Pilot expertise problem. If the pilots had correctly and quickly identified the problem and run the stab trim runaway checklist, they would not have crashed.

His closing point is austere (emphasis mine):

Nowhere in here is there a software problem. The computers & software performed their jobs according to spec without error. The specification was just shitty. Now the quickest way for Boeing to solve this mess is to call up the software guys to come up with another band-aid.

I've watched the "fix it in software" cycle play out repeatedly when developing iPhones. Should we be surprised that the same happens for an airplane too? What would prevent it, the idea of a safety culture? Can you ever be truly safe when you are optimizing for time-to-market and reduced costs?

After the resulting deaths, loss in market cap, and destruction of trust, one must wonder if Boeing will ever realize the cost and time savings they hoped the software fix would provide.

Note: We should leave open the possibility that there is a compounding software issue at play, since there are ASRS reports which indicate problems that occurred with autopilot on, a scenario where MCAS is supposed to be inactive.

Lessons We Can Apply to Our Systems

A complex system operated in an unexpected manner, and 346 people are dead as a result of two tragic and catastrophic accidents. Though those lives cannot be restored, systems and software engineers can learn as much as possible from this case so that similar deaths can be prevented in the future.

These are the lessons that I've learned from this investigation so far:

You Cannot Bend Complex Systems To Your Will

Boeing took an existing complex system and tried to change that system to force a specific outcome. Systems thinkers everywhere are cringing at this, because all changes to complex systems have unintended consequences.

Donella Meadows wrote in "Dancing with Systems":

But self-organizing, nonlinear, feedback systems are inherently unpredictable. They are not controllable. They are understandable only in the most general way. The goal of foreseeing the future exactly and preparing for it perfectly is unrealizable. The idea of making a complex system do just what you want it to do can be achieved only temporarily, at best. We can never fully understand our world, not in the way our reductionistic science has led us to expect. Our science itself, from quantum theory to the mathematics of chaos, leads us into irreducible uncertainty. For any objective other than the most trivial, we can’t optimize; we don’t even know what to optimize. We can’t keep track of everything. We can’t find a proper, sustainable relationship to nature, each other, or the institutions we create, if we try to do it from the role of omniscient conqueror.

Meadows continues:

Systems can’t be controlled, but they can be designed and redesigned. We can’t surge forward with certainty into a world of no surprises, but we can expect surprises and learn from them and even profit from them. We can’t impose our will upon a system. We can listen to what the system tells us, and discover how its properties and our values can work together to bring forth something much better than could ever be produced by our will alone.

These thoughts are echoed by Dr. Russ Ackoff in a short talk titled "Beyond Continual Improvement". The points he makes in those brief fifteen minutes kept replaying in my head while writing this essay.

A system is not the sum of the behavior of its parts, it is a product of their interactions. The performance of a system depends on how the parts fit, not how they act taken separately.

Boeing changed a few individual parts of the plane and expected the overall performance to be improved. But the effect on the overall system was more complex than the changes led them to expect.

When you get rid of something you don’t want (remove a defect), you are not guaranteed to have it replaced with what you do want.

We are all familiar with the experience of fixing a bug, only to have a new bug (or several) appear as a result of our fix.

Finding and removing defects is not a way to improve the overall quality or performance of a system.

The larger engines on the 737 airframe resulted in undesirable flight characteristics (excessive upward pitch at steep AoA). Boeing responded by attempting to address this defect with the MCAS. It's clear that the MCAS does not unilaterally improve the overall quality or performance of the aircraft.

What aspects of your system are you trying to force? Perhaps you can broaden your perspective and look at different approaches. The answer will reveal itself if you listen, though you might have to head in a different direction than you originally intended.

Where You are Aiming is the Most Important Thing

There is an idea that I've been holding in the forefront of my mind: nothing has more of an impact on where you will eventually end up than where you are aiming. Setting the right aim is the most important thing.

It seems to me that Boeing's aim was to keep up with Airbus, leading to an aggressive time-to-market. They also wanted to minimize changes to ease certification and ensure that pilots did not need to receive new training. Those are the principles that appear to have guided their actions. Safety was still a concern, but that is not what the organization, system, or schedule focused on.

Dr. Ackoff echoes this idea in "Beyond Continual Improvement":

Basic principle: an improvement program must be directed at what you want, not at what you don’t want

At one level, we can say that Boeing wanted a new aircraft with improved fuel efficiency to compete with Airbus.

At another level, what Boeing wanted was to design a new aircraft with improved fuel efficiency, but in such a way as to not require a new airframe design, to not require a timeline that delayed them significantly with regards to the Airbus launch, and to not require pilots to receive training on the new airplane.

Boeing seems to have focused heavily on the things they did not want out of the improved design.

If you stick to the base level of desire (wanting a new aircraft with improved fuel efficiency), it seems that the system needed to be largely redesigned with a new airframe to support larger engines.

Your company’s aim is a truly powerful force. Your organization is headed in only that direction.

Ask yourselves often: is it the proper aim?

Treat Documentation as a First Class Citizen

If other people will use your product, you need to treat documentation as a first class citizen. Useful, comprehensive documentation and training are extremely important to your users and to the engineers and managers who come after you.

Pilots are fanatical about their documentation, as well they should be. There is clear and documented outrage that details were kept from them.

In this case, improved documentation would have led to better understanding of the system forces at work. Improved documentation alone could have potentially saved hundreds of lives.

We try to hold back because we think our users don't need (or can't handle) the details:

One high-ranking Boeing official said the company had decided against disclosing more details to cockpit crews due to concerns about inundating average pilots with too much information - and significantly more technical data - than they needed or could digest.

Software teams often take this view of their users. Perhaps it is simply a rationalization for not wanting to put the effort into creating and maintaining documentation. How can we predict what information people need to know? What is too technical, and what is enough information? Won't the details change as the system evolves? How will we keep it maintained?

When we leave out documentation or fudge the explanations of how things work, we hinder our users. What could your users accomplish with your system if they had a full understanding of how it worked? I guarantee they can handle and achieve much more than you expect.

Software teams also hinder themselves when they neglect documentation. When we document, we are acting as explorers, mapping uncharted territory. New team members can learn how the system is designed. Ideas for simplification will jump out at you. You'll start thinking about novel ways to use your software and the edge cases that will be encountered. Poorly understood system aspects are suddenly obvious - "here be dragons".

It's a popular adage: if you can't explain something in simple terms, you don't understand it. And if you don't explain something, nobody else has a chance of understanding it.

Keep Humans in the Loop

I stated earlier that humans must maintain the ability to override or overpower an automated process. Because of our superiority at dynamic information collection and synthesis, we can improvise and make novel decisions in response to new situations. A computer, which has been preprogrammed to read from a limited amount of information and perform a set of specific responses, is not (yet) capable of improvising.

“What we have here is a ‘failure of the intended function,’ going back to your recent piece [on SOTIF — Safety of the Intended Functionality],” Barnden said. “A plane shouldn’t fight the pilot and fly into the ground. This is happening after decades of R&D into aviation automation, cockpit design and human factors research in planes.”

System designers and programmers are not all-knowing. Make sure that humans are kept in the loop - let them override your automated processes. Perhaps they know better after all.
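
As a tiny sketch of the principle (my own illustration, not derived from any avionics code): compute the automated command, but let an explicit human input always win.

```c
#include <stdbool.h>

/* Hypothetical actuator command arbitration: the automation may propose a
 * command, but an explicit human input always takes precedence. */
typedef struct {
    bool  human_input_present;
    float human_command;
    float automated_command;
} command_inputs_t;

static float select_command(const command_inputs_t *in)
{
    if (in->human_input_present) {
        return in->human_command;   /* the operator can always override */
    }
    return in->automated_command;
}
```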

Testing Doesn't Mean You Are Safe

Phil Koopman recently wrote about a concept he calls The Insufficient Testing Pitfall:

Testing less than the target failure rate doesn't prove you are safe. In fact you probably need to test for about 10x the target failure rate to be reasonably sure you've met it. For life critical systems this means too much testing to be feasible.
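
To put rough numbers on that (a back-of-the-envelope calculation of my own, not Koopman's): certification guidance for commercial aircraft puts catastrophic failure conditions at no more than about 10^-9 per flight hour. Demonstrating that rate by testing alone would take on the order of 10^9 failure-free hours, and applying the "10x" rule of thumb above, roughly 10^10 hours - more than a million years of continuous flight.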

No doubt about it: the airplane and software were tested. Probably significantly. Certainly in simulators and in test flights. But it seems that Boeing did not test the system enough to encounter these problems. And even if they did - what other problems would still be missed?

We need a plan for proving that our software works safely. Testing is not enough.

Could This Happen in Your Organization?

It's easy for us to read about the Boeing 737 MAX saga, or other similar human-caused disasters, and think that we would never have walked down the path that led there. I implore you to have sympathy and understanding. Humans committed those actions. You are also human. You (and the organizations you are a part of) are capable of the same actions, for the same reasons. Keep the possibility of catastrophe in mind when you are tempted to let standards slide.

All of this is familiar to me as an engineer. I've worked on many fast-paced engineering projects. I've observed and personally made compromises to meet deadlines: some I proposed myself, and others that I disagreed with. I've seen these compromises work out, and I've seen them fail spectacularly. I got lucky. I don't work on safety critical software, and I have never watched people die at the hands of my systems. I have deep sympathy for the engineers who will be forever plagued by their creation.

After the Lion Air crash, Boeing offered trauma counseling to engineers who had worked on the plane. “People in my group are devastated by this,” said Mr. Renzelmann, the former Boeing technical engineer. “It’s a heavy burden.”

We must also remember that nobody at Boeing wanted to trade human lives for increased profits. All human organizations - families, companies, industries, governments - are complex systems and have a life of their own. The organization can make and execute a decision which none of the participants truly want, such as shipping a compromised product or prioritizing profits over safety.

What I see with Boeing is an organization that made the same kind of decisions that I regularly see made at every organization I've been a part of. And like all of those other organizations, they did not escape the consequences of their decisions. The difference for Boeing is that they were playing for bigger stakes, and the result of their misplaced bet is more painful.

There was no villainous CEO who forced his minions to compromise the product. There was not an entire organization whose individuals decided to collectively disregard safety. The organization rallied around the goals of time-to-market and minimizing required pilot training. Momentum and inertia kept the company marching toward their aim, even if individuals disagreed. And perhaps nobody explicitly noticed that safety was de-prioritized as a result.

I want to repeat this: Boeing made the same decisions that are being made everywhere else.

We all have a duty to aim higher.

Further Reading

For more on the Boeing 737 MAX Saga:

Commentary on the situation:

Thoughts on Autonomy and Safety:

Acknowledgments

Our creations are never the result of a single mind.

I want to thank Rozi Harris and Stephen Smith for reviewing early drafts of this essay. Their feedback, conversation, and exploration of the topics at hand have been extremely helpful. Many of their discussion points were incorporated into the essay.

Thanks to Nicole Radziwil for reviewing the article and making edits and corrections.

Thank you to the hard-working journalists and aviation fanatics who have published brilliant coverage and analysis of the 737 MAX saga. I know only a fraction of what others know about the problems discussed herein.

I also want to thank all of my colleagues who stood beside me over the years. It takes a monumental effort to build something new, and it rarely works out. We should all be amazed at our combined human triumph.

The lessons I present are hard-won, collectively generated, and the result of long debates. I hope the next generation of creators can use them to move beyond our current capabilities.

Hypotheses on Systems and Complexity

A famous John Gall quote from Systemantics became known as Gall's Law. The law states:

A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with a working simple system.

I've always felt the truth of this idea. Gall's Law inspired me to think about the evolution of complexity in systems from different perspectives. I've developed five hypotheses in this area:

  1. A simple system that works (and is maintained) will inevitably grow into a complex system.
  2. The tendency of the Universal System is a continual increase in complexity.
  3. A simple system must increase in complexity or it is doomed to obsolescence and/or failure.
  4. A system's complexity level starts at the complexity of the local system/environment in which it participates.
  5. A working system will eventually collapse due to unmanageable complexity.

I call these ideas "hypotheses" because they are born of late-night thoughts while watching my newborn child. They have not been put through sufficient research or testing for me to call them "axioms", "laws", or "rules of thumb". These ideas may already exist in the systems canon, but I have not yet encountered them.

The Hypotheses in Detail

Let's look at each of these hypotheses in turn, then we can discuss their implications for our projects.

Hypothesis 1: Simple Systems Become Complex

My first hypothesis is fully stated as follows:

A simple system that works (and is maintained) will inevitably grow into a complex system.

This is a restatement of Gall's Law from a different perspective. I believe that a working simple system is destined to become more complex.

This hypothesis is opposed to another systems maxim (quoted from Of Men and Laws):

A working system (and by happy accident, systems sometimes work) should be left alone.

Unfortunately, this recommendation is untenable for two reasons:

  1. Human beings are not disciplined enough to leave a working system alone.
  2. If a working system is not maintained, it will inevitably become obsolete according to Hypothesis 3.

Humans are the ultimate tinkerers. We are never satisfied with the status quo. We have the tendency to expand or modify a system's features and behaviors once we consider it to be "working" (and even if it's not working). Our working systems are destined to increase in complexity thanks to our endless hunger.

Hypothesis 2: Universal Complexity is Always Increasing

My second hypothesis is fully stated as follows:

The tendency of the Universal System is a continual increase in complexity.

At its core, I believe that Hypothesis 2 is simply a restatement of the Second Law of Thermodynamics, but I include it for use with other hypotheses below.

The Second Law of Thermodynamics states that the total entropy of an isolated system can never decrease over time. Thanks to the Second Law of Thermodynamics, all processes in the universe trigger an irreversible increase in the total entropy of a system and its surroundings.

Rudolf Clausius provides us with another perspective on the Second Law of Thermodynamics:

[...] we may express in the following manner the fundamental laws of the universe which correspond to the two fundamental theorems of the mechanical theory of heat.

  1. The energy of the universe is constant.
  2. The entropy of the universe tends to a maximum.

I have an inkling that complexity and entropy are closely related concepts, if not actually the same. As such, I assume that the complexity of the Universal System will increase over time.

The reason that I think complexity increases over time is that I can observe this hypothesis in other sciences and directly in the world around me:

  • After the big bang, simple hydrogen coalesced into stars (and planets and solar systems and galaxies), forming increasingly complex elements as time progressed
  • Life progressed from simple single-celled organisms to complex networked species consisting of hundreds of sub-systems
  • Giving birth progressed from a natural, body-driven affair to a complex ritual carried out by a large team of experts at great cost in specialized locations (i.e., hospitals)
  • Finance has progressed from exchanging metal coins and shells to a complex, automated, digitized, international system of rules and cooperating systems

Corollary: Complexity must be preserved

The idea exists that complexity can be reduced:

An evolving system increases its complexity unless work is done to reduce it.
-- Meir Lehman

Or:

Ongoing development is the main source of program growth, but programs are also entropic. As they age, they tend to become more cluttered. They get larger and more complicated unless pressure is applied to make them simpler.
-- Jerry Fitzpatrick

Because of the Second Law of Thermodynamics, we cannot reverse complexity. We are stuck with the existing environment, requirements, behaviors, expectations, customers, resources, etc.

Energy must be invested to perform any "simplification" work, which means that there is a complexity-entropy increase in some part of the system. Perhaps you successfully "simplified" your product's hardware design so that it's easier to assemble in the factory. What other sub-systems saw increased complexity as a result: supply chain, tooling design, engineering effort, mechanical design, repairability?

Complexity must be preserved - we only move it around within the system.

Hypothesis 3: Simple Systems Must Evolve

Hypotheses 1 and 2 combine into a third hypothesis:

A simple system must increase in complexity or it is doomed to obsoletion and/or failure.

The systems we create are not isolated; they are always interconnected with other systems. And as one of John Gall's "Fundamental Postulates of General Systemantics" states, "Everything is part of a larger system."

The Universal System is always increasing in complexity-entropy, as are all subsystems by extension. Because of the ceaseless march toward increased complexity, systems are forced to adapt to changes in the complexity of the surrounding systems and environment. Any system which does not evolve will eventually be unable to cope with the new level of complexity and will implode.

The phenomenon of "code rot" demonstrates this point:

Software rot, also known as code rot, bit rot, software erosion, software decay or software entropy is either a slow deterioration of software performance over time or its diminishing responsiveness that will eventually lead to software becoming faulty, unusable, or otherwise called "legacy" and in need of upgrade. This is not a physical phenomenon: the software does not actually decay, but rather suffers from a lack of being responsive and updated with respect to the changing environment in which it resides.

I've seen it happen often enough on my own personal projects. You can take a working, error-free software project, put it into storage, pull it out years later, and it will no longer compile and run. This could be for any number of reasons: the language changed, the compiler is no longer available, the libraries or tooling needed to build and use the software are no longer available, the underlying processor architectures have changed, etc.
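As a small, hypothetical illustration (the snippet and its function names are mine, not taken from any real project), consider C code that compiled cleanly decades ago but is rejected or flagged by a modern toolchain:

```c
/* legacy.c -- written against C89; nothing in this file has changed,
 * but the environment around it has moved on. */
#include <stdio.h>

/* Implicit "int" return type: legal in C89, removed in C99, and an
 * error by default in recent compilers. The old-style (K&R) parameter
 * declaration below was removed from the language in C23. */
read_line(buffer)
char *buffer;
{
    /* gets() was removed from the standard library in C11 because it
     * cannot be used safely; modern C libraries may not declare it. */
    gets(buffer);
    return 0;
}

int main(void)
{
    char line[64];
    read_line(line);
    printf("You typed: %s\n", line);
    return 0;
}
```

The program itself never changed, yet it has "rotted": the language standard, the compiler defaults, and the standard library all evolved around it.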

Our "simple" systems will never truly remain so. They must be continually updated to remain relevant.

Hypothesis 4: "Simple" is Determined by Local Complexity

Hypothesis 2 drives the fourth hypothesis:

A system's complexity level starts at the complexity of the local system/environment in which it participates.

Stated another way:

A system cannot have lower complexity than the local system in which it will participate.

Hypothesis 2 implies that a lower bound on complexity exists, both locally and universally. In practice, your system has to play by the rules of the other systems it interacts with. The more external systems your system must interact with, the more complex its starting point.

We can see this by looking at the world around us. Consider payment processing as an example: you can't start over with a "simple" payment application, because the global system is too complex and has too many specific requirements. There are banking regulations, credit card regulations, security protocols, communication protocols, authentication protocols, etc. Your payment processor must work with the existing banking ecosystem.

Now, you could ignore these requirements and create a new payment system altogether (e.g., Bitcoin), but then you are not actually participating in the same local system (international banking). Even so, the Universal System's complexity is higher than your system's local complexity, and other players know the game. You can skip the authentication requirements or other onerous burdens, but external actors can still take advantage of your system (e.g., Bitcoin thefts, price manipulation, lost keys leading to unclaimable money).

Once complexity has developed, we are stuck with it. We can never return to simplicity. I can imagine a time when the Universal System's complexity level will be so high that humans will no longer have the capacity to create or manage any systems.

Hypothesis 5: Working Systems Eventually Collapse

Hypothesis 5 is fully stated as follows:

A working system will eventually collapse due to unmanageable complexity.

Complexity is always increasing, and there is nothing we can do to stop it. There are two complexity-related failure modes for our system:

  1. Our system becomes so complex that we can no longer maintain it (there are no humans who can understand and master the system)
  2. Our system cannot adapt fast enough to keep up with the local/universal system's increases in complexity

While we cannot forever prevent the collapse of our system, we can impact the timeframe through system design and complexity management efforts. We can strive to reduce the rate of complexity increase to a minimal amount. However, as the complexity of the system increases, the effort required to sustain the system also increases. As time goes on, our systems require more energy to be spent on documentation, hiring, training, refactoring, and maintenance.

We can see systems all around us which become too complex to truly understand (e.g., the stock market). Unfortunately, Western governments seem to be reaching a complexity breaking point, as they have become so complex they can't enact policy. To quote Matt Levine's Money Stuff newsletter:

What if your model is that democratic political governance has just stopped working—not because you disagree with the particular policies that particular elected governments are carrying out, but because you have started to notice that elected governments in large developed nations are increasingly unable to carry out any policies at all?

Perhaps unmanageable complexity doomed the collapsed civilizations that preceded us. Given that thought, what is the human race's limit on complexity management? We've certainly extended our ability to handle complexity through the development of computers and algorithms, but there will come a time when the complexity is too much for us to handle.

Harnessing These Ideas

These five hypotheses are one master hypothesis broken into different facets which we can analyze. The overall hypothesis is:

The Second Law of Thermodynamics tells us that our systems are predestined to increase in complexity until they fail, become too complex to manage, or are made obsolete. We can manage the rate of increase of complexity, but never reverse it.

The hypotheses described herein do not contradict the idea that our systems should be kept as simple as possible. Simplicity is still an essential goal. However, we must realize that the increase in complexity is inevitable and irreversible. We must actively work to prevent complexity from increasing faster than we can manage it.

Here are some key implications of these ideas for system builders:

  • If your system isn’t continually evolving and increasing in complexity, it will collapse
  • You can extend the lifetime of your system by investing energy to manage system complexity
  • You can extend the lifetime of your system by continually introducing and developing new acolytes who understand and can maintain your system
    • This enables collective management of complexity and transfer of knowledge about the system
  • You can extend the lifetime of your system by giving others the keys to understanding your system (documentation, training)
    • This enables others to come to terms with the complexity of your system
  • You can never return to "simplicity" - don't consider a "total rewrite" effort unless you are prepared to scrap the entire system and begin again
  • These hypotheses speak to why documentation becomes such a large burden
    • Documentation becomes part of the overall system's complexity, requiring a continual increase in resources devoted to managing it

Developing a skillset in Complexity Management is essential for system designers and maintainers.

Further Reading

Related Articles

Related Books