Historical Software Accidents and Errors

We never learned about significant software accidents and errors during our education or professional careers. Sometimes, we would see an event in the news and follow it, but that was often the extent of our learning. Rarely did we find that lessons were codified and applied to our work.

Quote

My original account of the Therac-25 losses said that accidents are seldom simple. They usually involve a complex web of interacting events with multiple contributing technical, human, organizational, and regulatory factors. We aren’t learning enough today from the events nor focusing enough on preventing them. It’s time for computer science practitioners to be better educated about engineering for safety.
– Nancy Leveson, “The Therac-25: 30 Years Later”

As our world grows more dependent on software, developers everywhere need to study historical software problems, their causes, and the implications for how we build our software. We cannot afford to keep repeating the same mistakes, especially those that cause harm.

Table of Contents:

  1. Learning from Case Studies
  2. General Overviews
  3. Major Accidents and Errors
  4. Minor Accidents and Errors
  5. Security Vulnerabilities

Learning from Case Studies

If you have not worked with case studies before, you may be unsure how to draw useful lessons from them. We recommend these entries:

General Overviews

Major Accidents and Errors

The following software-related accidents and errors resulted in loss of life or significant economic impact:

  1. The Therac-25 deaths are a canonical example of software accidents. Two different errors caused multiple patients to receive massive overdoses of radiation, resulting in serious injuries or death.
  2. An Iraqi Scud missile hit a barracks in Dhahran, Saudi Arabia, after a Patriot missile defense system failed to intercept it. The accident resulted in 28 U.S. soldiers killed and 98 soldiers wounded. The failure to intercept the missile was caused by a software clock drift error that compounded with system uptime, resulting in a distance calculation error of 687 meters. Because of the drift and corresponding distance offset, the system determined that the missile was on a spurious track and did not fire.
  3. Stuxnet was a worm that targeted Siemens PLCs and was responsible for significantly damaging Iran’s nuclear program. Stuxnet caused centrifuges at the nuclear plant to spin out of control while operator screens reported nominal values, leading to system failures, asset damage, safety concerns, and a national security fiasco.
  4. Unintended acceleration, or the loss of driver control over engine power, in Toyota cars is suspected in the deaths of at least 89 people and injuries to at least 57 more (with hundreds of additional cases being settled). Toyota and the Department of Transportation historically cited “driver error” or “stuck acceleration pedals” due to floor mats as the cause. The Barr Group determined that the Electronic Throttle Control System (ETCS) source code was of “unreasonable quality” and contained bugs that could cause unintended acceleration, that the fail-safes were defective and inadequate, and that Toyota did not comply with standards, both internal and external.
  5. The Boeing 737 MAX-8 and MAX-9 aircraft were grounded after the Ethiopian Airlines and Lion Air crashes both resulted in the deaths of everyone on board. The implicated system is the Maneuvering Characteristics Augmentation System (MCAS), which is part of the flight management computer software. The MCAS was designed to correct for an increased potential to stall the plane due to mechanical design changes. When fed an Angle-of-Attack reading from a bad sensor, the MCAS triggered at an improper time, forcing the plane's nose down and overriding pilot input.
  6. On 18 March 2018, a woman was struck and killed by an Uber autonomous vehicle operating in Tempe, Arizona. The pedestrian was jaywalking at the time of the crash, the vehicle operator was distracted by her phone, and the automated collision detection and braking systems supplied by the auto manufacturer had been disabled by Uber.
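
The compounding clock drift in the Patriot entry (item 2) can be approximated with a few lines of fixed-point arithmetic. This is a minimal sketch under assumptions that come from the commonly cited account of the failure, not from the text above: the system counted uptime in tenths of a second, converted ticks to seconds by multiplying by a binary approximation of 1/10 truncated to 23 fractional bits, and had been running for roughly 100 hours. The function names and the target speed parameter are hypothetical.

```python
from fractions import Fraction

# Assumption: 1/10 is stored with 23 bits after the binary point and truncated.
FRACTION_BITS = 23


def truncated_tenth(bits=FRACTION_BITS):
    """Return 1/10 chopped (not rounded) to `bits` fractional binary digits."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def clock_error_seconds(hours_up, bits=FRACTION_BITS):
    """Accumulated timing error after `hours_up` hours of counting 0.1 s ticks."""
    ticks = hours_up * 3600 * 10  # number of tenth-of-a-second ticks
    per_tick_error = Fraction(1, 10) - truncated_tenth(bits)
    return float(ticks * per_tick_error)


def tracking_offset_meters(hours_up, target_speed_mps):
    """Distance a target travels during the accumulated timing error."""
    return clock_error_seconds(hours_up) * target_speed_mps


if __name__ == "__main__":
    for hours in (1, 8, 24, 100):
        print(hours, "h ->", round(clock_error_seconds(hours), 4), "s of drift")
```

Each tick loses less than a ten-millionth of a second, but the error grows linearly with uptime: under these assumptions the drift reaches about a third of a second after 100 hours. For a target moving at well over a kilometer per second, that corresponds to an offset of hundreds of meters, the same order of magnitude as the figure cited above.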

Minor Accidents and Errors

The following accidents and errors are “minor” because they primarily resulted in malfunctioning equipment or annoyances that did not result in loss of life or significant economic damages. Linked entries contain key lessons from these events.

  1. In 2016, Muddy Waters announced security vulnerabilities in implantable St. Jude pacemakers. These vulnerabilities could result in a crash of the device or increased battery drain. The FDA issued a “recall” (for a firmware update) in 2017. Thankfully, no related deaths have been reported.
  2. In April 2019, several systems failed to handle a known GPS date stamp rollover, even though this was the second such rollover since the protocol was developed.
  3. In 2019, many Tesla repair professionals reported that Tesla Model S and X Media Control Units were wearing out their eMMC memory. When the eMMC fails, drivers cannot use the in-vehicle display, climate control, autopilot, lighting control, or vehicle charging. This issue was caused by excessive logging to the eMMC, which has a limited number of write cycles.
  4. A long-standing bug in the Advanced Radiation Detection Capability Unit, a system used to detect nuclear explosions, was identified and resolved thanks to a developer who worked through the math by hand to identify the problem.
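
The GPS rollover in item 2 comes down to modular arithmetic on a small counter. The sketch below assumes the legacy GPS week number is a 10-bit field counted from 6 January 1980, so it wraps every 1024 weeks. It works at date granularity and ignores leap seconds. The function names are hypothetical, and the pivot-date mitigation shown is one common approach, not necessarily what any affected system used.

```python
from datetime import date, timedelta

GPS_EPOCH = date(1980, 1, 6)   # start of GPS week 0
WEEKS_PER_ERA = 1024           # assumption: 10-bit week counter


def broadcast_week(d):
    """Week number as it would appear in the 10-bit field on date `d`."""
    full_weeks = (d - GPS_EPOCH).days // 7
    return full_weeks % WEEKS_PER_ERA


def naive_date(week):
    """Buggy decode: assumes the counter has never wrapped."""
    return GPS_EPOCH + timedelta(weeks=week)


def resolve_date(week, not_before):
    """Decode using a pivot: pick the first era that is not before `not_before`.

    `not_before` is a date known to be in the past, such as a firmware
    build date (hypothetical choice for this sketch).
    """
    candidate = GPS_EPOCH + timedelta(weeks=week)
    era = timedelta(weeks=WEEKS_PER_ERA)
    while candidate < not_before:
        candidate += era
    return candidate


if __name__ == "__main__":
    week = broadcast_week(date(2019, 4, 7))
    print(week, naive_date(week), resolve_date(week, date(2015, 1, 1)))
```

A receiver that decodes the week number naively jumps back to 1980 when the counter wraps. The pivot only postpones the problem: the decoded date is correct for at most 1024 weeks after the pivot date, so the pivot has to be refreshed over the product's lifetime.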

Security Vulnerabilities

The following security vulnerabilities provide us with useful lessons.

  1. SweynTooth is a family of 12 public BLE vulnerabilities, as well as other undisclosed vulnerabilities, that result in crashes, deadlocks, and bypassing BLE secure connections. These vulnerabilities exposed flaws in vendor security testing processes and BLE certification testing.
  2. BootHole was a vulnerability in the popular GRUB2 bootloader that is widely used on Linux and Windows systems. A buffer overflow when reading the grub.cfg file enabled execution of arbitrary code, which allowed attackers to bypass Secure Boot.
