We never learned about significant software accidents and errors during our education or professional careers. Sometimes, we would see an event in the news and follow it, but that was often the extent of our learning. Rarely did we find that lessons were codified and applied to our work.
My original account of the Therac-25 losses said that accidents are seldom simple. They usually involve a complex web of interacting events with multiple contributing technical, human, organizational, and regulatory factors. We aren’t learning enough today from the events nor focusing enough on preventing them. It’s time for computer science practitioners to be better educated about engineering for safety.
– Nancy Leveson, “The Therac-25: 30 Years Later”
As our world grows more dependent on software, developers everywhere need to study historical software problems, their causes, and the implications for how we build our software. We cannot afford to keep repeating the same mistakes, especially those that cause harm.
Table of Contents:
- Learning from Case Studies
- General Overviews
- Major Accidents and Errors
- Minor Accidents and Errors
- Security Vulnerabilities
Learning from Case Studies
If you have not studied case studies before, you may be unsure how to draw useful lessons from them. We recommend starting with these entries:
General Overviews
- Killer Apps: Embedded Software’s Greatest Hit Jobs (slides), by Michael Barr, surveys some of the accidents described below.
- Key Learnings from Past Safety-Critical Failures, by Michael Barr and Dan Smith, surveys some of the accidents described below and outlines lessons learned.
- The IoT Hall-of-Shame documents security vulnerabilities and problems with internet-connected embedded systems.
- Safe Autonomy’s Computer-Based System Safety Essential Reading List provides links to a number of case studies. The primary focus is on safety-related accidents.
- Real Systems Failures discusses real-world failures observed in safety- and security-critical systems and the lessons learned. This resource draws from Honeywell engineers, Navy and Marine Corps publications, the “System Safety” e-mail list, and NASA’s “Significant Incidents and Close Calls in Human Space Flight”.
- Learning from Engineering Failures, by Steve Branam, references a number of failures and other resources.
Major Accidents and Errors
The following software-related accidents and errors resulted in loss of life or significant economic impact:
- The Therac-25 deaths are a canonical example of software accidents. Two different errors caused multiple patients to receive massive overdoses of radiation, resulting in serious injuries or death.
- An Iraqi Scud missile hit barracks in Dhahran, Saudi Arabia, after a Patriot missile defense system failed to intercept it. The strike killed 28 U.S. soldiers and wounded 98 more. The failure to intercept was caused by a compounding software clock drift error, which produced a distance calculation error of 687 meters. Because of the drift and corresponding distance offset, the system determined that the missile was on a spurious track and did not fire.
- Stuxnet was a worm that targeted Siemens PLCs and was responsible for significantly damaging Iran’s nuclear program. Stuxnet caused enrichment centrifuges to spin out of control while operator screens reported nominal values, leading to system failures, asset damage, safety concerns, and a national security fiasco.
- Unintended acceleration, or the loss of driver control over engine power, in Toyota cars is suspected in the deaths of at least 89 people and injuries to at least 57 more (with hundreds of additional cases being settled). Toyota and the Department of Transportation historically cited “driver error” or “stuck accelerator pedals” due to floor mats as the cause. The Barr Group determined that the Electronic Throttle Control System (ETCS) source code was of “unreasonable quality” and contained bugs that could cause unintended acceleration, that the fail-safes were defective and inadequate, and that Toyota did not comply with standards, both internal and external.
- The Boeing 737 MAX-8 and MAX-9 aircraft were grounded after Ethiopian Airlines and Lion Air crashes both resulted in the deaths of everyone on board. The implicated system is the Maneuvering Characteristics Augmentation System (MCAS), which is part of the flight management computer software. The MCAS was designed to correct for an increased potential to stall the plane due to mechanical design changes. When fed an Angle-of-Attack reading from a bad sensor, the MCAS triggered at an improper time, forcing the plane’s nose down and overriding pilot input.
- On 18 March 2018, a woman was struck and killed by an Uber autonomous vehicle operating in Tempe, Arizona. The pedestrian was jaywalking at the time of the crash, the vehicle operator was distracted by her phone, and the automated collision detection and braking systems supplied by the auto manufacturer had been disabled by Uber.
Minor Accidents and Errors
The following accidents and errors are “minor” because they primarily resulted in malfunctioning equipment or annoyances rather than loss of life or significant economic damage. Linked entries contain key lessons from these events.
- In 2016, Muddy Waters announced security vulnerabilities in implantable St. Jude pacemakers. These vulnerabilities could result in a crash of the device or increased battery drain. The FDA issued a “recall” (for a firmware update) in 2017. Thankfully, no related deaths have been reported.
- In April 2019, several systems failed to handle the well-known GPS week-number rollover, even though this was the second rollover to occur since the system was fielded.
- In 2019, many Tesla repair professionals reported that Tesla Model S and X Media Control Units were wearing out their eMMC memory. When the eMMC fails, drivers cannot use the in-vehicle display, climate control, autopilot, lighting control, or vehicle charging. This issue was caused by excessive logging to the eMMC, which has a limited number of write cycles.
- A long-standing bug in the Advanced Radiation Detection Capability Unit, a system used to detect nuclear explosions, was identified and resolved thanks to a developer who worked through the math by hand to identify the problem.
Security Vulnerabilities
The following security vulnerabilities provide us with useful lessons.
- SweynTooth is a family of 12 public BLE vulnerabilities, as well as other undisclosed vulnerabilities, that result in crashes, deadlocks, and bypassing of BLE secure connections. These vulnerabilities exposed flaws in vendor security testing processes and BLE certification testing.
- BootHole was a vulnerability in the popular GRUB2 bootloader, which is widely used on Linux and Windows systems. A buffer overflow when parsing the `grub.cfg` file enabled arbitrary code execution and allowed attackers to bypass Secure Boot.