18 August 2020 by Phillip Johnston • Last updated 15 August 2023
Stuxnet was a virus that targeted Siemens PLCs and was responsible for significantly damaging Iran’s nuclear program. The virus was first uncovered in 2010, but is thought to have been in development for at least five years prior to that. Stuxnet caused centrifuges at the nuclear plant to spin out of control while operator screens reported nominal values, leading to system failures, asset damage, safety concerns, and a national security fiasco. This virus is particularly remarkable because it attacked systems that were not connected to any networks with access …
Continue reading “Case Study: StuxNet”
18 August 2020 by Phillip Johnston • Last updated 15 August 2023
465,000 U.S. patients were told to visit a clinic to receive a firmware update for their St. Jude pacemakers. The firmware contains a security flaw which allows hackers within radio range to take control of a vulnerable pacemaker. Identified attacks include “crash attacks”, which involve broadcasting a combination of signals that places cardiac devices into a state of malfunction, and “battery drain attacks”, which generate signals from the Merlin@home device to run down a cardiac device’s battery at a “greatly accelerated rate”. While St. Jude and other parties, …
Continue reading “Case Study: St. Jude Pacemaker Recall”
18 August 2020 by Phillip Johnston • Last updated 10 June 2021
The Boeing 737 MAX-8 and MAX-9 aircraft were grounded after Ethiopian Airlines and Lion Air crashes both resulted in the deaths of everyone on board. The implicated system is the Maneuvering Characteristics Augmentation System (MCAS), which is part of the flight management computer software. The MCAS was designed to correct for an increased potential to stall the plane due to mechanical design changes. When fed an Angle-of-Attack reading from a bad sensor, the MCAS triggered at an improper time, forcing the plane nose-down and overriding pilot input. The …
Continue reading “Case Study: Boeing 737 MAX Crashes”
18 August 2020 by Phillip Johnston • Last updated 13 September 2022
Unintended acceleration, or the loss of driver control over engine power, in Toyota cars is suspected in the deaths of at least 89 people and injuries to at least 57 more (with hundreds of additional cases being settled). Toyota, the Department of Transportation, the U.S. National Highway Traffic Safety Administration, and journalists cited “driver error” or “stuck acceleration pedals” due to floor mats as the primary cause. The official finding of the joint NHTSA and NASA investigation confirmed this opinion. NASA’s team did find one theoretical way for Toyota’s …
Continue reading “Case Study: Toyota Unintended Acceleration”
18 August 2020 by Phillip Johnston • Last updated 15 August 2023
An Iraqi Scud missile hit barracks in Dhahran, Saudi Arabia, after a Patriot missile defense system failed to intercept the missile. The accident resulted in 28 U.S. soldiers killed and 98 soldiers wounded. The failure to intercept the missile was caused by a compounding software clock drift error resulting in a distance calculation error of 687 meters. Because of the drift and corresponding distance offset, the system determined that the missile was on a spurious track and did not fire. Further Information For more on the event and the …
Continue reading “Case Study: Patriot Missile Failure at Dhahran”
18 August 2020 by Phillip Johnston • Last updated 13 September 2022
The Therac-25 deaths are a canonical example of software accidents. Two different errors caused multiple patients to receive massive overdoses of radiation, resulting in serious injuries or death. Case Studies: This is a well-documented accident, so we will refer you to the following sources for understanding what went wrong: the summary video by Phil Koopman; Wikipedia: Therac-25; “An Investigation of the Therac-25 Accidents” (IEEE), by Nancy Leveson and Clark Turner, one of the original investigatory articles published on the topic; and “Medical Devices: The Therac-25”, by Nancy Leveson, an updated and …
Continue reading “Case Study: Therac-25”
29 April 2020 by Phillip Johnston • Last updated 15 August 2023
The Power of Ten is a popular set of coding “rules” for writing safety-critical software that originally appeared in IEEE Computer in June 2006. These rules have been floating around for a while, and the odds are good that you’ve heard someone mention them. Even though they’re nominally related to safety-critical software, they are excellent rules to follow for general embedded systems development. Here are the 10 rules:
1. Restrict to simple control flow constructs.
2. Give all loops a fixed upper bound.
3. Do not use dynamic memory allocation after …
Continue reading “Coding Standard: Power of Ten”
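Two of the rules listed above — fixed loop bounds and no dynamic allocation after initialization — can be sketched in a few lines of C. The buffer and function names below are hypothetical illustrations, not code from the article or from Holzmann's paper:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SAMPLES 64U /* Rule: give all loops a fixed upper bound */

/* Rule: no dynamic memory allocation after initialization.
 * The buffer is statically allocated at a fixed, known size. */
static int32_t g_samples[MAX_SAMPLES];

/* Simple control flow: no goto, no recursion. Inputs are validated
 * rather than trusted. */
int32_t sample_sum(size_t count)
{
    if (count > MAX_SAMPLES)
    {
        count = MAX_SAMPLES; /* clamp instead of reading out of bounds */
    }

    int32_t sum = 0;

    for (size_t i = 0; i < count; i++) /* bounded by MAX_SAMPLES */
    {
        sum += g_samples[i];
    }

    return sum;
}
```

Because the loop bound and buffer size are compile-time constants, a static analyzer can prove the loop terminates and the array access is in range — which is the point of the rules.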
27 January 2020 by Phillip Johnston • Last updated 15 August 2023
On 18 March 2018, a woman was struck and killed by an Uber autonomous vehicle operating in Tempe, Arizona. The pedestrian was jaywalking at the time of the crash, the vehicle operator was distracted by her phone, and automated collision detection and braking systems supplied by the auto manufacturer were disabled by Uber. This case study is part of our analysis of historical software accidents and errors. To make the situation more disheartening, the self-driving software identified the pedestrian “object” 5.6 seconds before impact. Because the system did not …
Continue reading “Case Study: Uber ATG Crash in Tempe, Arizona”
6 November 2019 by Phillip Johnston • Last updated 20 September 2022
We never learned about significant software accidents and errors during our education or professional careers. Sometimes, we would see an event in the news and follow it, but that was often the extent of our learning. Rarely did we find that lessons were codified and applied to our work.
My original account of the Therac-25 losses said that accidents are seldom simple. They usually involve a complex web of interacting events with multiple contributing technical, human, organizational, and regulatory factors. We aren’t learning enough today from the events nor focusing enough on preventing them. It’s time for computer science practitioners to be better educated about engineering for safety.
– Nancy Leveson, “The Therac-25: 30 Years Later”
As our world grows more dependent on software, developers everywhere need to study historical software problems, their causes, and the implications for how we build our software. We cannot afford to keep repeating the same mistakes, especially those that cause harm.
Table of Contents:
- Learning from Case Studies
- General Overviews
- Major Accidents and Errors
- Minor Accidents and Errors
- Security Vulnerabilities
Learning from Case Studies
If you have not studied case studies before, you might be stumped at how to draw useful lessons from them. We recommend these entries:
General Overviews
Major Accidents and Errors
The following software-related accidents and errors resulted in loss of life or significant economic impact:
- The Therac-25 deaths are a canonical example of software accidents. Two different errors caused multiple patients to receive massive overdoses of radiation, resulting in serious injuries or death.
- An Iraqi Scud missile hit barracks in Dhahran, Saudi Arabia, after a Patriot missile defense system failed to intercept the missile. The accident resulted in 28 U.S. soldiers killed and 98 soldiers wounded. The failure to intercept the missile was caused by a compounding software clock drift error resulting in a distance calculation error of 687 meters. Because of the drift and corresponding distance offset, the system determined that the missile was on a spurious track and did not fire.
- Stuxnet was a virus that targeted Siemens PLCs and was responsible for significantly damaging Iran’s nuclear program. Stuxnet caused centrifuges at the nuclear plant to spin out of control while operator screens reported nominal values, leading to systems failures, asset damage, safety concerns, and a national security fiasco.
- Unintended acceleration, or the loss of driver control over engine power, in Toyota cars is suspected in the deaths of at least 89 people and injuries to at least 57 more (with hundreds of additional cases being settled). Toyota and the Department of Transportation historically cited “driver error” or “stuck acceleration pedals” due to floor mats as the cause. The Barr Group determined that the Electronic Throttle Control System (ETCS) source code was of “unreasonable quality” and contained bugs that could cause unintended acceleration, that the fail-safes were defective and inadequate, and that Toyota did not comply with standards, both internal and external.
- The Boeing 737 MAX-8 and MAX-9 aircraft were grounded after Ethiopian Airlines and Lion Air crashes both resulted in the deaths of everyone on board. The implicated system is the Maneuvering Characteristics Augmentation System (MCAS), which is part of the flight management computer software. The MCAS was designed to correct for an increased potential to stall the plane due to mechanical design changes. When fed an Angle-of-Attack reading from a bad sensor, the MCAS triggered at an improper time, forcing the plane nose-down and overriding pilot input.
- On 18 March 2018, a woman was struck and killed by an Uber autonomous vehicle operating in Tempe, Arizona. The pedestrian was jaywalking at the time of the crash, the vehicle operator was distracted by her phone, and automated collision detection and braking systems supplied by the auto manufacturer were disabled by Uber.
Minor Accidents and Errors
The following accidents and errors are “minor” because they primarily resulted in malfunctioning equipment or annoyances that did not result in loss of life or significant economic damages. Linked entries contain key lessons from these events.
- In 2016, Muddy Waters announced security vulnerabilities in implantable St. Jude pacemakers. These vulnerabilities could result in a crash of the device or increased battery drain. The FDA issued a “recall” (for a firmware update) in 2017. Thankfully, no related deaths have been reported.
- In April 2019, several systems failed to handle a known GPS week-number rollover, even though this was the second such rollover since the protocol was developed.
- In 2019, many Tesla repair professionals reported that Tesla Model S and X Media Control Units were wearing out their eMMC memory. When the eMMC fails, drivers cannot use the in-vehicle display, climate control, autopilot, lighting control, or vehicle charging. This issue was caused by excessive logging to the eMMC, which has a limited number of write cycles.
- A long-standing bug in the Advanced Radiation Detection Capability Unit, a system used to detect nuclear explosions, was identified and resolved thanks to a developer who worked through the math by hand to identify the problem.
Security Vulnerabilities
The following security vulnerabilities provide us with useful lessons.
- SweynTooth is a family of 12 public BLE vulnerabilities, as well as other undisclosed vulnerabilities, that result in crashes, deadlocks, and bypassed BLE secure connections. These vulnerabilities exposed flaws in vendor security testing processes and BLE certification testing.
- BootHole was a vulnerability in the popular GRUB2 bootloader, which is widely used on Linux and Windows systems. A buffer overflow when reading the grub.cfg file enabled execution of arbitrary code and allowed attackers to bypass Secure Boot.
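The defect class behind BootHole — an unchecked copy of a config token into a fixed-size buffer — is worth seeing in miniature. This is an illustrative sketch of the bug pattern and one mitigation, not GRUB2's actual parsing code; all names are hypothetical:

```c
#include <string.h>

#define TOKEN_BUF_SIZE 32U

/* Vulnerable pattern: an unchecked copy into a fixed-size buffer.
 * A token of TOKEN_BUF_SIZE bytes or more overflows `buf`. */
void parse_token_unsafe(const char *token, char *buf)
{
    strcpy(buf, token); /* no length check -- classic overflow */
}

/* Mitigated pattern: reject oversized input instead of overflowing
 * or silently truncating. */
int parse_token_safe(const char *token, char *buf)
{
    if (strlen(token) >= TOKEN_BUF_SIZE)
    {
        return -1; /* oversized token: fail loudly */
    }

    strcpy(buf, token); /* safe: length verified above */
    return 0;
}
```

The trust boundary is the key lesson: grub.cfg is attacker-writable on many systems even when the bootloader binary itself is signed, so config parsing must be as defensive as network input handling.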
31 October 2019 by Phillip Johnston
A safety plan is a cornerstone of safety-critical embedded device development. The safety plan outlines the relevant safety standard(s), identifies hazards and risks, enumerates safety goals, and describes how and why system safety is ensured. You can explore other aspects of embedded systems safety in the main topic. Table of Contents: Elements Anti-Patterns Lectures Articles From Around the Web Elements A safety plan should reference a safety standard that is compatible with the target application and its intended use case. Within the framework of the chosen safety standard, multiple iterations …
Continue reading “Safety Plan”