Case Study: Stuxnet

18 August 2020 by Phillip Johnston • Last updated 15 August 2023

Stuxnet was a worm that targeted Siemens PLCs and was responsible for significantly damaging Iran’s nuclear program. The worm was first uncovered in 2010, but is thought to have been in development for at least five years before that. Stuxnet caused centrifuges at the nuclear plant to spin out of control while operator screens reported nominal values, leading to system failures, asset damage, safety concerns, and a national security fiasco. This worm is particularly remarkable because it attacked systems that were not connected to any networks with access …

Case Study: St. Jude Pacemaker Recall

18 August 2020 by Phillip Johnston • Last updated 15 August 2023

465,000 U.S. patients have been told to visit a clinic to receive a firmware update for their St. Jude pacemakers. The firmware contains a security flaw which allows hackers within radio range to take control of a vulnerable pacemaker. Identified attacks include “crash attacks”, which broadcast a combination of signals that place cardiac devices into a state of malfunction, and “battery drain attacks”, which generate signals from the Merlin@home device to run down a cardiac device’s battery at a “greatly accelerated rate”. While St. Jude and other parties, …

Case Study: Boeing 737 MAX Crashes

18 August 2020 by Phillip Johnston • Last updated 10 June 2021

The Boeing 737 MAX-8 and MAX-9 aircraft were grounded after Ethiopian Airlines and Lion Air crashes both resulted in the deaths of everyone on board. The implicated system is the Maneuvering Characteristics Augmentation System (MCAS), which is part of the flight management computer software. The MCAS was designed to correct for an increased potential to stall the plane due to mechanical design changes. When fed an Angle-of-Attack reading from a bad sensor, the MCAS triggered at an improper time, forcing the plane’s nose down and overriding pilot input. The …

Case Study: Toyota Unintended Acceleration

18 August 2020 by Phillip Johnston • Last updated 13 September 2022

Unintended acceleration, or the loss of driver control over engine power, in Toyota cars is suspected in the deaths of at least 89 people and injuries to at least 57 more (with hundreds of additional cases being settled). Toyota, the Department of Transportation, the U.S. National Highway Traffic Safety Administration, and journalists cited “driver error” or “stuck acceleration pedals” due to floor mats as the primary cause. The official finding of the joint NHTSA and NASA investigation confirmed this opinion. NASA’s team did find one theoretical way for Toyota’s …

Case Study: Patriot Missile Failure at Dhahran

18 August 2020 by Phillip Johnston • Last updated 15 August 2023

An Iraqi Scud missile hit barracks in Dhahran, Saudi Arabia, after a Patriot missile defense system failed to intercept the missile. The accident resulted in 28 U.S. soldiers killed and 98 soldiers wounded. The failure to intercept the missile was caused by a compounding software clock drift error resulting in a distance calculation error of 687 meters. Because of the drift and corresponding distance offset, the system determined that the missile was on a spurious track and did not fire.

Further Information

For more on the event and the …

Case Study: Therac-25

18 August 2020 by Phillip Johnston • Last updated 13 September 2022

The Therac-25 deaths are a canonical example of software accidents. Two different errors caused multiple patients to receive massive overdoses of radiation, resulting in serious injuries or death.

Case Studies

This is a well-documented accident, so we will refer you to the following sources for understanding what went wrong:

  1. Summary Video by Phil Koopman
  2. Wikipedia: Therac-25
  3. IEEE: An Investigation of the Therac-25 Accidents, by Nancy Leveson and Clark Turner, is one of the original investigatory articles published on the topic
  4. Medical Devices: The Therac-25, by Nancy Leveson, is an updated and …

Coding Standard: Power of Ten

29 April 2020 by Phillip Johnston • Last updated 15 August 2023

The Power of Ten is a popular set of coding “rules” for writing safety-critical software that originally appeared in IEEE Computer in June 2006. These rules have been floating around for a while, and the odds are good that you’ve heard someone mention them. Even though they’re nominally related to safety-critical software, they are excellent rules to follow for general embedded systems development. Here are the 10 rules:

  1. Restrict to simple control flow constructs.
  2. Give all loops a fixed upper bound.
  3. Do not use dynamic memory allocation after …
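
As a taste of what the first rules look like in practice, here is a minimal, hypothetical C sketch (our own illustration, not an example from the paper): the loop has a statically verifiable upper bound, the buffer is allocated statically rather than on the heap, and the function’s contract is checked with an assertion.

    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    #define SENSOR_SAMPLE_MAX 128 /* fixed capacity: no heap allocation */

    static int samples[SENSOR_SAMPLE_MAX];

    /* Average the first `count` samples. The loop bound is the fixed
     * capacity, so the loop provably terminates even if `count` is
     * corrupted; the assertion documents and checks the contract. */
    static int sample_average(size_t count)
    {
        assert((count > 0) && (count <= SENSOR_SAMPLE_MAX));

        long sum = 0;
        for (size_t i = 0; (i < SENSOR_SAMPLE_MAX) && (i < count); i++) {
            sum += samples[i];
        }
        return (int)(sum / (long)count);
    }

    int main(void)
    {
        samples[0] = 4;
        samples[1] = 8;
        printf("average: %d\n", sample_average(2)); /* prints "average: 6" */
        return 0;
    }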

Case Study: Uber ATG Crash in Tempe, Arizona

27 January 2020 by Phillip Johnston • Last updated 15 August 2023

On 18 March 2018, a woman was struck and killed by an Uber autonomous vehicle operating in Tempe, Arizona. The pedestrian was jaywalking at the time of the crash, the vehicle operator was distracted by her phone, and automated collision detection and braking systems supplied by the auto manufacturer were disabled by Uber. This case study is part of our analysis of historical software accidents and errors. To make the situation more disheartening, the self-driving software identified the pedestrian “object” 5.6 seconds before impact. Because the system did not …

Historical Software Accidents and Errors

We never learned about significant software accidents and errors during our education or professional careers. Sometimes, we would see an event in the news and follow it, but that was often the extent of our learning. Rarely did we find that lessons were codified and applied to our work.

Quote

My original account of the Therac-25 losses said that accidents are seldom simple. They usually involve a complex web of interacting events with multiple contributing technical, human, organizational, and regulatory factors. We aren’t learning enough today from the events nor focusing enough on preventing them. It’s time for computer science practitioners to be better educated about engineering for safety.
– Nancy Leveson, “The Therac-25: 30 Years Later”

As our world grows more dependent on software, developers everywhere need to study historical software problems, their causes, and the implications for how we build our software. We cannot afford to keep repeating the same mistakes, especially those that cause harm.

Table of Contents:

  1. Learning from Case Studies
  2. General Overviews
  3. Major Accidents and Errors
  4. Minor Accidents and Errors
  5. Security Vulnerabilities

Learning from Case Studies

If you have not studied case studies before, you might be unsure how to draw useful lessons from them. We recommend these entries:

General Overviews

Major Accidents and Errors

The following software-related accidents and errors resulted in loss of life or significant economic impact:

  1. The Therac-25 deaths are a canonical example of software accidents. Two different errors caused multiple patients to receive massive overdoses of radiation, resulting in serious injuries or death.
  2. An Iraqi Scud missile hit barracks in Dhahran, Saudi Arabia, after a Patriot missile defense system failed to intercept the missile. The accident resulted in 28 U.S. soldiers killed and 98 soldiers wounded. The failure to intercept the missile was caused by a compounding software clock drift error resulting in a distance calculation error of 687 meters. Because of the drift and corresponding distance offset, the system determined that the missile was on a spurious track and did not fire. A sketch of the drift arithmetic follows this list.
  3. Stuxnet was a worm that targeted Siemens PLCs and was responsible for significantly damaging Iran’s nuclear program. Stuxnet caused centrifuges at the nuclear plant to spin out of control while operator screens reported nominal values, leading to system failures, asset damage, safety concerns, and a national security fiasco.
  4. Unintended acceleration, or the loss of driver control over engine power, in Toyota cars is suspected in the deaths of at least 89 people and injuries to at least 57 more (with hundreds of additional cases being settled). Toyota and the Department of Transportation historically cited “driver error” or “stuck acceleration pedals” due to floor mats as the cause. The Barr Group determined that the Electronic Throttle Control System (ETCS) source code was of “unreasonable quality” and contained bugs that could cause unintended acceleration, that the fail-safes were defective and inadequate, and that Toyota did not comply with standards, both internal and external.
  5. The Boeing 737 MAX-8 and MAX-9 aircraft were grounded after Ethiopian Airlines and Lion Air crashes both resulted in the deaths of everyone on board. The implicated system is the Maneuvering Characteristics Augmentation System (MCAS), which is part of the flight management computer software. The MCAS was designed to correct for an increased potential to stall the plane due to mechanical design changes. When fed an Angle-of-Attack reading from a bad sensor, the MCAS triggered at an improper time, forcing the plane’s nose down and overriding pilot input.
  6. On 18 March 2018, a woman was struck and killed by an Uber autonomous vehicle operating in Tempe, Arizona. The pedestrian was jaywalking at the time of the crash, the vehicle operator was distracted by her phone, and automated collision detection and braking systems supplied by the auto manufacturer were disabled by Uber.
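
The clock-drift arithmetic in item 2 is worth working through. The Patriot system counted time in tenths of a second and converted that count to seconds by multiplying by an approximation of 1/10 chopped to fit a 24-bit fixed-point register. Because 0.1 has no finite binary representation, the stored constant is slightly too small, and the error grows linearly with uptime. The C sketch below reproduces this widely published arithmetic as an illustration of the error class; it is not the Patriot’s actual code, and the official 687-meter figure reflects the full tracking algorithm rather than this simplified model.

    #include <stdio.h>

    int main(void)
    {
        /* 0.1 chopped to 23 fractional bits, as reportedly stored in the
         * Patriot's 24-bit fixed-point register: floor(0.1 * 2^23) / 2^23. */
        const double tenth_chopped = 838860.0 / 8388608.0;

        /* Error in the stored constant: roughly 9.5e-8 too small. */
        const double per_tick_error = 0.1 - tenth_chopped;

        /* The battery at Dhahran had been running for about 100 hours.
         * Time was counted in tenths of a second. */
        const double ticks = 100.0 * 3600.0 * 10.0;

        const double drift_s = ticks * per_tick_error;

        /* A Scud closes at roughly 1676 m/s, so a fraction of a second of
         * drift shifts the predicted position by hundreds of meters. */
        const double range_error_m = drift_s * 1676.0;

        printf("clock drift after 100 h: %.4f s\n", drift_s);       /* ~0.34 s */
        printf("approx. tracking error:  %.0f m\n", range_error_m); /* ~575 m */
        return 0;
    }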

Minor Accidents and Errors

The following accidents and errors are “minor” because they resulted in malfunctioning equipment or annoyances rather than loss of life or significant economic damage. Linked entries contain key lessons from these events.

  1. In 2016, Muddy Waters announced security vulnerabilities in implantable St. Jude pacemakers. These vulnerabilities could result in a crash of the device or increased battery drain. The FDA issued a “recall” (for a firmware update) in 2017. Thankfully, no related deaths have been reported.
  2. In April 2019, several systems failed to handle the known GPS week-number rollover, even though this was the second rollover to occur since the protocol was developed. A sketch of the rollover and a common mitigation follows this list.
  3. In 2019, many Tesla repair professionals reported that Tesla Model S and X Media Control Units were wearing out their eMMC memory. When the eMMC fails, drivers cannot use the in-vehicle display, climate control, autopilot, lighting control, or vehicle charging. This issue was caused by excessive logging to the eMMC, which has a limited number of write cycles.
  4. A long-standing bug in the Advanced Radiation Detection Capability Unit, a system used to detect nuclear explosions, was identified and resolved thanks to a developer who worked through the math by hand to identify the problem.
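
The root cause in item 2 is a protocol limitation: the legacy GPS navigation message carries only a 10-bit week number, which wraps from 1023 back to 0 every 1024 weeks (roughly 19.6 years), most recently in April 2019. A receiver must therefore decide which 1024-week era a broadcast week belongs to. Below is a minimal C sketch of one common mitigation: disambiguating against a reference week that the firmware knows is a lower bound on the current date, such as one derived from the build date. The function names are our own, hypothetical ones.

    #include <stdint.h>
    #include <stdio.h>

    #define GPS_WEEK_MODULUS 1024u

    /* Naive decoder: hard-codes era 0, so after the August 1999 or April
     * 2019 rollover it reports dates roughly 19.6 years in the past. */
    static uint32_t absolute_week_naive(uint32_t broadcast_week)
    {
        return broadcast_week;
    }

    /* Pivot decoder: returns the smallest absolute week that is >= a
     * reference week the device knows is a lower bound on "now"
     * (e.g., derived from the firmware build date). */
    static uint32_t absolute_week_pivot(uint32_t broadcast_week,
                                        uint32_t reference_week)
    {
        uint32_t era = reference_week / GPS_WEEK_MODULUS;
        uint32_t candidate = (era * GPS_WEEK_MODULUS) + broadcast_week;
        if (candidate < reference_week) {
            candidate += GPS_WEEK_MODULUS; /* the 10-bit field already wrapped */
        }
        return candidate;
    }

    int main(void)
    {
        /* 7 April 2019 falls in absolute GPS week 2048, but the 10-bit
         * field in the navigation message reads 2048 % 1024 = 0. */
        uint32_t broadcast = 2048u % GPS_WEEK_MODULUS;

        /* Reference week baked in at build time: week 2000 (mid-2018). */
        uint32_t reference = 2000u;

        printf("naive: %u, pivot: %u\n",
               absolute_week_naive(broadcast),
               absolute_week_pivot(broadcast, reference));
        /* Prints "naive: 0, pivot: 2048". */
        return 0;
    }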

Security Vulnerabilities

The following security vulnerabilities provide us with useful lessons.

  1. SweynTooth is a family of 12 public BLE vulnerabilities, as well as other undisclosed vulnerabilities, that result in crashes, deadlocks, and bypassing of BLE secure connections. These vulnerabilities exposed flaws in vendor security testing processes and BLE certification testing.
  2. BootHole was a vulnerability in the popular GRUB2 bootloader, which is widely used to boot Linux systems. A buffer overflow when reading the grub.cfg file enabled execution of arbitrary code and allowed attackers to bypass Secure Boot on both Linux and Windows systems. A sketch of the underlying bug class follows this list.
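
BootHole (CVE-2020-10713) is a reminder that even files read during a verified boot can be attacker-controlled: grub.cfg is parsed before any signature check applies to it. The C sketch below illustrates the general bug class, a parser copying an attacker-controlled token into a fixed-size buffer without a bounds check. GRUB2’s actual flaw was in its flex-generated tokenizer, where a fatal-error handler failed to stop parsing an oversized token, so treat this as a simplified stand-in rather than the real code.

    #include <stdio.h>
    #include <string.h>

    #define TOKEN_MAX 64

    /* Vulnerable pattern: trusts the length of a token read from an
     * attacker-controlled config file and copies it into a fixed buffer. */
    static void parse_token_unsafe(const char *input)
    {
        char token[TOKEN_MAX];
        strcpy(token, input); /* overflows when input is >= TOKEN_MAX bytes */
        printf("token: %s\n", token);
    }

    /* Safer pattern: validate the length before copying, and fail closed. */
    static int parse_token_safe(const char *input)
    {
        char token[TOKEN_MAX];
        if (strlen(input) >= sizeof(token)) {
            return -1; /* oversized token: refuse to parse */
        }
        strcpy(token, input);
        printf("token: %s\n", token);
        return 0;
    }

    int main(void)
    {
        /* A well-formed token parses; an oversized one is rejected
         * instead of smashing the stack. */
        parse_token_safe("set timeout=5");
        if (parse_token_safe("an oversized token would be rejected, "
                             "not copied past the end of the buffer......") != 0) {
            printf("rejected oversized token\n");
        }
        (void)parse_token_unsafe; /* shown for contrast; do not call */
        return 0;
    }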

Safety Plan

31 October 2019 by Phillip Johnston

A safety plan is a cornerstone of embedded device safety for safety-critical systems. The safety plan outlines the relevant safety standard(s), identifies hazards and risks, enumerates safety goals, and describes how and why system safety is ensured. You can explore other aspects of embedded systems safety in the main topic.

Table of Contents:

  1. Elements
  2. Anti-Patterns
  3. Lectures
  4. Articles From Around the Web

Elements

A safety plan should reference a safety standard that is compatible with the target application and its intended use case. Within the framework of the chosen safety standard, multiple iterations …
