Musings on Supply Chain Vulnerability in Light of "The Big Hack"

Updated: 20190302

At the beginning of October 2018, Bloomberg dropped a bomb by publishing "The Big Hack: How China Used a Tiny Chip to Infiltrate America's Top Companies".

The story claims that between 2014-2015, Supermicro server motherboards had a small IC inserted onto the PCB, possibly connected to the baseboard management controller (BMC). The chip allowed attackers to alter "the server's Core OS so it could accept modifications and contact attacker-controlled computers for further instructions/code". Supermicro subcontractors installed these chips at the behest of the People's Liberation Army (PLA) of China. The hacked server motherboards (supposedly) made it into the hands of Apple, Amazon, and 30 other unnamed companies.

Unfortunately, the lack of evidence of hacked hardware, a list of affected SKUs, or confirmation by other agencies casts doubt on Bloomberg's story. Regardless, the story does a great job of highlighting the vulnerability of modern technological supply chains.

The Global Technological Supply Chain

Modern supply chains span the globe and are difficult for any one individual to fully understand. Building a product involves a chain of factories, component suppliers, logistics agencies, and warehouses. Modern products are complex, and supply chains involving fifty suppliers are common. The supply chain web is further complicated by the fact that participants are usually located within different provinces and countries.

Supplier counts don't include the subcontractors that your suppliers don't mention to you, and these subcontractors may change between production runs. Getting visibility into lower-tier suppliers and subcontractors can be difficult. Apple is one of the rare companies concerned enough about supply chain risk to go through the expensive and time-consuming effort of auditing the full supply chain.

Hardware products are increasingly complex, involving hundreds of electrical components, 12-40 layer PCBs, and interacting pieces of firmware. To reduce costs or schedules, companies outsource electrical/mechanical design and software development work to third-parties. These tasks are often outsourced to the CM or ODM, or they may involve any number of shops in Eastern Europe, India, China, Hong Kong, Taiwan, or Korea. Most systems are too large for any one person to fully understand, and outsourcing engineering work guarantees this.

China has a near monopoly on technological manufacturing, and practically every company I speak with sets its sights on manufacturing in China. The International Business Times put together an infographic based on 2011 economic data that shows China's manufacturing dominance: the country produces 75% of the world’s mobile phones and 90% of the world’s computers. At this point, China has the technological manufacturing experience, capabilities, and rare earth metal supplies to guarantee it remains the dominant manufacturing force.

With such a complex ecosystem involving numerous suppliers in different countries, how can any company truly guarantee that its products are secure? Any such assertion is suspicious.

Longstanding Concerns

Concerns about supply chain vulnerabilities are not new. We touched on supply chain issues in the January 2018 newsletter which focused on the problem of electrical component counterfeiting. While companies might care about counterfeit components leading to product issues, the US government views the vulnerability of modern supply chains as a national security concern. Government agencies have sounded alarms regarding supply chain security or taken direct action to ban Chinese suppliers.

In 2005, the Defense Science Board published a "Report on High Performance Microchip Supply" calling for urgent action to ensure a trusted and secure supply for integrated circuit components, citing concerns about component trustworthiness and clandestine manipulation risks:

Because of the U.S. military dependence on advanced technologies whose fabrication is progressively more offshore, opportunities for adversaries to clandestinely manipulate technology used in U.S. critical microelectronics applications are enormous and increasing. In general, sophisticated, clandestine services develop opportunities to gain close access to a target technology throughout its lifetime, not just at inception.

The DSB report emphasized the risk level with this comment (emphasis mine):

One unintended result of this otherwise sound industry change is the relocation of critical microelectronics manufacturing capabilities from the United States to countries with lower-cost capital and operating environments. From a U.S. national security view, the potential effects of this restructuring are so perverse and far reaching and have such opportunities for mischief that, had the United States not significantly contributed to this migration, it would have been considered a major triumph of an adversary nation’s strategy to undermine U.S. military capabilities.

In 2012, another report, prepared by Northrop Grumman for the US-China Economic and Security Review Commission, warned of supply chain vulnerability. The report discusses the PLA's information warfare strategy, which centers around seizing control of an adversary's information systems. The report highlights the vulnerability of American hardware manufacturers with factories in China:

Without strict control of this complex upstream channel, a manufacturer of routers, switches, or other basic telecommunications hardware is exposed to innumerable points of possible tampering and must rely on rigorous and often expensive testing to ensure that the semiconductors being delivered are trustworthy and will perform only as specified, with no additional unauthorized capabilities hidden from view.

In 2012, a congressional report from the US House Intelligence Committee claimed that Huawei and ZTE, two Chinese manufacturers of telecom equipment and phones, posed a threat to US national security. The US and Australian governments especially mistrust Huawei due to its connections with the PLA and Chinese government. Also in 2012, Australia banned Huawei components from being used in its high-speed broadband network. Later, in 2014, the US government banned Huawei and ZTE from bidding on government contracts.

2018 brought another wave of anti-Huawei and anti-ZTE action. In February 2018, US intelligence agencies warned that American citizens shouldn’t use products and services produced by Huawei and ZTE. In August 2018, Australia banned Huawei from supplying equipment for its 5G mobile network due to national security concerns. Also in August 2018, Trump signed a ban on Huawei and ZTE technology use within the US government and its contractors:

The ban covers the use of Huawei and ZTE components or services that are “essential” or “critical” to the system they’re used in. Some components from these companies are still allowed, so long as they cannot be used to route or view data. The bill also instructs several government agencies, including the Federal Communications Commission, to prioritize funding to assist businesses that will have to change their technology as a result of the ban.

US chip manufacturers have repeatedly accused China of intellectual property theft. At the beginning of November 2018, the US Justice Department charged the state-owned Chinese firm Fujian Jinhua, the Taiwanese firm UMC, and three Taiwanese nationals with conspiracy to steal DRAM technology from Micron. The government has banned US companies from selling equipment, software, and materials to Fujian Jinhua due to national security risks.

"Placing Jinhua on the entity list will limit its ability to threaten the supply chain for essential components in our military systems." - Wilbur Ross, U.S. Commerce Secretary

Outside of national security risks, consolidation brings its own concerns. Most companies run their infrastructure on servers that they rent from a small number of large companies: Microsoft, Google, and Amazon. These companies run numerous server farms and are responsible for millions of pieces of hardware, likely consolidated around a few base hardware designs. Such large-scale operations involve orders and suppliers that are hard to disguise. Server access is a major information warfare battleground, and compromising a major server provider would be a major strategic gain.

"Someone Would Have Noticed"

While reading commentary about the Big Hack, I've seen confident claims that the attack is impossible:

  • "It seems implausible that no one at Super Micro looked at the boards they received from China and realized that they had an extra chip that wasn't present in the design. What kind of quality controls do they apply?"
  • "Fitting any chip capable of exfiltrating a nontrivial amount of data onto a modern motherboard without going through many rounds of simulation or significantly impacting performance, while also putting it in a place it is capable of intercepting valuable data is practically impossible. Hell, just getting the right power domains wired to the chip is going to be tough enough."
  • "Still, to actually accomplish a seeding attack would mean developing a deep understanding of a product’s design, manipulating components at the factory, and ensuring that the doctored devices made it through the global logistics chain to the desired location"

I think such statements expose a lack of experience with developing modern hardware products and a misplaced trust in the security of the supply chain. In our experience, teams are rarely as rigorous as armchair experts think they are. The claims above ignore the following:

  • No one performs part-for-part checks between the layout, stuffed board, and the bill-of-materials (BOM)
  • Rework is common, and many engineers would not question extra wires or chips soldered onto their boards - they just assume something was wrong with the current design iteration or manufacturing run
  • Fully populated boards are an overwhelming mess of chips of various sizes, making it easy to sneak in another one
  • Many designs have empty pads for stuffing options, alternative parts, or debugging - a part connected to one of these areas would not set off visual red flags
  • Designs can be largely or completely outsourced, and engineers at the parent company may not be familiar enough with the design to visually know whether a component or rework is legitimate or not
  • Aside from rare cases such as Apple, hardware companies don't invest in inspecting hardware for tampering
  • Once the engineering team leaves the factory, there is little-to-no oversight into how the CM assembles the products aside from automated test stations and CM-produced reports
  • Once a product has reached the mass production stage, engineers are not involved in performing quality checks or ensuring the board is populated as expected - products ship directly from the factory through the logistics company to warehouses or customers

Some who disagree with the feasibility of such an attack emphasize that automated optical inspection (AOI) would have caught an extra chip. This claim ignores the fact that it’s possible to rework a design after performing the AOI step without subjecting it to additional process controls. Most of the rework for systems that we’ve worked on has occurred after any AOI step. To get rework past a manual inspection step, you simply tell the factory operators to expect the rework when inspecting boards.

Claiming it’s difficult to understand a product’s design without being the designer is also false. When a company outsources a design to a CM, ODM, or third-party, they are no longer the sole possessor of the knowledge. Many CM teams work directly with parent company engineers to validate products to ensure they work as expected or to refine the design for improved manufacturability. Furthermore, China has become infamous for IP theft due to reverse engineering designs and producing product clones. We don’t even need to focus on China - American companies perform reverse engineering efforts all the time. Hardware teams perform teardowns of competitor products to analyze their designs and figure out how they work. Government agencies work with defense contractors to reverse engineer military technology from other nations. Understanding and modifying a product requires effort and motivation, and intelligence agencies don’t lack these two ingredients.

Now that we've shared our thoughts, we can say that we don't believe what Supermicro told Congress:

In fact, we believe that it is impossible as a practical matter to insert unauthorised malicious chips onto our boards during the manufacturing process. [...] [Supermicro's] test processes at every step are designed to alert us to any discrepancies from our base design.

CMs can (and do) fake test results using a variety of methods so that yield fallout goes unnoticed. Perhaps Supermicro truly thinks its manufacturing process is locked down better than Apple's, but we have our doubts.

Chinese manufacturers, engineers, and suppliers are far more capable than armchair product developers and manufacturing engineers give them credit for. Wired's excellent Shenzhen: The Silicon Valley of Hardware documentary will give you a brief glimpse into the Chinese manufacturing ecosystem. China is the center of electronics manufacturing, and the feats that can be performed with techniques available on the open market are truly amazing. The story cites the involvement of a government intelligence agency, and government-funded techniques are often several steps ahead of the publicly available techniques. So much of our own technological advances are a result of defense research.

History of Attacks

Those who claim such a hardware hack is impossible should also consider the hardware implant feats performed by Cold War intelligence agencies and the modern NSA.

In 1945, the Soviet Union spied on the US Ambassador's office using a listening device capable of transmitting audio signals without its own power supply. The device only became active when Soviet operators broadcast the correct radio frequency to it from an external transmitter. Since the device was small and had no power supply or active components, it escaped detection for seven years. There are claims that powering such a device would be too difficult to pull off, yet the world is much more technologically capable now than it was in 1945.

Edward Snowden famously exposed that the NSA routinely intercepts routers while in transit and modifies them to insert surveillance equipment. After modification they repackage the devices with a new factory seal and send them on to the recipient.

We also see researchers demonstrating hardware hacking concepts, such as a "chip-in-the-middle" attack on a smartphone touchscreen which enabled them to spy on the device without modifying firmware.

Bunnie Huang shared his own lessons from attempting to authenticate his supply chain in a talk on Supply Chain Security. He also walks through categories of possible attacks, some of which cost a penny and a few seconds, others of which take months to pull off (such as the hack described by Bloomberg).

Hardware hacks are certainly more difficult and logistically intensive than software hacks, but they are far from impossible.

What to Make of This?

Now, I’m not saying that the Bloomberg story is accurate. We don’t have any evidence of hacked hardware, and we don’t have named sources. Trusting the story is ill-advised. What I’m saying is that our supply chains are vulnerable to such an attack, and that it’s not even in the realm of “extremely difficult” to pull off.

Our designs and supply chain networks are too complex for us to fully understand, and with that comes increased risk. Matt Levine, who publishes the Money Stuff column (and newsletter), highlighted this point in his October 4 newsletter (emphasis mine):

This story is, I think, best read not as a story about computer hardware or national security or supply-chain management, but about the epistemology of capitalism. Our lives are, increasingly, lived on and through computers, and a deep lesson of this story is that not even the smartest engineer at Apple Inc.—not even all of the engineers at Apple combined—can fully understand how those computers work at every level. More than that, our business works through global supply chains, and no one involved in a complicated global supply chain can fully comprehend it. All of the knowledge about everything that matters is distributed; if you are a practical sort you can build a house or grow corn yourself, but you cannot take sand and oil and build a modern computer and program it to run Microsoft Word. You can’t even describe all the steps of that process in a reasonably satisfying way.

For the most part this is a good feature of modern life: By distributing knowledge, and creating networks of trust and trade to unlock that distributed knowledge, we enable complexity, and we can build more and better products and do more and better stuff and have a greater total sum of useful knowledge. But it does put a lot of stress on those networks of trust and trade. If someone decides to insert a rice-sized hack into the computer’s supply chain, you can’t easily protect yourself. You can’t fully grasp the computer, or the supply chain; the checks and trust networks are themselves disaggregated. The vulnerability is the price of modernity.

Having less blind trust in corporations and their supply chains is a healthy mindset.

Change Log

  • 20181219:
    • Add comment from Supermicro regarding the impossibility of the attack
    • Added "Related Articles"
  • 20190302:
    • Added additional links

Related Articles

Musings on Tight Coupling Between Firmware and Hardware

Firmware applications are often tightly coupled to their underlying hardware and RTOS. There is a real cost associated with this tight coupling, especially in today's increasingly agile world with its increasingly volatile electronics market.

I've been musing about the sources of coupling between firmware and the underlying platform. As an industry, we must focus on creating abstractions in these areas to reduce the cost of change.

Let's start the discussion with a story.

Table of Contents

  1. The Hardware Startup Phone Call
  2. Coupling Between Firmware and Hardware
    1. Processor Dependencies
    2. Platform Dependencies
    3. Component Dependencies
    4. RTOS Dependencies
  3. Why Should I Care?

The Hardware Startup Phone Call

I'm frequently contacted by companies that need help porting their firmware from one platform to another. These companies are often on tight schedules with a looming development build, production run, or customer release. Their stories follow a pattern:

  1. We built our first version of software on platform X using the vendor SDK and vendor-recommended RTOS
  2. We need to switch to platform Y because:
    1. X is reaching end of life
    2. We cannot buy X in sufficient quantities because Big Company bought the remaining stock
    3. Y is cheaper
    4. Y's processor provides better functionality / power profile / peripherals / GPIO availability
    5. Y's components are better for our application's use case
  3. Platform Y is based on a different processor vendor (i.e. SDK) and/or RTOS
  4. Our engineer is not familiar with Platform Y's processor/components/SDK/RTOS
  5. The icing on the cake: We need to have our software working on Platform Y within 30-60 days

After hearing the details of the project, I ask my first question, which is always greeted with the same answer:

Phillip: Did you create abstractions to keep your code isolated from the vendor SDK or RTOS?

Company: No. We're a startup and we were focused on moving as quickly as possible

I'll then ask my second question, which is always greeted with the same answer:

Phillip: Do you have a set of unit/functional tests that I can run to make sure the software is working correctly after the port?

Company: No. We're a startup and we were focused on moving as quickly as possible

Then I'll ask the final question, which is always greeted with the same answer:

Phillip: How can I tell whether or not the software is working correctly after I port it?

Company: We'll just try it out and make sure everything works

Given these answers, there's practically no chance I can help the company and meet their deadlines. If there are large differences in SDKs and RTOS interfaces, the software has to be rewritten from scratch using the old code base as a reference.

I also know that if I take on the project, I'm in for a risky business arrangement. How can I be sure that my port was successful? How can I defend myself from the client's claim that I introduced issues without having a testable code base to compare against?

Why am I telling you this story?

Because this scenario arises from a single strategic failure: failure to decouple the firmware application from the underlying RTOS, vendor SDK, or hardware. And as an industry we are continually repeating this strategic failure in the name of "agility" and "time to market".

These companies fail to move quickly in the end, since the consequences of this strategic blunder are extreme: schedule delays, lost work, reduced morale, and increased expenditures.

Coupling Between Firmware and Hardware

Software industry leaders have been writing about the dangers of tight coupling since the 1960s, so I'm not going to rehash coupling in detail. If you're unfamiliar with the concept, here is some introductory reading:

In Why Coupling is Always Bad, Vidar Hokstad brings up consequences of tight coupling, two of which are relevant for this musing:

  • Changing requirements that affect the suitability of some component will potentially require wide ranging changes in order to accommodate a more suitable replacement component.
  • More thought need to go into choices at the beginning of the lifetime of a software system in order to attempt to predict the long term requirements of the system because changes are more expensive.

We see these two points play out in the scenario above.

If your software is tightly coupled to the underlying platform, changing a single component of the system - such as the processor - can cause your company to effectively start over with firmware development.

The need to swap components late in the program (and the resulting need to start over with software) is a failure to perform the up-front long-term thinking required by tightly coupled systems. Otherwise, the correct components would have been selected during the first design iteration, rendering the porting process unnecessary.

Let's review a quote from Quality Code is Loosely Coupled:

Loose coupling is about making external calls indirectly through abstractions such as abstract classes or interfaces. This allows the code to run without having to have the real dependency present, making it more testable and more modular.

Decoupling our firmware from the underlying hardware is As Simple As That™.
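As a minimal sketch of that idea (all names here are illustrative, not taken from any real SDK), the application can drive an LED through a small interface struct of function pointers, and a mock implementation lets the code run - and be tested - without real hardware present:

```c
#include <stdbool.h>

/* Generic LED interface: application code sees only this struct,
   never the vendor SDK calls hiding behind it. */
struct led_driver {
    void (*on)(void);
    void (*off)(void);
};

/* Mock implementation: records state instead of touching hardware,
   so the application logic runs on any machine. */
static bool mock_led_state = false;
static void mock_on(void)  { mock_led_state = true;  }
static void mock_off(void) { mock_led_state = false; }

static const struct led_driver mock_led = { mock_on, mock_off };

/* Application logic depends only on the interface, making it
   testable and portable across boards. */
static void blink_once(const struct led_driver *led)
{
    led->on();
    led->off();
}
```

A real board would supply a second `led_driver` instance whose functions call into the vendor SDK; the application code doesn't change.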

Up front planning and design is usually minimized to keep a company "agile". However, without abstractions that easily enable us to swap out components, our platform becomes tied to the initial hardware selection.

You may argue that taking the time to design and implement abstractions for your platform introduces an unnecessary schedule delay. How do those time savings stack up against the delay caused by the need to rewrite your software?

We all want to be "agile", and abstractions help us achieve agility.

What is more agile than the ability to swap out components without needing to rewrite large portions of your system? You can try more designs at a faster pace when you don't need to rewrite the majority of your software to support a new piece of hardware.

Your abstractions don't need to be perfect. They don't need to be reusable on other systems. But they need to exist if you want to move quickly.

We need to start producing abstractions that minimize the four sources of tight coupling in our embedded systems:

  1. Processor Dependencies
  2. Platform Dependencies
  3. Component Dependencies
  4. RTOS Dependencies

Processor Dependencies

Processor dependencies are the most common form of coupling and arise from two major sources:

  1. Using processor vendor SDKs
  2. Using APIs or libraries which are coupled to a target architecture (e.g. CMSIS)

Processor-level function calls are commonly intermixed with application logic and driver code, ensuring that the software becomes tightly coupled to the processor. Decoupling firmware from the underlying processor is one of the most important steps toward design portability and reusability.

In the most common cases, teams will develop software using a vendor's SDK without an intermediary abstraction layer. When the team is required to migrate to another processor or vendor, the coupling to a specific vendor's SDK often triggers a rewrite of the majority of the system. At this point, many teams realize the need for abstraction layers and begin to implement them.

In other cases, software becomes dependent upon the underlying architecture. Your embedded software may work on an ARM system but not be readily portable to a PIC, MIPS, AVR, or x86 machine. This is common when utilizing libraries such as CMSIS, which provides an abstraction layer for ARM Cortex-M processors.

A more subtle form of architecture coupling can occur even when abstraction layers are used. Teams can create abstractions which depend on a specific feature, an operating model particular to a single vendor, or an architecture-specific interaction. This form of coupling is less costly, as the changes are at least isolated to specific areas. Interfaces may need to be updated and additional files may need to change, but at least we don't need to rewrite everything.
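One way such a processor abstraction might look (a sketch with hypothetical names): the interface is declared once, and each processor gets its own implementation file that forwards to the vendor SDK - on an STM32 target, for instance, to `HAL_GPIO_WritePin(...)`. Here, a host-machine stub backed by an array stands in for the hardware:

```c
#include <stdbool.h>
#include <stdint.h>

/* Processor-independent GPIO interface: application and driver code
   call these functions, never the vendor SDK directly. */
void gpio_set(uint8_t pin, bool level);
bool gpio_get(uint8_t pin);

/* Host-machine stub implementation. A target build would link a
   different file that makes the equivalent vendor SDK calls. */
static bool pin_state[32];

void gpio_set(uint8_t pin, bool level) { pin_state[pin] = level; }
bool gpio_get(uint8_t pin)             { return pin_state[pin]; }
```

Swapping processors then means writing one new implementation file, not hunting SDK calls through the whole codebase.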

Platform Dependencies

Embedded software is often written specifically for the underlying hardware platform. Rather than abstracting platform-specific functionality, embedded software often interacts directly with the hardware.

Without being aware of it, we develop our software based on assumptions about our underlying hardware. We write our code to work with four sensors, and then the second version of the product only needs two - yet we must support both versions with a single firmware image.

Consider another common case, where our software supports multiple versions of a PCB. Whenever a new PCB revision is released, the software logic must be updated to support the changes. Supporting multiple revisions often leads to #ifdefs and conditional logic statements scattered throughout the codebase. What happens when you move to a different platform, with different revision numbers? Wouldn't it be easier if your board revision decisions were contained in a single location?

When these changes come, how much of your code needs to be updated? Do you need to add #ifdef statements everywhere? Do your developers cringe and protest because of the required effort? Or do they smile and nod because it will only take them 15 minutes?

We can abstract our platform/hardware functionality behind an interface (commonly called a Board Support Package). What features is the hardware platform actually providing to the software layer? What might need to change in the future, and how can we isolate the rest of the system from those changes?

Multiple platforms and boards can be created that provide the same set of functionality and responsibilities in different ways. If our software is built upon a platform abstraction, we can move between supported platforms with greater ease.
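A sketch of keeping board-revision decisions in one place (field names and values are invented for illustration): rather than scattering `#ifdef`s, the revision-specific choices live in a single configuration table that the rest of the system reads.

```c
#include <stdint.h>

/* Board support sketch: everything that differs between hardware
   revisions is collected in one structure. */
struct board_config {
    uint8_t hw_revision;
    uint8_t sensor_count;   /* e.g. v1 shipped four sensors, v2 only two */
    uint8_t status_led_pin;
};

static const struct board_config board_rev1 = { 1, 4, 13 };
static const struct board_config board_rev2 = { 2, 2, 7 };

/* Selected once at startup, e.g. from strap pins or an EEPROM field.
   The rest of the firmware only reads the returned config. */
const struct board_config *board_detect(uint8_t rev_id)
{
    return (rev_id >= 2) ? &board_rev2 : &board_rev1;
}
```

When revision three arrives, you add one table entry and adjust `board_detect()` - the fifteen-minute change instead of the codebase-wide one.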

Component Dependencies

Component Dependencies are a specialization of the platform dependency, where software relies on the presence of a specific hardware component instance.

In embedded systems, software is often written to use specific driver implementations rather than generalized interfaces. This means that instead of using a generalized accelerometer interface, software typically works directly with a BMA280 driver or LIS3DH driver. Whenever the component changes, code interacting with the driver must be updated to use the new part. Similar to the board revision case, we will probably find that #ifdefs or conditionals are added to select the proper driver for the proper board revision.

Higher-level software can be decoupled from component dependencies by working with generic interfaces rather than specific drivers. If you use generic interfaces, underlying components can be swapped out without the higher-level software being aware of the change. Whenever parts need to be changed, your change will be isolated to the driver declaration (ideally found within your platform abstraction).
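A sketch of such a generic interface (the struct and function names are illustrative): higher layers call through an `accel_driver`, unaware of whether a BMA280, a LIS3DH, or a test fake sits underneath.

```c
#include <stdint.h>

struct accel_sample { int16_t x, y, z; };

/* Generic accelerometer interface: each part-specific driver
   (BMA280, LIS3DH, ...) provides its own implementation. */
struct accel_driver {
    int (*read)(struct accel_sample *out);
};

/* A fake driver stands in for either chip here; swapping real parts
   changes only which driver the platform declaration points at. */
static int fake_read(struct accel_sample *out)
{
    out->x = 1; out->y = 2; out->z = 3;
    return 0; /* 0 indicates success */
}

static const struct accel_driver fake_accel = { fake_read };
```

Application code holding an `accel_driver` pointer never needs to change when the part does.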

RTOS Dependencies

An RTOS's functions are commonly used directly by embedded software. When a processor change occurs, the team may find that the RTOS they were previously using is not supported on the new processor.

Migrating from one RTOS to another requires a painful porting process, as there are rarely straightforward mappings between the functionality and usage of two different RTOSes.

Providing an RTOS abstraction allows platforms to use any RTOS that they choose without coupling their application software to the RTOS implementation.

Abstracting the RTOS APIs also allows for host-machine simulation, since you can provide a pthreads implementation for the RTOS abstraction.
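The shape of an RTOS abstraction can be sketched like this (hypothetical `os_*` names): the application calls the abstraction, and each port supplies an implementation - a FreeRTOS port would forward to `xSemaphoreTake`/`xSemaphoreGive`, a host port to `pthread_mutex_lock`/`pthread_mutex_unlock`. A trivial single-threaded host implementation is shown here:

```c
/* RTOS abstraction sketch: application code uses only os_* calls. */
typedef struct { int locked; } os_mutex_t;

int os_mutex_lock(os_mutex_t *m);
int os_mutex_unlock(os_mutex_t *m);

/* Single-threaded host implementation; returns 0 on success,
   -1 if the mutex is already held. Each RTOS port replaces this file. */
int os_mutex_lock(os_mutex_t *m)
{
    if (m->locked) { return -1; }
    m->locked = 1;
    return 0;
}

int os_mutex_unlock(os_mutex_t *m)
{
    m->locked = 0;
    return 0;
}
```

Because the application never names the RTOS, the same code runs under FreeRTOS, another RTOS, or as a plain host process during simulation.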

Why Should I Care?

It's a fair question. Tight coupling in firmware has been the status quo for a long time. You may claim it still must remain that way due to resource constraints.

Vendor SDKs are readily available. You can start developing your platform immediately. The rapid early progress feels good. Perhaps you picked all the right parts, and the reduced time-to-market will actually happen for your team.

If not, you will find yourself repeating the cycle and calling us for help.

It's not all doom and gloom, however. There are great benefits from reducing coupling and introducing abstractions.

  • We can rapidly prototype hardware without triggering software rewrites
  • We can take better advantage of unit tests, which are often skipped on embedded projects due to hardware dependencies
  • We can implement the abstractions on our host machines, enabling developers to write and test software on their PC before porting it to the embedded system
  • We can reuse subsystems, drivers, and embedded system applications across an entire product line

I'll be diving deeper into some of these beneficial areas in the coming months.

In the meantime - happy hacking! (and get to those abstractions!)
