Exclusive Use of the Customer-Facing Firmware Update Mechanism
29 March 2023 by Phillip Johnston • Last updated 6 February 2025
Firmware update solutions must be reliable – especially over-the-air updates. A failure in these processes can be catastrophic: completely bricked devices that cannot be remotely recovered, but instead must be RMA’d. In the worst-case scenario, one bad update might mean bricking all of the devices in your fleet. This means the goal is to get as many iterations with these mechanisms as possible before releasing devices to your customers. There is one simple step you can take to maximize the number of iterations you get with the customer-facing …
Use Different Keys for Development and Production
29 March 2023 by Phillip Johnston
When using techniques like Code Signing, you need to implement Secure Secret Storage. Your private keys cannot be leaked, or else the whole signing and verification mechanism breaks. A common failure mode with this strategy is sharing private keys for development work. Proper key management is inconvenient, especially when you want to restrict access as much as possible. But developers need to create images and test them on-device, so they need access to the private keys to sign their images. Keys end up in git repositories, passed around via email or Slack, or uploaded to …
Mature Teams Prioritize End-to-End Interactions
27 March 2023 by Phillip Johnston
Embedded systems are increasingly complex, especially those connected to the internet. In many cases, they are better viewed as one part of a larger distributed system involving the device, one or more phone applications, backend servers, fleet management, CI/CD servers, and more. Many teams do not approach product development with this distributed nature in mind. The “old ways” are followed instead: first, make all of the individual components work on their own, then integrate them together. The firmware team will work in their own world until the system is mostly completed, as will the app …
Store and Index Software Build Artifacts
27 March 2023 by Phillip Johnston
Embedded device populations always have a mix of software versions. This is true whether you are in development or the device has been released to customers. Because function and variable address locations will change from one build to another, you must be able to access build artifacts for all versioned builds to debug them appropriately. Without a foolproof system in place for storing and indexing artifacts, these files can go missing or be mislabeled. Indexing these artifacts is essential from the perspective of being able to find the proper files when needed. You …
Firmware Update Support
Firmware update support is an essential capability for contemporary embedded devices, regardless of whether or not they are connected to the internet on a regular basis. Firmware updates are used to add new capabilities after launch, correct errors, and address security vulnerabilities. Firmware update support also ensures that devices can remain useful for a longer period of time, as the development team can respond to changes in the operating environment and customer expectations.
Supporting firmware updates for your system requires a number of device-side and infrastructure capabilities. Update reliability also improves significantly when your organization adopts supporting processes.
Firmware updates are a good example of Software Engineering as applied to embedded systems. We have to carefully design the update mechanism, account for failure modes, and make tradeoffs based on our system’s design goals and constraints.
Table of Contents:
- Device Capabilities
- Infrastructure Capabilities
- Supporting Processes
- Accounting for the Possibility of Failure
- Sub-topics and Variations
- Case Studies of Update-related Problems and Vulnerabilities
- Related Blog Posts
- References
Device Capabilities
Required
- Device software is split into multiple images.
  - At a minimum, you will need to split the device software into a Bootloader and Application.
  - Software may be further divided depending on reliability and update schemes, such as into a Loader, a distinct Updater, or a Fallback Image.
- Fail-safe support in case of an update failure or bad update
- An integrity check, ensuring that the provided binary has not been corrupted during transfer
- The device can report its software version
- The update mechanism is resilient against power and network loss during the update process. This is necessary to avoid bricking devices!
  - Ideally, updates will be “atomic”: either the whole update is applied, or no update occurs at all.
- A method for receiving firmware updates (whether via USB connection, SD card, or Over-the-Air)
Recommended
- Code signing support, which is used to verify both provenance and integrity of an update
- Support for rolling back to a previous version on command (many implementations only allow you to increase the version)
- Version data storage, schemas, and communication protocols to support data migrations in response to an update process
- Ability to specify pre- and post-update actions (e.g., a script) in addition to the firmware update
  - This can be extremely useful for supporting actions like data migrations or file removals, which might need to be executed after an update has completed.
  - This can be useful for implementing post-update sanity checks to make sure that the update processes completed successfully. If the checks do not pass, roll back to the previous version.
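The note above about rollback on command (many implementations only allow the version to increase) can be made concrete with a small acceptance policy. This is a minimal sketch written in Python for readability, not device code. It assumes versions are plain `major.minor.patch` strings, and the names `parse_version`, `should_accept_update`, and `allow_rollback` are hypothetical.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse a 'major.minor.patch' string into a tuple that compares numerically."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def should_accept_update(current: str, offered: str, allow_rollback: bool = False) -> bool:
    """Decide whether an offered version may replace the current one.

    Newer versions are always accepted. Older versions are accepted only
    when a rollback has been explicitly requested. The same version is
    rejected because there is nothing to do.
    """
    cur, new = parse_version(current), parse_version(offered)
    if new > cur:
        return True
    if new < cur:
        return allow_rollback
    return False
```

Comparing parsed tuples rather than raw strings matters here: as strings, "1.9.0" sorts after "1.10.0", which would make the policy accept a downgrade by accident.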
Infrastructure Capabilities
Required
- The build system must produce unique software versions
- Store and index software build artifacts
- A mechanism for pushing a new update or indicating that a new update is available
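As an illustration of the first two requirements, here is a minimal in-memory sketch of an artifact index keyed by build version. It is an assumption-laden example, not a recommendation of a specific tool: the `ArtifactIndex` class, the version string format, and the file names are all hypothetical, and a real system would persist this data.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Artifact:
    version: str
    filename: str
    sha256: str


class ArtifactIndex:
    """Maps (version, filename) to an artifact record and refuses overwrites."""

    def __init__(self) -> None:
        self._by_version: Dict[str, Dict[str, Artifact]] = {}

    def add(self, version: str, filename: str, data: bytes) -> Artifact:
        files = self._by_version.setdefault(version, {})
        if filename in files:
            # A second upload for the same version means versions are not unique.
            raise ValueError(f"{filename} already stored for version {version}")
        artifact = Artifact(version, filename, hashlib.sha256(data).hexdigest())
        files[filename] = artifact
        return artifact

    def lookup(self, version: str, filename: str) -> Artifact:
        """Raises KeyError if the version or file was never indexed."""
        return self._by_version[version][filename]
```

Rejecting a duplicate entry is the design choice worth noting: if the build system ever produces two different binaries with the same version, the index fails loudly instead of silently replacing the file you will need for debugging.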
Recommended
- Minimally, produce a checksum that can be used to verify the binary was transferred without error. Ideally, code signing will be used instead, as you can also verify that the update is coming from an authorized source.
- Cohort binning of devices enables you to control which devices receive specific firmware updates.
  - This is useful for deploying beta builds to an interested population of beta testers.
  - This is also a common way of implementing staged rollouts.
- Staged rollouts of firmware updates provide a safer update mechanism than a “deploy to everyone” approach. You start with a small population of devices to make sure that the update succeeds and does not introduce significant new issues. If everything looks good, you continue to roll out the update to increasingly large segments of your population.
- Ability to roll back firmware to a previous version in the event of a bad update
- Check-in and heartbeat messages are useful for determining:
  - Whether or not an update was successful (a device will check in with a new firmware version)
  - The distribution of versions throughout the fleet
    - Often, teams are surprised to realize that there’s a distribution of versions, even when a new OTA update is released. Also, you will find that some devices never update.
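Cohort binning and staged rollouts can be sketched with a stable hash of the device identifier. This is one possible approach under stated assumptions, not the only way to do it: the function names, the 100-bucket scheme, and the `beta_testers` set are hypothetical.

```python
import hashlib
from typing import FrozenSet


def rollout_bucket(device_id: str, buckets: int = 100) -> int:
    """Map a device ID to a stable bucket in the range [0, buckets)."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def is_eligible(
    device_id: str,
    rollout_percent: int,
    beta_testers: FrozenSet[str] = frozenset(),
) -> bool:
    """Return True if this device should be offered the update.

    Beta testers always qualify. Every other device qualifies once the
    rollout percentage exceeds its bucket number.
    """
    if device_id in beta_testers:
        return True
    return rollout_bucket(device_id) < rollout_percent
```

Because the bucket depends only on the device ID, raising the percentage from 5 to 25 keeps every device from the first stage in the rollout and only adds new ones.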
Supporting Processes
- Exclusive use of the customer-facing firmware update mechanism to ensure its reliability
  - Many teams leave OTA updates to the end of the project, for example, which is far too late in the process to ensure reliability. A better approach is implementing OTA updates first, and then requiring all development and internal testing to use OTA updates rather than JTAG or USB. This way, the update mechanisms receive significant mileage, and the kinks are worked out before the product is released to customers.
- Significant testing of the update process, especially with the use of fault injection to ensure that fallbacks and fail-safes work as intended
- Version Data Storage, Protocols, and Schemas
  - Data migrations are a common challenge that you will need to deal with when updating devices. For example, you might update the “device settings” layout, change a communication protocol, or update an sqlite database schema.
  - Without versioning these items, you cannot safely perform a migration as part of the firmware update process.
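To show why versioned data makes migrations safe, here is a minimal sketch of a settings migration chain. The settings fields (`brightness`, `brightness_percent`, `timezone`), the `schema_version` key, and the three schema versions are invented for illustration.

```python
from typing import Callable, Dict

SCHEMA_VERSION_KEY = "schema_version"
LATEST_SCHEMA = 3


def _v1_to_v2(settings: dict) -> dict:
    """Hypothetical change: 'brightness' was renamed to 'brightness_percent'."""
    out = dict(settings)
    out["brightness_percent"] = out.pop("brightness", 50)
    return out


def _v2_to_v3(settings: dict) -> dict:
    """Hypothetical change: a 'timezone' field was added with a default."""
    out = dict(settings)
    out.setdefault("timezone", "UTC")
    return out


# Each entry upgrades from the keyed version to the next one.
MIGRATIONS: Dict[int, Callable[[dict], dict]] = {1: _v1_to_v2, 2: _v2_to_v3}


def migrate(settings: dict) -> dict:
    """Apply migrations one step at a time until the data is at LATEST_SCHEMA."""
    version = settings.get(SCHEMA_VERSION_KEY)
    if version is None:
        raise ValueError("unversioned settings cannot be migrated safely")
    if version > LATEST_SCHEMA:
        raise ValueError("settings are newer than this firmware understands")
    current = dict(settings)
    while version < LATEST_SCHEMA:
        current = MIGRATIONS[version](current)
        version += 1
        current[SCHEMA_VERSION_KEY] = version
    return current
```

Stepping through one version at a time means a device that skipped several releases still ends up with valid data, and the explicit error for unversioned input reflects the point above: without a version, the code cannot know which layout it is looking at.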
Accounting for the Possibility of Failure
Firmware updates can go wrong in many ways. It is important to ensure that your update system is resilient to all of these failures. After all, if you brick devices, you cannot remotely fix them.
To protect against data corruption during the transfer, you should compare the received contents against a checksum to ensure the integrity of the update. Ideally, however, you will use code signing to provide both an integrity check and an assurance that the build comes from an approved source.
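A minimal sketch of the checksum comparison, assuming the server publishes a SHA-256 digest alongside the image (the function name `verify_sha256` is hypothetical). Note that this only detects corruption. It does not prove the image came from an approved source, which is what code signing adds.

```python
import hashlib
import hmac


def verify_sha256(payload: bytes, expected_hex: str) -> bool:
    """Return True if the payload's SHA-256 digest matches the expected hex string."""
    actual = hashlib.sha256(payload).hexdigest()
    # compare_digest performs a constant-time comparison of the two strings.
    return hmac.compare_digest(actual, expected_hex.lower())
```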
Updates should, ideally, be atomic: either the whole update is applied, or no update occurs at all. This is especially important in guarding against corruption due to loss of network connectivity or loss of power during an update.
The most common approach to atomic updates is to have dual application partitions in device storage, which we will call “A” and “B”. This approach is akin to the common “double buffering” pattern.
- The bootloader will boot from partition A, which is currently active.
- When an update is received, the contents will be placed into partition B.
- If the update process fails for any reason, the bootloader will continue to boot from partition A.
- If the update succeeds, the bootloader will boot from partition B.
- If there is a problem identified during the boot process, this can be indicated to the bootloader, which can automatically fall back to partition A
- When the next update is received, it will be placed into partition A.
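The A/B sequence above can be simulated in a few lines. This is a Python model of the behavior, not bootloader code, and the `Device` class, its fields, and the `power_loss_at` fault-injection parameter are hypothetical. The key property it demonstrates is that the active slot only changes as the final step of a successful update.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Device:
    images: Dict[str, bytes] = field(default_factory=lambda: {"A": b"v1", "B": b""})
    valid: Dict[str, bool] = field(default_factory=lambda: {"A": True, "B": False})
    active: str = "A"

    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def apply_update(self, image: bytes, power_loss_at: Optional[int] = None) -> bool:
        """Write the image to the inactive slot, then switch slots as the last step."""
        target = self.inactive()
        self.valid[target] = False  # the slot is untrusted while it is being written
        if power_loss_at is not None:
            # Simulated failure: only part of the image reaches storage.
            self.images[target] = image[:power_loss_at]
            return False
        self.images[target] = image
        self.valid[target] = True
        self.active = target
        return True

    def report_boot_failure(self) -> None:
        """Fall back to the other slot, but only if it holds a valid image."""
        other = self.inactive()
        if self.valid[other]:
            self.active = other

    def boot(self) -> bytes:
        return self.images[self.active]
```

The `power_loss_at` parameter is a small example of the fault injection recommended earlier: it lets a test interrupt the write at any byte offset and confirm that the device still boots the old image.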
However, memory and storage constraints can make atomic updates difficult to achieve with many embedded devices.
- RAM may be sufficiently limited such that an entire update payload cannot be received before being applied, but must be streamed to flash instead.
  - This means that the contents of flash could be overwritten with data before it can be determined that the checksum or signature matches the expected value.
- Flash may be sufficiently limited such that there is not space for a bootloader, two complete applications, and other artifacts.
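The streaming constraint can be illustrated with an incremental hash. In this sketch, a `bytearray` stands in for a flash slot, and `stream_to_slot` is a hypothetical name. The digest is updated as each chunk is written, so the full payload never needs to sit in RAM, but the result is only known after the slot has already been modified.

```python
import hashlib
from typing import Iterable


def stream_to_slot(chunks: Iterable[bytes], slot: bytearray, expected_sha256: str) -> bool:
    """Write chunks into the slot as they arrive and verify the digest at the end.

    Returns False if the image does not fit or the final digest does not match.
    The slot contents are overwritten either way, so the caller must not
    treat the slot as bootable unless this returns True.
    """
    hasher = hashlib.sha256()
    offset = 0
    for chunk in chunks:
        end = offset + len(chunk)
        if end > len(slot):
            return False
        slot[offset:end] = chunk
        hasher.update(chunk)
        offset = end
    return hasher.hexdigest() == expected_sha256
```

This is why streamed updates need somewhere safe to land, such as an inactive partition, or a fallback image that can recover the device if verification fails after the write.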
In cases where the dual partition approach cannot be used, we will create a “fallback” application, which effectively takes the place of the second partition:
- Updates will always be placed into the main application slot.
- If an update fails, or some problem is identified during the application boot process, the bootloader will automatically boot into the fallback application.
- The fallback application contains only the minimal amount of support to configure the processor and its components so that it can connect to the server and request a new update. This allows it to be much smaller than a complete second application.
This fallback firmware must be heavily tested so that it can be trusted to restore a system to a working state in the event of an update failure. Ideally, it will not need any updates once the device is deployed, as there is nothing left to fall back to if an update to the fallback firmware itself fails.
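The bootloader's choice between the main and fallback applications can be modeled as a small decision function. This is a sketch under assumptions the text does not specify: the boot attempt counter, the `confirmed` flag set by a healthy application, and the limit of three attempts are all hypothetical details.

```python
from dataclasses import dataclass

MAX_BOOT_ATTEMPTS = 3  # hypothetical limit before giving up on the main image


@dataclass
class BootState:
    main_image_valid: bool    # result of the integrity or signature check
    boot_attempts: int = 0    # incremented by the bootloader on each unconfirmed boot
    confirmed: bool = False   # set by the application once it is running correctly


def select_image(state: BootState) -> str:
    """Return 'main' or 'fallback' for this boot and update the attempt counter."""
    if not state.main_image_valid:
        return "fallback"
    if state.confirmed:
        return "main"
    if state.boot_attempts >= MAX_BOOT_ATTEMPTS:
        # The new image never confirmed itself, so treat it as a bad update.
        return "fallback"
    state.boot_attempts += 1
    return "main"
```

Counting unconfirmed boots covers the case where an image passes its integrity check but crashes before it can report a problem on its own.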
If you cannot support any of these schemes with your current resources, you will need to add more storage or avoid OTA updates completely. Otherwise, you run the risk of a power or network failure completely disabling your devices in the field. Wired updates are less sensitive here, as long as you provide a tool that can be used to re-flash the device from a corrupted state (e.g., a DFU utility).
Sub-topics and Variations
- Over-the-Air Update (OTA) refers to sending updates to a system wirelessly.
- Delta Update is an optimization of a full firmware update process that only involves transmitting pieces of the program that have changed. It is useful for systems with limited communication bandwidth or high communication costs.
- The Update Nightmare: Bricking Devices in the Field looks at real-world examples of failed updates as a motivation to support Staged Rollouts.
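To make the Delta Update idea concrete, here is a deliberately naive block-based sketch: only fixed-size blocks that differ between the old and new images are transmitted. Production delta schemes typically use more sophisticated binary diffing, so treat this as an illustration of the concept only. The function names and the 4-byte block size are hypothetical.

```python
from typing import Dict, Tuple

BLOCK_SIZE = 4  # tiny for illustration; real block sizes would follow flash geometry

Delta = Tuple[int, Dict[int, bytes]]


def make_delta(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> Delta:
    """Return the new image length plus only the blocks that changed, keyed by offset."""
    changed: Dict[int, bytes] = {}
    for offset in range(0, len(new), block_size):
        block = new[offset:offset + block_size]
        if old[offset:offset + block_size] != block:
            changed[offset] = block
    return len(new), changed


def apply_delta(old: bytes, delta: Delta) -> bytes:
    """Rebuild the new image from the old image and the changed blocks."""
    new_len, changed = delta
    # Start from the old image, truncated or zero-padded to the new length.
    image = bytearray(old[:new_len].ljust(new_len, b"\x00"))
    for offset, block in changed.items():
        image[offset:offset + len(block)] = block
    return bytes(image)
```

A delta is only valid against the exact old image it was generated from, which is one more reason to track which version each device is running before deciding what to send.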
Case Studies of Update-related Problems and Vulnerabilities
- Another vulnerability in the LPC55S69 ROM / Oxide describes a problem in the LPC55S69 In-System Programming code for the signing mechanism, which allows an attacker to gain non-persistent code execution with a carefully crafted update regardless of whether the update is signed
Related Blog Posts
- Q&A: Where Should Firmware Update Capabilities Live?
- Q&A: How Do You Manage Updates that Introduce Incompatible Changes?
- Over-the-Air Updates and Fleet Management at Scale
References
Not Invented Here Syndrome is a Business Problem
8 December 2022 by Phillip Johnston • Last updated 14 February 2024
“Not Invented Here” (NIH) syndrome is a significant problem for technology companies. NIH is a tendency to avoid code, products, standards, or techniques that come from outside of an organization. With software development, NIH is often associated with the idea that your internal team could do a better job, with a higher quality result, and incur a lower overall cost than any existing solution. In today’s market, you can purchase or find open source solutions for large portions of your system, including supporting infrastructure. You can buy pre-certified radio …
Software is a cost, not an asset
6 December 2022 by Phillip Johnston • Last updated 28 March 2024
Software companies often think of their code (or their software application) as an asset. Given one perspective, this makes sense: software has value, and you can buy it or sell it. This applies to organizations whose product is the code that they are selling. But for most teams, the code is not the product. Therefore, it is not an asset – it is a cost. The real asset is the business capability/value the software provides. Augmented by software, the business (or customer) can now do something they couldn’t do …
Mature Organizations Include Quality Assurance
18 November 2022 by Phillip Johnston
A strong indicator of a mature organization is the presence of a quality assurance (QA) role, and ideally a team.
Lack of QA is Common
Most organizations we have encountered neglect this role, especially startups and smaller teams. With startups, the focus is on building – the idea of releasing to customers can feel impossibly far away, and thus QA can seem unimportant. Small teams with limited resources may take a slightly different view, focusing their limited resources on development to try to generate more funds. One common justification for the lack of dedicated …
Building Embedded Teams in the Modern Era
