Firmware update support is an essential capability for contemporary embedded devices, whether or not they are regularly connected to the internet. Firmware updates are used to add new capabilities after launch, correct errors, and address security vulnerabilities. Update support also keeps devices useful for longer, because the development team can respond to changes in the operating environment and customer expectations.
Supporting firmware updates requires a number of device-side and infrastructure capabilities. Update reliability also improves significantly when your organization adopts supporting processes.
Firmware updates are a good example of Software Engineering as applied to embedded systems. We have to carefully design the update mechanism, account for failure modes, and make tradeoffs based on our system’s design goals and constraints.
Table of Contents:
- Device Capabilities
- Infrastructure Capabilities
- Supporting Processes
- Accounting for the Possibility of Failure
- Sub-topics and Variations
- Case Studies of Update-related Problems and Vulnerabilities
- Related Blog Posts
- References
Device Capabilities
Required
- Device software is split into multiple images.
- At a minimum, you will need to split the device into a Bootloader and Application.
- Software may be further refined depending on reliability and update schemes, such as into a Loader, a distinct Updater, or a Fallback Image
- Fail-safe support in case of an update failure or bad update
- An integrity check (e.g., a checksum), ensuring that the provided binary has not been corrupted during transfer
- The device can report its software version
- The update mechanism is resilient against power and network loss during the update process. This is necessary to avoid bricking devices!
- Ideally, updates will be “atomic”: either the whole update is applied, or no update occurs at all.
- A method for receiving firmware updates (whether via USB connection, SD card, or Over-the-Air)
Recommended
- Code signing support, which is used to verify both provenance and integrity of an update
- Support for rolling back to a previous version on command (many implementations only allow you to increase the version)
- Version data storage, schemas, and communication protocols to support data migrations in response to an update process
- Ability to specify pre- and post-update actions (e.g., a script) in addition to the firmware update
- This can be extremely useful for supporting actions like data migrations or file removals, which might need to be executed after an update has completed.
- This can also be useful for implementing post-update sanity checks to make sure that the update process completed successfully. If the checks do not pass, roll back to the previous version.
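The rollback item above implies a version policy on the device: normally only newer versions are accepted, but an explicit command can permit a downgrade. The following is a minimal sketch of that decision, assuming dotted numeric version strings such as `1.4.2` (the version format is an assumption, not something specified here). The names `parse_version` and `should_accept` are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def should_accept(current: str, offered: str, allow_rollback: bool = False) -> bool:
    """Decide whether the device should install the offered version.

    Newer versions are always accepted. Older versions are accepted only
    when a rollback has been explicitly requested. Re-installing the same
    version is refused.
    """
    cur, new = parse_version(current), parse_version(offered)
    if new > cur:
        return True
    if new < cur:
        return allow_rollback
    return False
```

Comparing parsed tuples rather than raw strings avoids the classic mistake of treating `1.10.0` as older than `1.2.0`.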
Infrastructure Capabilities
Required
- The build system must produce unique software versions
- Store and index software build artifacts
- A mechanism for pushing a new update or indicating that a new update is available
Recommended
- Minimally, produce a checksum that can be used to verify the binary was transferred without error. Ideally, code signing will be used instead, as you can also verify that the update is coming from an authorized source.
- Cohort binning of devices enables you to control which devices receive specific firmware updates.
- This is useful for deploying beta builds to an interested population of beta testers.
- This is also a common way of implementing staged rollouts.
- Staged rollouts of firmware updates provide a safer update mechanism than a “deploy to everyone” approach. You start with a small population of devices to make sure that the update succeeds and does not introduce significant new issues. If everything looks good, you continue to roll out the update to increasingly large segments of your population.
- Ability to roll back firmware to a previous version in the event of a bad update
- Check-in and heartbeat messages are useful for determining:
- Whether or not an update was successful (a device will check in with a new firmware version)
- The distribution of versions throughout the fleet
- Often, teams are surprised to realize that there’s a distribution of versions, even when a new OTA update is released. Also, you will find that some devices never update.
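Cohort binning, staged rollouts, and version tracking can be sketched on the server side as follows. This is one possible approach, not a description of any particular fleet-management product: it assumes each device has a stable string ID, hashes that ID into one of 100 buckets, and widens the rollout by raising a percentage. All function names are hypothetical.

```python
import hashlib
from collections import Counter


def rollout_bucket(device_id: str, release: str) -> int:
    """Map a device to a stable bucket in [0, 100) for a given release.

    Including the release name in the hash means a different subset of
    devices goes first for each release.
    """
    digest = hashlib.sha256(f"{release}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_eligible(device_id, release, rollout_percent, beta_cohort=frozenset()):
    """Beta testers always receive the release; others follow the percentage."""
    if device_id in beta_cohort:
        return True
    return rollout_bucket(device_id, release) < rollout_percent


def version_distribution(latest_checkins: dict) -> Counter:
    """Count devices per firmware version from {device_id: reported_version}."""
    return Counter(latest_checkins.values())
```

Because the bucket is deterministic, raising `rollout_percent` from 10 to 50 only adds devices; every device that was eligible at 10% remains eligible. The distribution derived from check-in messages then shows how far the update has actually propagated.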
Supporting Processes
- Exclusive use of the customer-facing firmware update mechanism to ensure its reliability
- Many teams leave OTA updates to the end of the project, for example, which is far too late in the process to ensure reliability. A better approach is implementing OTA updates first, and then requiring all development and internal testing to use OTA updates rather than JTAG or USB. This way, the update mechanisms receive significant mileage, and the kinks are worked out before the product is released to customers.
- Significant testing of the update process, especially with the use of fault injection to ensure that fallbacks and fail-safes work as intended
- Version Data Storage, Protocols, and Schemas
- Data migrations are a common challenge that you will need to deal with when updating devices. For example, you might update the “device settings” layout, change a communication protocol, or update an SQLite database schema.
- Without versioning these items, you cannot safely perform a migration as part of the firmware update process.
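To illustrate why versioning matters, here is a minimal sketch of a stepwise settings migration. It assumes the settings are stored as a dictionary with a `schema_version` field; the field names and the two migration steps are invented for illustration.

```python
CURRENT_SCHEMA = 3  # hypothetical schema version expected by the new firmware


def _v1_to_v2(old: dict) -> dict:
    """Hypothetical change: the 'name' field was renamed to 'device_name'."""
    new = dict(old)
    new["device_name"] = new.pop("name")
    new["schema_version"] = 2
    return new


def _v2_to_v3(old: dict) -> dict:
    """Hypothetical change: a new field was added with a safe default."""
    new = dict(old)
    new.setdefault("telemetry_enabled", False)
    new["schema_version"] = 3
    return new


MIGRATIONS = {1: _v1_to_v2, 2: _v2_to_v3}


def migrate(settings: dict) -> dict:
    """Apply migrations one version at a time until the schema is current."""
    version = settings.get("schema_version")
    if not isinstance(version, int) or not 1 <= version <= CURRENT_SCHEMA:
        # Unversioned or newer-than-known data cannot be migrated safely.
        raise ValueError(f"cannot migrate settings with version {version!r}")
    while version < CURRENT_SCHEMA:
        settings = MIGRATIONS[version](settings)
        version = settings["schema_version"]
    return settings
```

Chaining single-step migrations means a device that skipped several releases can still be brought up to date, which matters given that fleets rarely run a single version.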
Accounting for the Possibility of Failure
Firmware updates can go wrong in many ways. It is important to ensure that your update system is resilient to all of these failures. After all, if you brick devices, you cannot remotely fix them.
To protect against data corruption during the transfer, you should compare the received contents against a checksum to ensure the integrity of the update. Ideally, however, you will use code signing to provide both an integrity check and an assurance that the build comes from an approved source.
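As a minimal sketch of the checksum comparison, the example below uses SHA-256 from the Python standard library as the checksum. The choice of SHA-256 is an assumption, and the function names are hypothetical. Note that this only detects corruption; it does not prove that the image came from an approved source, which is what code signing adds.

```python
import hashlib
import hmac


def sha256_hex(image: bytes) -> str:
    """Checksum published by the build system alongside the binary."""
    return hashlib.sha256(image).hexdigest()


def verify_image(image: bytes, expected_hex: str) -> bool:
    """Return True only if the received image matches the expected checksum."""
    return hmac.compare_digest(sha256_hex(image), expected_hex.lower())
```

A short usage example:

```python
firmware = b"\x7fFIRMWARE-IMAGE" * 64
published = sha256_hex(firmware)

received = bytearray(firmware)
received[100] ^= 0x01  # simulate a single flipped bit during transfer

print(verify_image(firmware, published))         # intact transfer
print(verify_image(bytes(received), published))  # corrupted transfer
```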
Updates should, ideally, be atomic: either the whole update is applied, or no update occurs at all. This is especially important in guarding against corruption due to loss of network connectivity or loss of power during an update.
The most common approach to atomic updates is to have dual application partitions in device storage, which we will call “A” and “B”. This approach is akin to the common “double buffering” pattern.
- The bootloader will boot from partition A, which is currently active.
- When an update is received, the contents will be placed into partition B.
- If the update process fails for any reason, the bootloader will continue to boot from partition A.
- If the update succeeds, the bootloader will boot from partition B.
- If a problem is identified during the boot process, this can be indicated to the bootloader, which can automatically fall back to partition A.
- When the next update is received, it will be placed into partition A.
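The A/B sequence above can be modeled as a small state machine. The sketch below is a host-side model of the logic, not bootloader code: on a real device, the state would live in a persistent flash record and the switch of the active partition would need to be a single atomic write. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    version: str = ""
    valid: bool = False


@dataclass
class BootState:
    active: str = "A"
    slots: dict = field(
        default_factory=lambda: {"A": Slot("1.0.0", True), "B": Slot()}
    )

    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def install(self, version: str, image_ok: bool) -> bool:
        """Write an update into the inactive slot and switch only on success."""
        target = self.inactive()
        # Invalidate the target first so an interrupted write is never booted.
        self.slots[target] = Slot(version, valid=False)
        if not image_ok:
            return False  # the active slot is untouched and keeps booting
        self.slots[target].valid = True
        self.active = target
        return True

    def report_boot_failure(self) -> None:
        """Fall back to the other slot if it still holds a valid image."""
        other = self.inactive()
        if self.slots[other].valid:
            self.slots[self.active].valid = False
            self.active = other
```

The key property is that the active partition is never written to, so any failure before the final switch leaves the device booting the image it was already running.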
However, memory and storage constraints can make atomic updates difficult to achieve with many embedded devices.
- RAM may be sufficiently limited such that an entire update payload cannot be received before being applied, but must be streamed to flash instead.
- This means that the contents of flash could be overwritten with data before it can be determined that the checksum or signature matches the expected value.
- Flash may be sufficiently limited such that there is not space for a bootloader, two complete applications, and other artifacts.
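When the payload must be streamed to flash, the checksum can be computed incrementally as chunks arrive, and the destination slot can be marked invalid until the final comparison passes. The following sketch models that idea. `StagingSlot` stands in for a flash region and is purely hypothetical, and SHA-256 is an assumed choice of checksum.

```python
import hashlib


class StagingSlot:
    """Stand-in for a flash region with a persistent 'valid' marker."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = bytearray()
        self.valid = False

    def begin(self) -> None:
        # Clear the valid marker before the first write.
        self.data.clear()
        self.valid = False

    def write(self, chunk: bytes) -> None:
        if len(self.data) + len(chunk) > self.capacity:
            raise ValueError("update does not fit in the slot")
        self.data.extend(chunk)


def stream_update(chunks, expected_sha256_hex: str, slot: StagingSlot) -> bool:
    """Write chunks as they arrive; mark the slot valid only if the hash matches."""
    slot.begin()
    hasher = hashlib.sha256()
    for chunk in chunks:
        slot.write(chunk)
        hasher.update(chunk)  # no need to hold the whole image in RAM
    slot.valid = hasher.hexdigest() == expected_sha256_hex.lower()
    return slot.valid
```

If power or connectivity is lost mid-stream, the slot simply remains marked invalid, so the bootloader has a reliable signal not to boot the partial image.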
In cases where the dual partition approach cannot be used, we will create a “fallback” application, which effectively takes the place of the second partition:
- Updates will always be placed into the main application slot.
- If an update fails, or some problem is identified during the application boot process, the bootloader will automatically boot into the fallback application.
- The fallback application contains only the minimal amount of support to configure the processor and its components so that it can connect to the server and request a new update. This allows it to be much smaller than a complete second application.
This fallback firmware must be heavily tested so that it can be trusted to restore a system to a working state in the event of an update failure. Ideally, it will not need any updates once the device is deployed, as there is no fallback in place when the fallback firmware update fails.
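One way a bootloader might decide between the main application and the fallback is to combine the image validity marker with a boot-attempt counter that the application clears once it is running correctly. The counter is an assumption added for illustration (the scheme described here only requires that boot problems be detected somehow), and all names and the limit of three attempts are hypothetical.

```python
from dataclasses import dataclass

MAX_BOOT_ATTEMPTS = 3  # hypothetical limit on unconfirmed boots


@dataclass
class BootRecord:
    """Model of a small persistent record shared by bootloader and application."""
    app_valid: bool = True
    pending_boots: int = 0


def choose_image(record: BootRecord) -> str:
    """Bootloader decision: main application or fallback."""
    if not record.app_valid or record.pending_boots >= MAX_BOOT_ATTEMPTS:
        return "fallback"
    # Count this attempt before jumping, so a crash during boot is remembered.
    record.pending_boots += 1
    return "application"


def confirm_boot(record: BootRecord) -> None:
    """Called by the application once it has started successfully."""
    record.pending_boots = 0
```

With this arrangement, an application that repeatedly crashes before calling `confirm_boot` eventually lands the device in the fallback image, where it can request a new update.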
If you cannot support any of these schemes with your current resources, you will need to add more storage or avoid OTA updates completely. Otherwise, you run the risk of a power or network failure completely disabling your devices in the field. Wired updates are less sensitive here, as long as you provide a tool that can be used to re-flash the device from a corrupted state (e.g., a DFU utility).
Sub-topics and Variations
- Over-the-Air Update (OTA) refers to sending updates to a system wirelessly.
- Delta Update is an optimization of a full firmware update process that only involves transmitting pieces of the program that have changed. It is useful for systems with limited communication bandwidth or high communication costs.
- The Update Nightmare: Bricking Devices in the Field looks at real-world examples of failed updates as a motivation to support Staged Rollouts.
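To make the delta update idea concrete, here is a deliberately naive block-level sketch: the server compares the old and new images in fixed-size blocks and transmits only the blocks that differ. Production delta tools use far more sophisticated binary diffing, so treat this as an illustration of the concept rather than a recommended format. The block size, the delta layout, and the function names are all assumptions.

```python
BLOCK_SIZE = 16  # hypothetical; real systems often align this to flash pages


def make_delta(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Server side: collect only the blocks of `new` that differ from `old`."""
    changed = {}
    for offset in range(0, len(new), block_size):
        block = new[offset:offset + block_size]
        if old[offset:offset + block_size] != block:
            changed[offset] = block
    return {"new_length": len(new), "blocks": changed}


def apply_delta(old: bytes, delta: dict) -> bytes:
    """Device side: rebuild the new image from the old image plus the delta."""
    length = delta["new_length"]
    # Start from the old image, truncated or padded to the new length.
    image = bytearray(old[:length].ljust(length, b"\xff"))
    for offset, block in delta["blocks"].items():
        image[offset:offset + len(block)] = block
    return bytes(image)
```

The reconstructed image should still pass the same integrity or signature check as a full update, since a delta applied to the wrong base version produces a corrupt result.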
Case Studies of Update-related Problems and Vulnerabilities
- Another vulnerability in the LPC55S69 ROM / Oxide describes a problem in the LPC55S69 In-System Programming code for the signing mechanism, which allows an attacker to gain non-persistent code execution with a carefully crafted update regardless of whether the update is signed
Related Blog Posts
- Q&A: Where Should Firmware Update Capabilities Live?
- Q&A: How Do You Manage Updates that Introduce Incompatible Changes?
- Over-the-Air Updates and Fleet Management at Scale
