Today we have a guest post from Klemens Morgenstern, an embedded C++ consultant. You can learn more about Klemens on his website.
Table of Contents:
- Regular Expressions for Embedded C++
- A Quick Note on Versioning
- How to Get the Library
- Regex
- Summary
- Further Reading
Regular Expressions for Embedded C++
C++11 added regular expression (regex) support to the standard library. However, this implementation does not work well for embedded systems. The regex objects have the regex specification passed in the constructor call, and then they construct an underlying state machine. Not only does this require dynamic memory allocations (and not to forget runtime), but it will also throw an exception if an invalid regex is provided.
There are libraries that try to be more lightweight, e.g. tiny-regex-cpp. But even the tiny-regex-cpp library computes the regex at startup, increasing the startup-time.
But what if new language features of C++ allowed us to build our regex state machine at compile-time? Well there’s no need to wonder, since Hana Dusikova implemented a library that does it: Compile time regular expressions v2
In this article we will go through code examples and show the generated ARM assembly output using GCC 8.
A quick note on versioning
C++20 finally allows string to be passed as template parameters. But this is only supported by GCC9, while the newest release of the arm-gcc is version 8.
ctre::match<"REGEX">(subject); // C++20
"REGEX"_ctre.match(subject); // C++17 + N3599 extension
For this article we will stick to the C++17 version.
How to get the library
To use this library, download single_header/ctre.hpp from GitHub.
Regex
Regular expressions are a language to express character sequences. With the help of regular expressions you can find a substring (regex_search) or match a full string regex_match). To get familiar with regex, I highly recommend looking at Regex101.
Matching a number
Let’s consider a simple regex that matches a number:
[0-9]+
This translates to: “match any character between 0 and 9 for one or more times (+)”.
Here’s how we would write that regex expression with CTRE:
#include <ctre.hpp>
using namespace ctre::literals;
//Declare the regex
constexpr auto rx = "[0-9]+"_ctre;
//Test it matches a string at compile time
static_assert(rx.match("1234567890"));
//Test it doesn't match the wrong one
static_assert(!rx.match("foo"));
Above, we have the regex defined and invoke it at compile-time to assert it matches a number, but not a name, as expected. This is already useful, but we might want actually use the result.
So let’s match the string corresponding to the integer and then transform it into an actual int value. Note that we won’t consider overflows for brevity sake
//return std::nullopt n case it can't match
constexpr std::optional<unsigned int> to_int(std::string_view sv)
{
if (!rx.match(sv))
return std::nullopt;
int res = 0;
for (auto c : sv)
{
res = (c - '0') + res*10;
}
return res;
}
//Check it matches
static_assert(to_int("12345"));
//Check the value
static_assert(*to_int("12345") == 12345);
Now we have a function that can convert a string to int at compile time. Thus far no assembly is generated: everything is resolved by the compiler.
To generate assembly output (for demonstration purposes), we can add a non-constexpr function call:
//force the compiler to export something
auto to_int_rt(std::string_view sv)
{
return to_int(sv);
}
GCC generates the following assembly code:
to_int_rt(std::basic_string_view<char, std::char_traits<char> >):
push {r4, lr}
sub sp, sp, #8
add r3, sp, #8
stmdb r3, {r1, r2}
ldm sp, {r1, r3}
add r1, r3, r1
cmp r1, r3
beq .L2
ldrb r2, [r3] @ zero_extendqisi2
sub r2, r2, #48
cmp r2, #9
bhi .L2
mov r2, r3
sub r4, r1, #1
rsb lr, r3, #1
.L3:
cmp r4, r2
add ip, lr, r2
beq .L16
ldrb ip, [r2, #1]! @ zero_extendqisi2
sub ip, ip, #48
cmp ip, #9
bls .L3
.L2:
mov r3, #0
strb r3, [r0, #4]
add sp, sp, #8
pop {r4, pc}
.L16:
cmp ip, #0
beq .L2
mov r2, #0
.L6:
ldrb ip, [r3], #1 @ zero_extendqisi2
add r2, r2, r2, lsl #2
sub ip, ip, #48
cmp r1, r3
add r2, ip, r2, lsl #1
bne .L6
mov r3, #1
str r2, [r0]
strb r3, [r0, #4]
add sp, sp, #8
pop {r4, pc}
The fact that we can post the entire function here shows how efficient the code generation is. This regex function requires only 40 lines of assembly, without any function call or memory allocations. We can match regular expressions on the stack and we can use both runtime and compile time matching.
You can see the whole example on Compiler Explorer.
Compile time magic
With the ability of using a regex at compile time, we can make use of information that is already available in the project. Of course, this is highly dependent on the use-case, so this article can only provide very generic examples. Let’s take our previous example and enhance it for versioning analysis.
Let’s assume we use semantic versioning for our project and we have a VERSION_STRING globally defined. We want to use the version number for some logic.
In this example, we’re going to use capture groups. This allows us to get several values out of the regex. Here’s the regex for parsing our version number:
([0-9]+)\.([0-9]+)\.([0-9]+)
Here’s the compile-time function that can parse the major, minor, and patch numbers from a version string:
#include <ctre.hpp>
#include <optional>
struct version
{
int major;
int minor;
int patch;
};
constexpr std::optional<version> parse_version(std::string_view sv)
{
using namespace ctre::literals;
//Declare the regex
constexpr auto version_rx = "([0-9]+)\\.([0-9]+)\\.([0-9]+)"_ctre;
//If we use capture groups we get a tuple, so structured bindings work wonders.
const auto [match, major_str, minor_str, patch_str] = version_rx.match(sv);
//We need that when running it at compile time, BUT
if (!match)
{
return std::nullopt;
}
//C++17 constexpr lambda
auto to_int= [](std::string_view sv) constexpr
{
int res = 0;
for (auto c : sv)
res = (c - '0') + res*10;
return res;
};
//Return the integer values
return version{
to_int(major_str),
to_int(minor_str),
to_int(patch_str)
};
}
constexpr auto version = *parse_version("1.3.5");
static_assert(version.major == 1);
static_assert(version.minor == 3);
static_assert(version.patch == 5);
Of course, this function can also be used at runtime.
Let’s take our example a step further. Our version string is defined by an external tool, but we have the VERSION_STRING preprocessor variable available. We can make use of it in the code like this:
static_assert(version.major == 1, "This feature is only available in version 1");
static_assert(version.major == 1 && version.minor < 5,
"This workaround is a patch, and should be fixed properly with minor .6");
The full example is available on Compiler Explorer.
Matching a protocol
Frame detection
As a more practical example on a controller, let’s assume we have a simple protocol with a variadic frame defined in the following way:
~start frame- payload
- check_sum (4 hex chars)
#end frame
The payload itself is not defined in greater details, but we know that ~ and $ are escaped.
We can use a regex to match this simple protocol definition:
~([^~$]+)([0-9A-Fa-f]{4})#
~ //Literally match
[^~$]+ // match any character except ~$
[0-9A-Fa-f]{4} // match any hex character four times
And here’s the C++ code that we need to match a frame:
#include <ctre.hpp>
#include <optional>
using namespace ctre::literals;
constexpr auto frame_rx = "~([^~$]+)([0-9A-Fa-f]{4})#"_ctre;
auto match_frame(std::string_view sv)
{
auto [match, payload, check_sum] = frame_rx.match(sv);
return std::make_tuple(payload, check_sum);
}
If you look into the assembly, you will notice a few more calls into the CRTE library. This is because the regex needs to look ahead, since [^~$] also matches the checksum. If we don’t know the size of the payload, this behavior is unavoidable.
Fixed frame sizes
Let’s say we do know the size of the payload. We’ll assume that there are three different kinds of messages contained in the payload:
| Type | Size | Regex |
|---|---|---|
f |
6 | f.{6} |
b |
2 | b.{2} |
n |
0 | n |
Combining those with our regex, we can now precisely predict the size and remove the recursive function call:
~(f.{4}|b.{2}|n)([0-9A-Fa-f]{4})#
You can see the resulting assembly on Compiler Explorer.
Payload types
Since we now already know how to detect the payload type, why not make use of it?
We already use alternatives, so we can modify our payload-capturing group to capture each type individually:
~(?:(f.{4})|(b.{2})|(n))([0-9A-Fa-f]{4})#
Note that (?:) represents a non-capturing group.
The regex above gives us four capture groups, f, b, n and check_sum. However only one of the payload groups will contain a value, so the regex allows us to check which one.
We can use this method to obtain the payload data:
constexpr auto frame_rx = "~(?:(f.{4})|(b.{2})|(n))([0-9A-Fa-f]{4})#"_ctre;
enum class payload_type {
f, b, n
};
auto match_frame(std::string_view sv)
{
auto [match, f,b,n, check_sum] = frame_rx.match(sv);
if (f)
{
return std::make_tuple(payload_type::f, f.to_view(), check_sum);
}
else if (b)
{
return std::make_tuple(payload_type::b, b.to_view(), check_sum);
}
else
{
return std::make_tuple(payload_type::n, n.to_view(), check_sum);
}
}
This is a very basic approach for demonstration purposes. Instead of using an enumerator, each payload type could be parsed into a class and returned as a std::variant.
See the full example on Compiler Explorer.
Binary data
So far we have only used ASCII data (the library also supports Unicode). What if we are supposed to work with binary data?
Regex does allow matching arbitrary values, but it should be noted that this refers to the index of the encoding. So testing this code for is advised. If values are in the ASCII range (<128) there should be no issues.
So let’s take our former protocol, but make it binary:
STXstart frame- payload
- check_sum 2 bytes
ETXend frame
STX and ETX have to be escaped in the payload, but not in the check_sum, yielding us a bit of overhead
\x02([^\x02\x03]+)(.{2})\x03
We can extract the payload & check_sum. Transforming the binary checksum into an integer is also trivial.
constexpr auto frame_rx = "\\x02([^\\x02\\x03]+)(.{2})\\x03"_ctre;
auto match_frame(std::string_view sv)
{
auto [match, payload, check_sum_str] = frame_rx.match(sv);
unsigned char check_sum = 0;
if (check_sum_str)
{
check_sum = (check_sum_str.to_view()[0] << 4) | check_sum_str.to_view()[1];
}
return std::make_tuple(payload, check_sum);
}
It might seem strange to use regex for binary data. Regular expressions are by far the most popular pattern matching tool, and the behaviour of binary data is well defined. Since we can make use of the powerful engine that compile-time expressions give us, that surely makes it worth our while.
You can see the full example on Compiler Explorer.
Summary
With the increased support for compile-time evaluation that C++ gets in every standard revision, code optimization not only increases, but more powerful tools become available. The compile-time regular expression library allows us to use regular expressions without requiring a heap or exception support, thus making regex usable in the most constrained software environments.
The full power of compile-time string evaluation is apparent in C++20, not C++17, so it is reasonable to expect more than just regular expressions in our future. We should expect to see full parsers implemented in this way.
Further Reading
- Regex101
- CppCon 2018: Compile Time Regular Expressions
- Compile Time Regular Expressions Library
- Use Semantic Versioning and Give Your Version Numbers Meaning
- Hana Dusikova on GitHub
- C++ Regular Expressions Library
- tiny-regex-cpp
Using C++ Without the Heap
Want to use C++, but worried about how much it relies on dynamic memory allocations? Our course provides a hands-on approach for learning a diverse set of patterns, tools, and techniques for writing C++ code that never uses the heap.
Learn More on the Course Page

You can always use bleeding-edge-toolchain script to build an ARM toolchain with whatever GCC version you wish (even one of the GCC 10 snapshots).
https://github.com/FreddieChopin/bleeding-edge-toolchain
For those who use Windows there are precompiled binaries available here:
http://www.freddiechopin.info/en/download/category/11-bleeding-edge-toolchain
Therefore GCC 9 for ARM was available on the same day GCC 9 was officially released (;
This toolchain is almost identical to the version provided by ARM, but with a couple of important changes, which make it (in my opinion) more suited to use with microcontrollers. C++ exceptions and RTTI are disabled in the libstdc++ itself in all variants of the libraries (not only in the "nano" variant). Couple of newlib options are disabled, as they produce very low gain with a lot of RAM requirement (many optimizations for stdio functions), have almost no use for an embedded project (dynamic registration of atexit handlers) or just cost too much RAM ("large" variant of reent struct). On the other hand several low-cost features are enabled explicitly – for example full support for c99 formats in printf/scanf-like functions.