Embedded Cpp Series

Regular Expressions for Embedded C++

Today we have a guest post from Klemens Morgenstern, an embedded C++ consultant. You can learn more about Klemens on his website.

Table of Contents:

  1. Regular Expressions for Embedded C++
  2. A Quick Note on Versioning
  3. How to Get the Library
  4. Regex
    1. Matching a Number
    2. Compile time magic
    3. Matching a Protocol
      1. Frame Detection
      2. Fixed Frame Sizes
      3. Payload Types
    4. Binary Data
  5. Summary
  6. Further Reading

Regular Expressions for Embedded C++

C++11 added regular expression (regex) support to the standard library. However, this implementation does not work well for embedded systems. The regex objects have the regex specification passed in the constructor call, and then they construct an underlying state machine. Not only does this require dynamic memory allocations (and not to forget runtime), but it will also throw an exception if an invalid regex is provided.

There are libraries that try to be more lightweight, e.g. tiny-regex-cpp. But even the tiny-regex-cpp library computes the regex at startup, increasing the startup-time.

But what if new language features of C++ allowed us to build our regex state machine at compile-time? Well there's no need to wonder, since Hana Dusikova implemented a library that does it: Compile time regular expressions v2

In this article we will go through code examples and show the generated ARM assembly output using GCC 8.

A quick note on versioning

C++20 finally allows string to be passed as template parameters. But this is only supported by GCC9, while the newest release of the arm-gcc is version 8.

ctre::match<"REGEX">(subject); // C++20
"REGEX"_ctre.match(subject); // C++17 + N3599 extension

For this article we will stick to the C++17 version.

How to get the library

To use this library, download single_header/ctre.hpp from GitHub.

Regex

Regular expressions are a language to express character sequences. With the help of regular expressions you can find a substring (regex_search) or match a full string regex_match). To get familiar with regex, I highly recommend looking at Regex101.

Matching a number

Let's consider a simple regex that matches a number:

[0-9]+

This translates to: "match any character between 0 and 9 for one or more times (+)".

Here's how we would write that regex expression with CTRE:

#include <ctre.hpp>

using namespace ctre::literals;

//Declare the regex
constexpr auto rx = "[0-9]+"_ctre;

//Test it matches a string at compile time
static_assert(rx.match("1234567890"));

//Test it doesn't match the wrong one
static_assert(!rx.match("foo"));

Above, we have the regex defined and invoke it at compile-time to assert it matches a number, but not a name, as expected. This is already useful, but we might want actually use the result.

So let's match the string corresponding to the integer and then transform it into an actual int value. Note that we won't consider overflows for brevity sake

//return std::nullopt n case it can't match
constexpr std::optional<unsigned int> to_int(std::string_view sv)
{
    if (!rx.match(sv))
        return std::nullopt;

    int res = 0;
    for (auto c : sv)
    {
        res = (c - '0') + res*10;
    }

    return res;
}

//Check it matches
static_assert(to_int("12345"));
//Check the value
static_assert(*to_int("12345") == 12345);

Now we have a function that can convert a string to int at compile time. Thus far no assembly is generated: everything is resolved by the compiler.

To generate assembly output (for demonstration purposes), we can add a non-constexpr function call:

//force the compiler to export something
auto to_int_rt(std::string_view sv)
{
    return to_int(sv);
}

GCC generates the following assembly code:

to_int_rt(std::basic_string_view<char, std::char_traits<char> >):
        push    {r4, lr}
        sub     sp, sp, #8
        add     r3, sp, #8
        stmdb   r3, {r1, r2}
        ldm     sp, {r1, r3}
        add     r1, r3, r1
        cmp     r1, r3
        beq     .L2
        ldrb    r2, [r3]        @ zero_extendqisi2
        sub     r2, r2, #48
        cmp     r2, #9
        bhi     .L2
        mov     r2, r3
        sub     r4, r1, #1
        rsb     lr, r3, #1
.L3:
        cmp     r4, r2
        add     ip, lr, r2
        beq     .L16
        ldrb    ip, [r2, #1]!   @ zero_extendqisi2
        sub     ip, ip, #48
        cmp     ip, #9
        bls     .L3
.L2:
        mov     r3, #0
        strb    r3, [r0, #4]
        add     sp, sp, #8
        pop     {r4, pc}
.L16:
        cmp     ip, #0
        beq     .L2
        mov     r2, #0
.L6:
        ldrb    ip, [r3], #1    @ zero_extendqisi2
        add     r2, r2, r2, lsl #2
        sub     ip, ip, #48
        cmp     r1, r3
        add     r2, ip, r2, lsl #1
        bne     .L6
        mov     r3, #1
        str     r2, [r0]
        strb    r3, [r0, #4]
        add     sp, sp, #8
        pop     {r4, pc}

The fact that we can post the entire function here shows how efficient the code generation is. This regex function requires only 40 lines of assembly, without any function call or memory allocations. We can match regular expressions on the stack and we can use both runtime and compile time matching.

You can see the whole example on Compiler Explorer.

Compile time magic

With the ability of using a regex at compile time, we can make use of information that is already available in the project. Of course, this is highly dependent on the use-case, so this article can only provide very generic examples. Let's take our previous example and enhance it for versioning analysis.

Let's assume we use semantic versioning for our project and we have a VERSION_STRING globally defined. We want to use the version number for some logic.

In this example, we're going to use capture groups. This allows us to get several values out of the regex. Here's the regex for parsing our version number:

([0-9]+)\.([0-9]+)\.([0-9]+)

Here's the compile-time function that can parse the major, minor, and patch numbers from a version string:

#include <ctre.hpp>
#include <optional>

struct version
{
    int major;
    int minor;
    int patch;
};

constexpr std::optional<version> parse_version(std::string_view sv)
{
    using namespace ctre::literals;

    //Declare the regex
    constexpr auto version_rx = "([0-9]+)\\.([0-9]+)\\.([0-9]+)"_ctre;

    //If we use capture groups we get a tuple, so structured bindings work wonders.
    const auto [match, major_str, minor_str, patch_str] = version_rx.match(sv);

    //We need that when running it at compile time, BUT 
    if (!match)
    {
        return std::nullopt;
    }

    //C++17 constexpr lambda
    auto to_int= [](std::string_view sv) constexpr
         {
            int res = 0;
            for (auto c : sv)
                res = (c - '0') + res*10;
            return res;
         };

    //Return the integer values
    return version{
            to_int(major_str),
            to_int(minor_str),
            to_int(patch_str)
        };
}

constexpr auto version = *parse_version("1.3.5");
static_assert(version.major == 1);
static_assert(version.minor == 3);
static_assert(version.patch == 5);

Of course, this function can also be used at runtime.

Let's take our example a step further. Our version string is defined by an external tool, but we have the VERSION_STRING preprocessor variable available. We can make use of it in the code like this:

static_assert(version.major == 1, "This feature is only available in version 1");
static_assert(version.major == 1 && version.minor < 5, 
 "This workaround is a patch, and should be fixed properly with minor .6");

The full example is available on Compiler Explorer.

Matching a protocol

Frame detection

As a more practical example on a controller, let's assume we have a simple protocol with a variadic frame defined in the following way:

  • ~ start frame
  • payload
  • check_sum (4 hex chars)
  • # end frame

The payload itself is not defined in greater details, but we know that ~ and $ are escaped.

We can use a regex to match this simple protocol definition:

~([^~$]+)([0-9A-Fa-f]{4})#

~ //Literally match
[^~$]+ // match any character except ~$
[0-9A-Fa-f]{4} // match any hex character four times

And here's the C++ code that we need to match a frame:

#include <ctre.hpp>
#include <optional>

using namespace ctre::literals;
constexpr auto frame_rx = "~([^~$]+)([0-9A-Fa-f]{4})#"_ctre;

auto match_frame(std::string_view sv)
{
    auto [match, payload, check_sum] = frame_rx.match(sv);
    return std::make_tuple(payload, check_sum);
}

If you look into the assembly, you will notice a few more calls into the CRTE library. This is because the regex needs to look ahead, since [^~$] also matches the checksum. If we don't know the size of the payload, this behavior is unavoidable.

Fixed frame sizes

Let's say we do know the size of the payload. We'll assume that there are three different kinds of messages contained in the payload:

Type Size Regex
f 6 f.{6}
b 2 b.{2}
n 0 n

Combining those with our regex, we can now precisely predict the size and remove the recursive function call:

~(f.{4}|b.{2}|n)([0-9A-Fa-f]{4})#

You can see the resulting assembly on Compiler Explorer.

Payload types

Since we now already know how to detect the payload type, why not make use of it?

We already use alternatives, so we can modify our payload-capturing group to capture each type individually:

~(?:(f.{4})|(b.{2})|(n))([0-9A-Fa-f]{4})#

Note that (?:) represents a non-capturing group.

The regex above gives us four capture groups, f, b, n and check_sum. However only one of the payload groups will contain a value, so the regex allows us to check which one.

We can use this method to obtain the payload data:

constexpr auto frame_rx = "~(?:(f.{4})|(b.{2})|(n))([0-9A-Fa-f]{4})#"_ctre;

enum class payload_type {
    f, b, n
};

auto match_frame(std::string_view sv)
{
    auto [match, f,b,n, check_sum] = frame_rx.match(sv);
    if (f)
    {
        return std::make_tuple(payload_type::f, f.to_view(), check_sum);
    }
    else if (b)
    {
        return std::make_tuple(payload_type::b, b.to_view(), check_sum);
    }
    else
    {
        return std::make_tuple(payload_type::n, n.to_view(), check_sum);
    }
}

This is a very basic approach for demonstration purposes. Instead of using an enumerator, each payload type could be parsed into a class and returned as a std::variant.

See the full example on Compiler Explorer.

Binary data

So far we have only used ASCII data (the library also supports Unicode). What if we are supposed to work with binary data?

Regex does allow matching arbitrary values, but it should be noted that this refers to the index of the encoding. So testing this code for is advised. If values are in the ASCII range (<128) data-preserve-html-node="true" there should be no issues.

So let's take our former protocol, but make it binary:

  • STX start frame
  • payload
  • check_sum 2 bytes
  • ETX end frame

STX and ETX have to be escaped in the payload, but not in the check_sum, yielding us a bit of overhead

\x02([^\x02\x03]+)(.{2})\x03

We can extract the payload & check_sum. Transforming the binary checksum into an integer is also trivial.

constexpr auto frame_rx = "\\x02([^\\x02\\x03]+)(.{2})\\x03"_ctre;
auto match_frame(std::string_view sv)
{
    auto [match, payload, check_sum_str] = frame_rx.match(sv);
    unsigned char check_sum = 0;

    if (check_sum_str)
    {
        check_sum = (check_sum_str.to_view()[0] << 4) | check_sum_str.to_view()[1];
    }

    return std::make_tuple(payload, check_sum);
}

It might seem strange to use regex for binary data. Regular expressions are by far the most popular pattern matching tool, and the behaviour of binary data is well defined. Since we can make use of the powerful engine that compile-time expressions give us, that surely makes it worth our while.

You can see the full example on Compiler Explorer.

Summary

With the increased support for compile-time evaluation that C++ gets in every standard revision, code optimization not only increases, but more powerful tools become available. The compile-time regular expression library allows us to use regular expressions without requiring a heap or exception support, thus making regex usable in the most constrained software environments.

The full power of compile-time string evaluation is apparent in C++20, not C++17, so it is reasonable to expect more than just regular expressions in our future. We should expect to see full parsers implemented in this way.

Further Reading

SaferC++ Provides Safer Native Type Implementations

The SaferC++ library provides safer implementations for many native C++ types. The library provides features such as:

  • Data types that are designed for multi-threaded use and asynchronous access
  • Drop-in replacements for std::vector, std::array, std::string, std::string_view that provide improved memory safety
  • Drop-in replacements for int, size_t, and bool that protect against use of uninitialized values and sign-unsigned comparison issues (similar to type_safe)
  • Improved pointer & reference types with different compatibility and performance tradeoffs

SaferC++ is usable with embedded systems as long as your platform has a functional STL implementation. Exception behavior can be controlled for your platform by modifying the MSE_CUSTOM_THROW_DEFINITION macro.

Using the library does incur a performance penalty. However, SaferC++ elements can be disabled during compile time (i.e. replaced with the standard type equivalents). This allows users to enable debug and test builds to use safer-but-slower features without adding overhead to release builds.

Since the SaferC++ types provide added safety and can be disabled when performance matters, I highly recommend using their drop-in types to catch and eliminate possible errors when using STL types. The easiest way to get started with SaferC++ is to utilize the mse::vector and mse::array types in place of std::vector and std::array. These types will help you catch potential memory issues lurking in your software. The README provides further tips for making your code safer.

Further Reading

For more on SaferC++:

nothrow new: the Variant to Use When Avoiding C++ Exceptions

I'm a relatively young C++ developer, having spent the majority of my career programming in C. While I've devoted myself to the study of the language, I still frequently learn about new features and capabilities that surprise me.

I recently learned about nothrow new, a version of the new operator which returns a nullptr on failure instead of throwing an exception. While I have never seen this new operator usage in the wild, it's a valuable addition to my toolkit, as I often write embedded C++ programs without exceptions enabled.

In order to use the std::nothrow variants of new, you need to include the <new> header. Then you can overload new with std::nothrow:

int32_t* buffer = new (std::nothrow) int[100000000ul]; //non-throwing overload

If your system cannot allocate the memory, new will return nullptr rather than throwing std::bad_alloc.

Further Reading