std::string vs C-strings

Today we'll continue our C-to-C++ migration theme by focusing on std::string, a container-like class used to manage strings. std::string provides much more straightforward string management interfaces, allows you to utilize SBRM (scope-bound resource management) design patterns, and helps eliminate string management overhead.

Let's start off by reviewing built-in string support in C/C++.

Built-in String Support (C-style Strings)

Throughout this article, built-in string support is referred to as "C-style strings", or "C-strings" for short.

Neither C nor C++ has a built-in string type. C-strings are simply implemented as a char array terminated by a null character (a char with the value 0). This last part of the definition is important: all C-strings are char arrays, but not all char arrays are C-strings.

C-strings of this form are called "string literals":

const char * str = "This is a string literal. See the double quotes?";

String literals are indicated by using the double quote (") and are stored as a constant (const) C-string. The null character is automatically appended at the end for your convenience.

The standard library contains functions for processing C-strings, such as strlen, strcpy, and strcat. These functions are defined in the C header string.h and in the C++ header cstring. These standard C-string functions require the strings to be terminated with a null character to function correctly.

Disadvantages of C-strings

C arrays do not track their own size. You must keep track of the size on your own or rely on the linear-time strlen function to determine the length of each string at runtime. Since C has no concept of bounds checking, the null character is of paramount importance: the C library functions require it, or else they read or write past the bounds of the array.

Working with C-strings is not intuitive. Functions are required to compare strings, and the output of the strcmp functions is not intuitive either. For functions like strcpy and strcat, the programmer is required to remember the correct argument order for each call. Inverting arguments can have a non-obvious yet negative effect.
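
As a small illustration of the strcmp point, here is a minimal sketch of an equality helper. The name c_strings_equal is hypothetical and only used for this example:

```cpp
#include <cstring>

// Returns true when both C-strings hold the same characters.
// strcmp returns 0 for equality, so the result must be compared against 0.
bool c_strings_equal(const char* lhs, const char* rhs)
{
    return std::strcmp(lhs, rhs) == 0;
}
```

Writing if(strcmp(a, b)) on its own reads like "if the strings match", but the condition is actually true when the strings differ.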

Many C-strings are used as fixed-size arrays. This is true for literals as well as arrays that are declared in the form char str[32]. For dynamically sized strings, programmers must worry about manually allocating, resizing, and copying strings.

The concept of C-string size/length is not intuitive and commonly results in off-by-one bugs. The null character that marks the end of a C-string requires a byte of storage in the char array. This means that a string of length 24 needs to be stored in a 25-byte char array. However, the strlen function returns the length of the string without the null character. This simple fact has tripped up many programmers (including myself) when copying around memory. Eventually, you end up with a non-null-terminated string, causing the string library functions to operate out-of-bounds.
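
The mismatch between string length and storage size can be sketched as follows. The helper names visible_length and storage_needed are assumptions made for this example:

```cpp
#include <cstddef>
#include <cstring>

// Number of characters in the string, excluding the terminating null character.
std::size_t visible_length(const char* str)
{
    return std::strlen(str);
}

// Number of bytes needed to store the string, including the null character.
std::size_t storage_needed(const char* str)
{
    return std::strlen(str) + 1;
}
```

For the literal "Phillip", the length is 7 but the array holding it occupies 8 bytes. Forgetting the extra byte when allocating or copying is the classic off-by-one bug.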

What If We Could Fix Those Disadvantages?

What if we could fix those disadvantages? What would our ideal string use-case look like? Here are some ideas:

  • Flexible storage capacity
  • Constant-time string length retrieval (rather than a linear-time functional check)
  • No need to worry about manual memory management or resizing strings
  • Boundary issues are handled for me, with or without a null character.
  • Intuitive assignment using the = operator rather than strcpy
  • Intuitive comparison using the == operator rather than strcmp
  • Intuitive interfaces for other operations such as concatenation (+ operator is nice!), tokenization

std::string

Luckily, the C++ std::string class scratches most of these itches for us. Fundamentally, you can consider std::string as a container for handling char arrays, similar to std::vector<char> with some specialized function additions.

The std::string class manages the underlying storage for you, storing your strings in a contiguous manner. You can get access to this underlying buffer using the c_str() member function, which will return a pointer to a null-terminated char array. This allows std::string to interoperate with C-string APIs.
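
Here is a minimal sketch of handing a std::string to a C-string API through c_str(). The wrapper name c_api_length is hypothetical, and strlen simply stands in for any function that expects a const char *:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Passes the std::string's null-terminated buffer to a C API (std::strlen).
std::size_t c_api_length(const std::string& str)
{
    return std::strlen(str.c_str());
}
```

Keep in mind that the pointer returned by c_str() may be invalidated when the string is modified, so it is best used immediately rather than stored.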

Let's take a look at using std::string.

Declaration and Assignment

Declaring a std::string object is simple:

std::string my_str;

You can also initialize it with a C-string:

std::string name("Phillip");

Or initialize it by copying another std::string object:

std::string name2(name);

Or even by making a substring out of another std::string:

std::string lip(name, 4);

There's also a "fill" constructor for std::string which allows you to populate the buffer with a repeated series of characters:

// fill the string with a char. note the single quotes
std::string filled(16, 'A');

Assigning values to a std::string is also simple, as you just need to use the = operator:

// c-string assignment
my_str = "Phillip";

// Copy assignment
my_str = filled;

// Move assignment (name2 is left in a valid but unspecified state)
my_str = std::move(name2);

Isn't this so much easier than using strcpy?
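
The constructors and assignments above can be combined into a compilable sketch. The function names make_lip and assign_demo are assumptions for this example:

```cpp
#include <string>
#include <utility>

// Builds "lip" from "Phillip" using the substring constructor.
std::string make_lip()
{
    std::string name("Phillip");
    return std::string(name, 4); // characters from index 4 to the end
}

// Walks through C-string, copy, and move assignment; returns the final value.
std::string assign_demo()
{
    std::string dest;
    std::string filled(4, 'A');  // fill constructor: "AAAA"
    dest = "Phillip";            // C-string assignment
    dest = filled;               // copy assignment, filled is unchanged
    std::string temp("moved");
    dest = std::move(temp);      // move assignment, temp is valid but unspecified
    return dest;
}
```

After a move assignment, the source string should not be relied upon to hold any particular value until it is assigned again.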

Comparing strings

Comparing strings for equality using std::string is also much more intuitive, as the == operator has been overloaded for comparison:

if(my_str == name2)
{
    std::cout << "my_str and name2 match!" << std::endl;
}

The use of the == operator works as long as one of the values is a std::string. This means we can compare the std::string to a string literal:

if(my_str == "Phillip")
{
    std::cout << "my_str and \"Phillip\" match!" << std::endl;
}

You can also compare strings lexicographically using the other comparison operators (<, >):

if(string1 < string2)
{
    std::cout << "string1 comes first lexicographically" << std::endl;
}

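If you're not familiar with lexicographical ordering, strings are compared character by character, and the first differing character decides the order based on its character code (its ASCII value, for ASCII text). In ASCII, all upper case letters come before the lower case letters, so "apple" > "Apple".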
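
A quick sketch of this ordering, using a hypothetical helper named comes_first:

```cpp
#include <string>

// Returns true if lhs sorts before rhs using std::string's operator<.
bool comes_first(const std::string& lhs, const std::string& rhs)
{
    return lhs < rhs;
}
```

With ASCII text, comes_first("Apple", "apple") is true, and so is comes_first("Zebra", "apple"), since every upper case letter sorts before every lower case letter. A string that is a prefix of another, such as "app" and "apple", sorts first.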

If you prefer a functional comparison interface, std::string also provides a compare function. This function is similar to strcmp:

  • 0 indicates equality
  • Positive values indicate that the second string comes first lexicographically
  • Negative values mean your string object comes first lexicographically.

if(str1.compare(str2) == 0)
{
    std::cout << "These strings are equal" << std::endl;
}

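You can also compare your string against a substring of another string object. The substring is of length Y, starting at position X. The substring overloads of compare take a position and length for your own string first, followed by the other string and its position and length:

if(str1.compare(0, std::string::npos, str2, x, y) == 0)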
{
    std::cout << "String 1 is equal to the substring of String 2" << std::endl;
}
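
A minimal sketch of the substring comparison, wrapped in a hypothetical helper named equals_substring:

```cpp
#include <cstddef>
#include <string>

// Returns true if `whole` equals the substring of `other` that starts at
// index pos and spans len characters.
bool equals_substring(const std::string& whole, const std::string& other,
                      std::size_t pos, std::size_t len)
{
    // compare(pos1, count1, str, pos2, count2): npos means "to the end".
    return whole.compare(0, std::string::npos, other, pos, len) == 0;
}
```

For example, equals_substring("lip", "Phillip", 4, 3) is true. Note that compare throws std::out_of_range if pos is larger than the size of the other string.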

Concatenating Strings

I'm sure at this point you won't be surprised: concatenating two strings is a trivial operation that involves using the + operator:

//Concatenation is also very simple!
my_str = lip + name2;
my_str += "lip"; //C-string cat works too

If you prefer a functional interface, std::string also provides an append function. Each of these functions appends something onto the end of your std::string object:

std::string my_str("test");
std::string str2("boo");
const char * c_str = "This is a c_str";

// We can append a string
my_str.append(str2);
my_str.append(c_str);

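// We can append X characters from the beginning of a string
// (note: with C++14 and later, append(str2, x) would instead append from index X to the end)
my_str.append(str2, 0, x);
my_str.append(c_str, x);

// We can also append a substring, starting at index X and of length Y
// (the C-string is implicitly converted to a temporary std::string here)
my_str.append(str2, x, y);
my_str.append(c_str, x, y);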
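
Here is a compilable sketch of a few append overloads with concrete values. The function name append_demo and the variable c_text are assumptions for this example:

```cpp
#include <string>

// Appends a whole string, a C-string prefix, and a substring in turn.
std::string append_demo()
{
    std::string result("test");
    std::string str2("boo");
    const char* c_text = "This is a c_str";

    result.append(str2);        // whole string: "testboo"
    result.append(c_text, 4);   // first 4 chars of the C-string: "testbooThis"
    result.append(str2, 1, 2);  // substring from index 1, length 2: "testbooThisoo"
    return result;
}
```

The (pointer, count) and (string, position, count) overloads look alike but interpret their arguments differently, so it is worth double-checking which one a call resolves to.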

Accessing Characters

Similar to C-strings, std::string supports the indexing operator [] to access specific characters. Just as with C-strings and arrays, indexing starts at 0. As with other containers, the indexing operator does not support bounds checking. If you wish to have bounds checking applied, you can use the at() member function.
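
The difference between [] and at() can be sketched with a hypothetical helper named char_or, which falls back to a default character when the index is out of range:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Returns the character at index, or fallback if index is out of range.
char char_or(const std::string& str, std::size_t index, char fallback)
{
    try
    {
        return str.at(index); // throws std::out_of_range on a bad index
    }
    catch(const std::out_of_range&)
    {
        return fallback;
    }
}
```

Using str[index] with the same out-of-range index would not throw. It would simply access memory it should not, which is undefined behavior.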

Other std::string Interfaces

std::string provides many other useful interfaces. I'll just provide a brief overview of functionality - full interface documentation can be found at cppreference.

For handling storage:

  • size() and length() both return the length of the std::string
    • size is provided to maintain a common interface with container classes
  • capacity() provides the current number of characters that can be held in the currently allocated storage
  • empty() returns true if a string is currently empty
  • clear() resets the container to an empty string
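  • reserve() requests that the underlying storage buffer be able to hold at least the requested number of characters; the length of the string is unchanged
  • resize() changes the length of the string itself, truncating it or appending characters, with the option of filling new characters with a specific value
  • shrink_to_fit() requests that the buffer be shrunk to the current string size, freeing up unused storage capacity (the request is non-binding)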
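
A short sketch of how capacity and length interact. The function names reserve_keeps_length and resize_demo are hypothetical:

```cpp
#include <string>

// Returns true if reserve() grew the capacity without changing the length.
bool reserve_keeps_length()
{
    std::string str("abc");
    str.reserve(64);
    return str.capacity() >= 64 && str.size() == 3;
}

// resize() changes the string itself, growing with a fill char or truncating.
std::string resize_demo()
{
    std::string str("abc");
    str.resize(5, 'x');  // grows to "abcxx"
    str.resize(4);       // truncates to "abcx"
    return str;
}
```

Calling reserve() up front can be useful when you know roughly how long a string will become, since it may avoid repeated reallocations while appending.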

For modifying strings:

  • insert() can be used to insert characters or strings at a specific position
  • replace() can be used to replace characters in a substring
  • push_back() appends a character to the end of the string
  • pop_back() removes the last character of the string
  • erase() removes specific characters
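
The modifying functions can be chained together in a small sketch. The function name modify_demo is an assumption for this example:

```cpp
#include <string>

// Applies insert, replace, push_back, pop_back, and erase in sequence.
std::string modify_demo()
{
    std::string str("Hello World");
    str.insert(5, ",");          // "Hello, World"
    str.replace(7, 5, "There");  // "Hello, There"
    str.push_back('!');          // "Hello, There!"
    str.pop_back();              // "Hello, There"
    str.erase(5, 1);             // "Hello There"
    return str;
}
```

Each call takes a position (and, where relevant, a count), and the string grows or shrinks as needed without any manual buffer management.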

For working with substrings:

  • substr() returns a copy of the substring at the specified position
  • find() can be used to identify the first position within a string where the specified character or substring can be found
  • rfind() can be used to find the last occurrence of a substring
  • find_first_of() can be used to find the first character that matches any character in a given set
  • find_last_of() can be used to find the last character that matches any character in a given set
  • find_first_not_of() can be used to find the first character that does not match any character in a given set
  • find_last_not_of() can be used to find the last character that does not match any character in a given set
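
As a sketch of the search functions, here are two hypothetical helpers, first_vowel and trim_left. All of the find functions return std::string::npos when nothing is found:

```cpp
#include <cstddef>
#include <string>

// Index of the first lowercase vowel in str, or std::string::npos if none.
std::size_t first_vowel(const std::string& str)
{
    return str.find_first_of("aeiou");
}

// Returns a copy of str without leading spaces and tabs.
std::string trim_left(const std::string& str)
{
    const std::size_t start = str.find_first_not_of(" \t");
    if(start == std::string::npos)
    {
        return std::string(); // the string was empty or all whitespace
    }
    return str.substr(start);
}
```

For "Phillip", find("l") returns 3 and rfind("l") returns 4, while first_vowel returns 2.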

Remember, full documentation can be found on cppreference.com.

A Note on Avoiding Copy Overhead

Unless you want to make a copy of your std::string, you will want to avoid passing around strings by value:

void foo(std::string str);

Instead, you should pass the argument by reference if you want to modify the string:

void foo(std::string &str);

Or by const reference if the string will not be modified:

void foo(const std::string &str);

I very rarely find myself passing around std::string containers by value, since I want to avoid the unnecessary copies.
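
The three parameter styles can be sketched as follows. The function names count_char, add_suffix, and suffixed_copy are assumptions for this example:

```cpp
#include <cstddef>
#include <string>

// Read-only access: an existing std::string is passed without being copied.
std::size_t count_char(const std::string& str, char target)
{
    std::size_t count = 0;
    for(char c : str)
    {
        if(c == target)
        {
            ++count;
        }
    }
    return count;
}

// Mutating access: the caller's string is modified in place.
void add_suffix(std::string& str)
{
    str += "_suffix";
}

// By value: appropriate here because a separate copy is what we want.
std::string suffixed_copy(std::string str)
{
    add_suffix(str);
    return str;
}
```

Passing a string literal to a const std::string & parameter still constructs a temporary std::string, so the reference only avoids copies when the caller already holds a std::string.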

When Should I Use std::string?

Great, now we have some idea of what we can do with a std::string. When and why should I use std::string over C-strings?

Let's consider some of the advantages to using std::string:

  • Ability to utilize SBRM design patterns
  • The interfaces are much more intuitive to use, leading to fewer chances of messing up argument order
  • Better searching, replacement, and string manipulation functions (cf. the cstring library)
  • The size/length functions are constant time (cf. the linear time strlen function)
  • Reduced boilerplate by abstracting memory management and buffer resizing
  • Reduced risk of segmentation faults by utilizing iterators and the at() function
  • Compatible with STL algorithms

Overall, std::string provides a modern interface for string management and will help you write much more straightforward code than C-strings. In general, prefer std::string to C-strings, and especially prefer std::string for mutable strings.
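
To illustrate the STL algorithm compatibility mentioned in the list above, here is a minimal sketch using std::reverse and std::count. The function names reversed and occurrences are hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Returns a reversed copy of str using std::reverse on the string's iterators.
std::string reversed(std::string str)
{
    std::reverse(str.begin(), str.end());
    return str;
}

// Counts how many times target appears in str using std::count.
std::size_t occurrences(const std::string& str, char target)
{
    return static_cast<std::size_t>(std::count(str.begin(), str.end(), target));
}
```

Because std::string exposes begin() and end() like other containers, most iterator-based algorithms work on it without any adaptation.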

std::string Limitations

There's storage overhead involved with using a std::string object. C-strings are the simplest possible storage method for a string, making them attractive in situations where memory must be conserved. However, similar to other C++ containers, I find that this minor overhead is worth the convenience.

When utilizing a std::string, memory must be dynamically allocated and initialized during runtime. You cannot pre-allocate a std::string buffer during compile-time and you cannot supply a pre-existing buffer for std::string to assume ownership over. Unlike std::string, C-strings can utilize compile-time allocation and determination of size. Additionally, memory allocation is handled by the std::string class itself. If you need fine-grained control of memory management, look to manual management with C-strings.

One major gripe I have with std::strings is that they don't play nicely with string literals. String literals are placed in static storage and cannot be taken over by a std::string. Initializing a std::string using a string literal will always involve a copy. C-strings still seem to be the best storage option for string literals, especially if you want to avoid unnecessary copies (such as in an embedded environment).

Putting it All Together

I've written a basic std::string example which can be found in the embedded-resources GitHub repository.

Further Reading