Modern C++ Programming Cookbook - Second Edition
Modern C++ Programming Cookbook: Master C++ core language and standard library features, with over 100 recipes, updated to C++20, Second Edition

Working with Numbers and Strings

Numbers and strings are the fundamental types of any programming language; all other types are based on or composed of these ones. Developers are confronted all the time with tasks such as converting between numbers and strings, parsing and formatting strings, and generating random numbers. This chapter is focused on providing useful recipes for these common tasks using modern C++ language and library features.

The recipes included in this chapter are as follows:

  • Converting between numeric and string types
  • Limits and other properties of numeric types
  • Generating pseudo-random numbers
  • Initializing all the bits of the internal state of a pseudo-random number generator
  • Creating cooked user-defined literals
  • Creating raw, user-defined literals
  • Using raw string literals to avoid escaping characters
  • Creating a library of string helpers
  • Verifying the format of a string using regular expressions
  • Parsing the content of a string using regular expressions
  • Replacing the content of a string using regular expressions
  • Using std::string_view instead of constant string references
  • Formatting text with std::format
  • Using std::format with user-defined types

Let's start this chapter by looking at a very common problem developers face on a daily basis, which is converting between numeric and string types.

Converting between numeric and string types

Converting between number and string types is a ubiquitous operation. Prior to C++11, there was little support for converting numbers to strings and back, so developers had to resort mostly to type-unsafe functions, and they usually wrote their own utility functions in order to avoid writing the same code over and over again. With C++11, the standard library provides utility functions for converting between numbers and strings. In this recipe, you will learn how to convert numbers to strings and the other way around using modern C++ standard functions.

Getting ready

All the utility functions mentioned in this recipe are available in the <string> header.

How to do it...

Use the following standard conversion functions when you need to convert between numbers and strings:

  • To convert from an integer or floating-point type to a string type, use std::to_string() or std::to_wstring(), as shown in the following code snippet:
    auto si = std::to_string(42);      // si="42"
    auto sl = std::to_string(42L);     // sl="42"
    auto su = std::to_string(42u);     // su="42"
    auto sd = std::to_wstring(42.0);   // sd=L"42.000000"
    auto sld = std::to_wstring(42.0L); // sld=L"42.000000"
    
  • To convert from a string type to an integer type, use std::stoi(), std::stol(), std::stoll(), std::stoul(), or std::stoull(), as shown in the following code snippet:
    auto i1 = std::stoi("42");                 // i1 = 42
    auto i2 = std::stoi("101010", nullptr, 2); // i2 = 42
    auto i3 = std::stoi("052", nullptr, 8);    // i3 = 42
    auto i4 = std::stoi("0x2A", nullptr, 16);  // i4 = 42
    
  • To convert from a string type to a floating-point type, use std::stof(), std::stod(), or std::stold(), as shown in the following code snippet:
    // d1 = 123.45000000000000
    auto d1 = std::stod("123.45");
    // d2 = 123.45000000000000
    auto d2 = std::stod("1.2345e+2");
    // d3 = 123.44999980926514
    auto d3 = std::stod("0xF.6E6666p3");
    

How it works...

To convert an integral or floating-point type to a string type, you can use either the std::to_string() function (which converts to a std::string) or the std::to_wstring() function (which converts to a std::wstring). These functions are available in the <string> header and have overloads for signed and unsigned integer types and for floating-point types. They produce the same result as std::sprintf() and std::swprintf() would produce when called with the appropriate format specifier for each type. The following code snippet lists all the overloads of these two functions:

std::string to_string(int value);
std::string to_string(long value);
std::string to_string(long long value);
std::string to_string(unsigned value);
std::string to_string(unsigned long value);
std::string to_string(unsigned long long value);
std::string to_string(float value);
std::string to_string(double value);
std::string to_string(long double value);
std::wstring to_wstring(int value);
std::wstring to_wstring(long value);
std::wstring to_wstring(long long value);
std::wstring to_wstring(unsigned value);
std::wstring to_wstring(unsigned long value);
std::wstring to_wstring(unsigned long long value);
std::wstring to_wstring(float value);
std::wstring to_wstring(double value);
std::wstring to_wstring(long double value);
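
The equivalence with std::sprintf() described above can be checked with a small sketch. The helper name format_with_printf is hypothetical, and the fixed 64-character buffer is an assumption that is only large enough for modest values; the sketch uses std::snprintf() with the %f specifier, which is the specifier that matches a double argument.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a double with the "%f" specifier.
// Returns an empty string if formatting fails or the text does not
// fit in the (assumed) 64-character buffer.
std::string format_with_printf(double value)
{
  char buffer[64];
  int const written = std::snprintf(buffer, sizeof(buffer), "%f", value);
  if (written < 0 || static_cast<std::size_t>(written) >= sizeof(buffer))
    return {};
  return std::string(buffer, static_cast<std::size_t>(written));
}
```

Calling format_with_printf(42.0) yields the text 42.000000, the same text shown for std::to_string(42.0) in the earlier snippet.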

When it comes to the opposite conversion, there is an entire set of functions that have a name with the format ston (string to number), where n stands for i (integer), l (long), ll (long long), ul (unsigned long), ull (unsigned long long), f (float), d (double), or ld (long double). The following list shows all these functions, each of them with two overloads: one that takes an std::string and one that takes an std::wstring as the first parameter:

int stoi(const std::string& str, std::size_t* pos = 0,
         int base = 10);
int stoi(const std::wstring& str, std::size_t* pos = 0,
         int base = 10);
long stol(const std::string& str, std::size_t* pos = 0,
         int base = 10);
long stol(const std::wstring& str, std::size_t* pos = 0,
         int base = 10);
long long stoll(const std::string& str, std::size_t* pos = 0,
                int base = 10);
long long stoll(const std::wstring& str, std::size_t* pos = 0,
                int base = 10);
unsigned long stoul(const std::string& str, std::size_t* pos = 0,
                    int base = 10);
unsigned long stoul(const std::wstring& str, std::size_t* pos = 0,
                    int base = 10);
unsigned long long stoull(const std::string& str,
                          std::size_t* pos = 0, int base = 10);
unsigned long long stoull(const std::wstring& str,
                          std::size_t* pos = 0, int base = 10);
float       stof(const std::string& str, std::size_t* pos = 0);
float       stof(const std::wstring& str, std::size_t* pos = 0);
double      stod(const std::string& str, std::size_t* pos = 0);
double      stod(const std::wstring& str, std::size_t* pos = 0);
long double stold(const std::string& str, std::size_t* pos = 0);
long double stold(const std::wstring& str, std::size_t* pos = 0);

The way the string to integral type functions work is by discarding all white spaces before a non-whitespace character, then taking as many characters as possible to form a signed or unsigned number (depending on the case), and then converting that to the requested integral type (stoi() will return an integer, stoul() will return an unsigned long, and so on). In all the following examples, the result is the integer 42, except for the last example, where the result is -42:

auto i1 = std::stoi("42");             // i1 = 42
auto i2 = std::stoi("   42");          // i2 = 42
auto i3 = std::stoi("   42fortytwo");  // i3 = 42
auto i4 = std::stoi("+42");            // i4 = 42
auto i5 = std::stoi("-42");            // i5 = -42

A valid integral number may consist of the following parts:

  • A sign, plus (+) or minus (-) (optional)
  • Prefix 0 to indicate an octal base (optional)
  • Prefix 0x or 0X to indicate a hexadecimal base (optional)
  • A sequence of digits

The optional prefix 0 (for octal) is applied only when the specified base is 8 or 0. Similarly, the optional prefix 0x or 0X (for hexadecimal) is applied only when the specified base is 16 or 0.

The functions that convert a string to an integer have three parameters:

  • The input string.
  • A pointer that, when not null, will receive the number of characters that were processed. This can include any leading whitespaces that were discarded, the sign, and the base prefix, so it should not be confused with the number of digits the integral value has.
  • A number indicating the base; by default, this is 10.

The valid digits in the input string depend on the base. For base 2, the only valid digits are 0 and 1; for base 5, they are 01234. For base 11, the valid digits are 0-9 and the characters A and a. This continues until we reach base 36, which has the valid characters 0-9, A-Z, and a-z.

The following are additional examples of strings with numbers in various bases converted to decimal integers. Again, in all cases, the result is either 42 or -42:

auto i6 = std::stoi("052", nullptr, 8);
auto i7 = std::stoi("052", nullptr, 0);
auto i8 = std::stoi("0x2A", nullptr, 16);
auto i9 = std::stoi("0x2A", nullptr, 0);
auto i10 = std::stoi("101010", nullptr, 2);
auto i11 = std::stoi("22", nullptr, 20);
auto i12 = std::stoi("-22", nullptr, 20);
auto pos = size_t{ 0 };
auto i13 = std::stoi("42", &pos);      // pos = 2
auto i14 = std::stoi("-42", &pos);     // pos = 3
auto i15 = std::stoi("  +42dec", &pos);// pos = 5

An important thing to note is that these conversion functions throw an exception if the conversion fails. There are two exceptions that can be thrown:

  • std::invalid_argument: If the conversion cannot be performed:
    try
    {
      auto i16 = std::stoi("");
    }
    catch (std::exception const & e)
    {
      // prints an implementation-defined message,
      // such as "invalid stoi argument"
      std::cout << e.what() << '\n';
    }
    
  • std::out_of_range: If the converted value is outside the range of the result type (or if the underlying function sets errno to ERANGE):
    try
    {
      // OK
      auto i17 = std::stoll("12345678901234");
      // throws std::out_of_range
      auto i18 = std::stoi("12345678901234");
    }
    catch (std::exception const & e)
    {
      // prints an implementation-defined message,
      // such as "stoi argument out of range"
      std::cout << e.what() << '\n';
    }
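
    Both exceptions, together with the pos parameter described earlier, can be combined into a non-throwing wrapper. The following is a minimal sketch, not a standard facility: the name try_parse_int is hypothetical, and it assumes that trailing characters (as in "42fortytwo") should count as a failure, which is stricter than std::stoi() itself.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the parsed value only when the whole
// string is consumed by std::stoi(); otherwise returns std::nullopt.
std::optional<int> try_parse_int(std::string const & text, int base = 10)
{
  try
  {
    std::size_t pos = 0;
    int const value = std::stoi(text, &pos, base);
    if (pos != text.size())
      return std::nullopt;        // trailing characters were left over
    return value;
  }
  catch (std::invalid_argument const &)
  {
    return std::nullopt;          // no conversion could be performed
  }
  catch (std::out_of_range const &)
  {
    return std::nullopt;          // value does not fit in an int
  }
}
```

    With this helper, try_parse_int("42") holds 42, while try_parse_int("") and try_parse_int("42fortytwo") are both empty.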
    

The other set of functions that convert a string to a floating-point type is very similar, except that they don't have a parameter for the numeric base. A valid floating-point value can have different representations in the input string:

  • Decimal floating-point expression (optional sign, sequence of decimal digits with optional point, optional e or E, followed by exponent with optional sign).
  • Hexadecimal floating-point expression (optional sign, 0x or 0X prefix, sequence of hexadecimal digits with optional point, optional p or P, followed by a base 2 exponent with optional sign).
  • Infinity expression (optional sign followed by case-insensitive INF or INFINITY).
  • A non-number expression (optional sign followed by case-insensitive NAN and possibly other alphanumeric characters).

The following are various examples of converting strings to doubles:

auto d1 = std::stod("123.45");         // d1 =  123.45000000000000
auto d2 = std::stod("+123.45");        // d2 =  123.45000000000000
auto d3 = std::stod("-123.45");        // d3 = -123.45000000000000
auto d4 = std::stod("  123.45");       // d4 =  123.45000000000000
auto d5 = std::stod("  -123.45abc");   // d5 = -123.45000000000000
auto d6 = std::stod("1.2345e+2");      // d6 =  123.45000000000000
auto d7 = std::stod("0xF.6E6666p3");   // d7 =  123.44999980926514
auto d8 = std::stod("INF");            // d8 = inf
auto d9 = std::stod("-infinity");      // d9 = -inf
auto d10 = std::stod("NAN");           // d10 = nan
auto d11 = std::stod("-nanabc");       // d11 = -nan

The floating-point base 2 scientific notation, seen earlier in the form 0xF.6E6666p3, is not the topic of this recipe. However, for a clear understanding, a short description is provided; but it is recommended that you look at additional references for details (such as https://en.cppreference.com/w/cpp/language/floating_literal). A floating-point constant in the base 2 scientific notation is composed of several parts:

  • The hexadecimal prefix 0x.
  • An integer part, which in this example was F, which in decimal is 15.
  • A fractional part, which in this example was 6E6666, or 011011100110011001100110 in binary. To convert that into decimal, we need to add inverse powers of two: 1/4 + 1/8 + 1/32 + 1/64 + 1/128 + ....
  • A suffix, representing a power of 2; in this example, p3 means 2 at the power of 3.

The value of the decimal equivalent is determined by multiplying the significand (composed of the integer and fractional parts) by the base raised to the power of the exponent.

For the given hexadecimal base 2 floating-point literal, the significand is 15.4312499... (please note that digits after the seventh one are not shown), the base is 2, and the exponent is 3. Therefore, the result is 15.4312499... * 8, which is 123.44999980926514.
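
The arithmetic above can be reproduced in code. The following sketch rebuilds the value from its parts and parses the same text with std::stod() for comparison; the function names are hypothetical. The fractional part has six hexadecimal digits, that is, 24 bits, so it is divided by 2 raised to the power of 24 (16,777,216).

```cpp
#include <cmath>
#include <string>

// Hypothetical helper: rebuilds the value of 0xF.6E6666p3 by hand.
double hex_float_by_hand()
{
  double const integer_part = 0xF;                     // 15
  double const fraction = 0x6E6666 / 16777216.0;       // 0.4312499...
  double const significand = integer_part + fraction;  // 15.4312499...
  return std::ldexp(significand, 3);                   // significand * 2^3
}

// The same value, parsed from its textual form.
double hex_float_parsed()
{
  return std::stod("0xF.6E6666p3");
}
```

Every intermediate value in hex_float_by_hand() is exactly representable as a double, so the two functions are expected to return identical results.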

See also

  • Limits and other properties of numeric types to learn about the minimum and maximum values, as well as the other properties of numerical types

Limits and other properties of numeric types

Sometimes, it is necessary to know and use the minimum and maximum values that can be represented with a numeric type, such as char, int, or double. Many developers use standard C macros for this, such as CHAR_MIN/CHAR_MAX, INT_MIN/INT_MAX, and DBL_MIN/DBL_MAX. C++ provides a class template called numeric_limits with specializations for every numeric type that enables you to query the minimum and maximum value of a type. However, numeric_limits is not limited to that functionality, and offers additional constants for type property querying, such as whether a type is signed or not, how many bits it needs for representing its values, whether it can represent infinity for floating-point types, and many others. Prior to C++11, the use of numeric_limits<T> was limited because it could not be used in places where constants were needed (examples include the size of arrays and switch cases). Due to that, developers preferred to use C macros throughout their code. In C++11, that is no longer the case, as all the static members of numeric_limits<T> are now constexpr, which means they can be used everywhere a constant expression is expected.

Getting ready

The numeric_limits<T> class template is available in the namespace std in the <limits> header.

How to do it...

Use std::numeric_limits<T> to query various properties of a numeric type T:

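  • Use the min() and max() static methods to get the smallest and largest finite numbers of a type (for floating-point types, min() returns the smallest positive normalized value, and lowest() returns the most negative finite value). The following are examples of how these could be used: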
    template<typename T, typename Iter>
    T minimum(Iter const start, Iter const end) // finds the
                                                // minimum value
                                                // in a range
    {
      T minval = std::numeric_limits<T>::max();
      for (auto i = start; i != end; ++i)
      {
        if (*i < minval)
          minval = *i;
      }
      return minval;
    }
    int range[std::numeric_limits<char>::max() + 1] = { 0 };
    switch(get_value())
    {
      case std::numeric_limits<int>::min():
      // do something
      break;
    }
    
  • Use other static methods and static constants to retrieve other properties of a numeric type. In the following example, the variable bits is an std::bitset object that contains a sequence of bits that are necessary to represent the numerical value represented by the variable n (which is an integer):
    auto n = 42;
    std::bitset<std::numeric_limits<decltype(n)>::digits>
       bits { static_cast<unsigned long long>(n) };
    

In C++11, there is no limitation to where std::numeric_limits<T> can be used; therefore, preferably, use it over C macros in your modern C++ code.
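
One detail deserves care when mirroring the minimum() function from the first example: for floating-point types, min() is the smallest positive normalized value, not the most negative one. A maximum-finding counterpart should therefore start from lowest(). The following is a minimal sketch; the function name maximum is an assumption chosen to parallel minimum().

```cpp
#include <limits>
#include <vector>

// Finds the maximum value in a range. The running maximum starts at
// lowest(), which is the most negative finite value for every
// arithmetic type (min() would be a tiny positive number for double).
template <typename T, typename Iter>
T maximum(Iter const start, Iter const end)
{
  T maxval = std::numeric_limits<T>::lowest();
  for (auto i = start; i != end; ++i)
  {
    if (*i > maxval)
      maxval = *i;
  }
  return maxval;
}
```

For a std::vector<double> holding only negative values, starting from min() would return that tiny positive value instead of an element of the range, whereas starting from lowest() returns the largest element.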

How it works...

The std::numeric_limits<T> class template enables developers to query properties of numeric types. Actual values are available through specializations, and the standard library provides specializations for all the built-in numeric types (char, short, int, long, float, double, and so on). In addition, third parties may provide additional implementations for other types. An example could be a numeric library that implements a bigint integer type and a decimal type and provides specializations of numeric_limits for these types (such as numeric_limits<bigint> and numeric_limits<decimal>).
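
To illustrate what such a third-party specialization could look like, here is a deliberately small sketch. The percent type is hypothetical and stands in for the bigint or decimal types mentioned above; only a handful of members are defined, whereas a complete specialization would provide all of them.

```cpp
#include <limits>

// Hypothetical user-defined type: a whole-number percentage in [0, 100].
struct percent
{
  int value;
};

namespace std
{
  // Partial sketch of a specialization for the hypothetical type;
  // a production version would define every member of numeric_limits.
  template <>
  class numeric_limits<percent>
  {
  public:
    static constexpr bool is_specialized = true;
    static constexpr bool is_integer = true;
    static constexpr bool is_signed = false;
    static constexpr bool is_exact = true;
    static constexpr percent min() noexcept { return percent{ 0 }; }
    static constexpr percent max() noexcept { return percent{ 100 }; }
    static constexpr percent lowest() noexcept { return percent{ 0 }; }
  };
}
```

Generic code can then query std::numeric_limits<percent>::max() in the same way it queries the limits of a built-in type.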

The following specializations of numeric types are available in the <limits> header. Note that specializations for char16_t and char32_t are new in C++11; the others were available previously. Apart from the specializations listed ahead, the library also includes specializations for every cv-qualified version of these numeric types, and they are identical to the unqualified specialization. For example, consider the type int; there are four actual specializations (and they are identical): numeric_limits<int>, numeric_limits<const int>, numeric_limits<volatile int>, and numeric_limits<const volatile int>:

template<> class numeric_limits<bool>;
template<> class numeric_limits<char>;
template<> class numeric_limits<signed char>;
template<> class numeric_limits<unsigned char>;
template<> class numeric_limits<wchar_t>;
template<> class numeric_limits<char16_t>;
template<> class numeric_limits<char32_t>;
template<> class numeric_limits<short>;
template<> class numeric_limits<unsigned short>;
template<> class numeric_limits<int>;
template<> class numeric_limits<unsigned int>;
template<> class numeric_limits<long>;
template<> class numeric_limits<unsigned long>;
template<> class numeric_limits<long long>;
template<> class numeric_limits<unsigned long long>;
template<> class numeric_limits<float>;
template<> class numeric_limits<double>;
template<> class numeric_limits<long double>;

As mentioned earlier, in C++11, all static members of std::numeric_limits are constexpr, which means they can be used in all the places where constant expressions are needed. These have several major advantages over C macros:

  • They are easier to remember, as the only thing you need to know is the name of the type, which you should know anyway, and not countless names of macros.
  • They support types that are not available in C, such as char16_t and char32_t.
  • They are the only possible solutions for templates where you don't know the type.
  • Minimum and maximum are only two of the various properties of types it provides; therefore, its actual use is beyond the numeric limits shown. As a side note, for this reason, the class should have been perhaps called numeric_properties, instead of numeric_limits.
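
The point about templates can be made concrete with a small range check. This is a sketch under stated assumptions: the name fits_in is hypothetical, and the function is restricted to integral target types narrower than long long so that the limits can be compared safely as long long values.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: tells whether a value fits in the integral type To.
// No macro could express this, because To is not known in advance.
template <typename To>
constexpr bool fits_in(long long value) noexcept
{
  static_assert(std::numeric_limits<To>::is_integer,
                "To must be an integral type");
  static_assert(std::numeric_limits<To>::digits <
                std::numeric_limits<long long>::digits,
                "To must be narrower than long long");
  return value >= static_cast<long long>(std::numeric_limits<To>::min())
      && value <= static_cast<long long>(std::numeric_limits<To>::max());
}

// Because the members are constexpr, the check works at compile time too.
static_assert(fits_in<std::int16_t>(32767), "32767 fits in 16 bits");
static_assert(!fits_in<std::int16_t>(32768), "32768 does not fit");
```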

The following function template, print_type_properties(), prints the minimum and maximum finite values of the type, as well as other information:

template <typename T>
void print_type_properties()
{
  std::cout
    << "min="
    << std::numeric_limits<T>::min()        << '\n'
    << "max="
    << std::numeric_limits<T>::max()        << '\n'
    << "bits="
    << std::numeric_limits<T>::digits       << '\n'
    << "decdigits="
    << std::numeric_limits<T>::digits10     << '\n'
    << "integral="
    << std::numeric_limits<T>::is_integer   << '\n'
    << "signed="
    << std::numeric_limits<T>::is_signed    << '\n'
    << "exact="
    << std::numeric_limits<T>::is_exact     << '\n'
    << "infinity="
    << std::numeric_limits<T>::has_infinity << '\n';
}

If we call the print_type_properties() function for unsigned short, int, and double, we will get the following output:

unsigned short    int                double
min=0             min=-2147483648    min=2.22507e-308
max=65535         max=2147483647     max=1.79769e+308
bits=16           bits=31            bits=53
decdigits=4       decdigits=9        decdigits=15
integral=1        integral=1         integral=0
signed=0          signed=1           signed=1
exact=1           exact=1            exact=0
infinity=0        infinity=0         infinity=1

Please note that there is a difference between the digits and digits10 constants:

  • digits represents the number of bits (excluding the sign bit, if present, and any padding bits) for integral types and the number of bits of the mantissa for floating-point types.
  • digits10 is the number of decimal digits that can be represented by a type without a change. To understand this better, let's consider the case of unsigned short. This is a 16-bit integral type. It can represent numbers between 0 and 65,535. It can represent some numbers with five decimal digits, 10,000 to 65,535, but it cannot represent all five-digit numbers, as numbers from 65,536 to 99,999 require more bits. Therefore, the largest numbers that it can represent without requiring more bits have four decimal digits (numbers from 1,000 to 9,999). This is the value indicated by digits10. For integral types, it has a direct relationship to the constant digits; for an integral type, T, the value of digits10 is std::numeric_limits<T>::digits * std::log10(2), rounded down.
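
The relationship between the two constants for integral types can be verified with a short sketch. The function name digits10_from_bits is hypothetical; the result of the multiplication is rounded down, and the formula as written applies to integral types only.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: derives digits10 from digits for an integral type T
// by rounding digits * log10(2) down to the nearest integer.
template <typename T>
int digits10_from_bits()
{
  static_assert(std::numeric_limits<T>::is_integer,
                "the formula is only used here for integral types");
  return static_cast<int>(
    std::floor(std::numeric_limits<T>::digits * std::log10(2.0)));
}
```

For unsigned short with 16 value bits, the product is about 4.8, which rounds down to the 4 reported by digits10.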

It's worth mentioning that the standard library types that are aliases of arithmetic types (such as std::size_t) may also be inspected with std::numeric_limits. On the other hand, other standard types that are not arithmetic types, such as std::complex<T> or std::nullptr_t, do not have std::numeric_limits specializations.
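
Whether a type has a specialization can be queried through the is_specialized member, which is false in the primary template. The following sketch wraps that query in a hypothetical helper called has_limits.

```cpp
#include <complex>
#include <cstddef>
#include <limits>

// Hypothetical helper: true when std::numeric_limits is specialized for T.
template <typename T>
constexpr bool has_limits() noexcept
{
  return std::numeric_limits<T>::is_specialized;
}

static_assert(has_limits<std::size_t>(),
              "std::size_t is an alias of an arithmetic type");
static_assert(!has_limits<std::nullptr_t>(),
              "std::nullptr_t is not an arithmetic type");
```

The same check returns false for std::complex<double>, so generic code can use it to reject types for which the limits are meaningless.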

See also

  • Converting between numeric and string types to learn how to convert between numbers and strings

Generating pseudo-random numbers

Generating random numbers is necessary for a large variety of applications, from games to cryptography, from sampling to forecasting. However, the term random numbers is not actually correct, as the generation of numbers through mathematical formulas is deterministic and does not produce true random numbers, but numbers that look random and are called pseudo-random. True randomness can only be achieved through hardware devices, based on physical processes, and even that can be challenged as we may consider even the universe to be actually deterministic. Modern C++ provides support for generating pseudo-random numbers through a pseudo-random number library containing number generators and distributions. Theoretically, it can also produce true random numbers, but in practice, those could actually be only pseudo-random.

Getting ready

In this recipe, we'll discuss the standard support for generating pseudo-random numbers. Understanding the difference between random and pseudo-random numbers is key. True random numbers are numbers that cannot be predicted better than by random chance, and are produced with the help of hardware random number generators. Pseudo-random numbers are numbers produced with the help of algorithms that generate sequences with properties that approximate the ones of true random numbers.

Furthermore, being familiar with various statistical distributions is a plus. It is mandatory, though, that you know what a uniform distribution is, because all engines in the library produce numbers that are uniformly distributed. Without going into any details, we will just mention that uniform distribution is a probability distribution that is concerned with events that are equally likely to occur (within certain bounds).

How to do it...

To generate pseudo-random numbers in your application, you should perform the following steps:

  1. Include the header <random>:
    #include <random>
    
  2. Use an std::random_device generator for seeding a pseudo-random engine:
    std::random_device rd{};
    
  3. Use one of the available engines for generating numbers and initialize it with a random seed:
    auto mtgen = std::mt19937{ rd() };
    
  4. Use one of the available distributions for converting the output of the engine to one of the desired statistical distributions:
    auto ud = std::uniform_int_distribution<>{ 1, 6 };
    
  5. Generate the pseudo-random numbers:
    for(auto i = 0; i < 20; ++i)
      auto number = ud(mtgen);
    

How it works...

The pseudo-random number library contains two types of components:

  • Engines, which are generators of random numbers; these can produce either pseudo-random numbers with a uniform distribution or, if available, actual random numbers.
  • Distributions that convert the output of an engine to a statistical distribution.

All engines (except for random_device) produce integer numbers in a uniform distribution, and all engines implement the following methods:

  • min(): This is a static method that returns the minimum value that can be produced by the generator.
  • max(): This is a static method that returns the maximum value that can be produced by the generator.
  • seed(): This initializes the algorithm with a start value (except for random_device, which cannot be seeded).
  • operator(): This generates a new number uniformly distributed between min() and max().
  • discard(): This generates and discards a given number of pseudo-random numbers.
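
The methods listed above can be exercised with a short sketch that uses std::mt19937. The function names are hypothetical. The first function shows that seed() makes two engines produce the same sequence, and the second shows that discard(n) has the same effect as calling the engine n times and ignoring the results.

```cpp
#include <random>

// Two engines that hold the same seed produce the same sequence.
bool same_seed_same_sequence(std::mt19937::result_type seed_value)
{
  std::mt19937 first{ seed_value };
  std::mt19937 second;            // default-seeded at this point
  second.seed(seed_value);        // reseeded to match the first engine
  for (int i = 0; i < 100; ++i)
  {
    if (first() != second())
      return false;
  }
  return true;
}

// discard(count) skips exactly as many numbers as count calls would.
bool discard_equals_manual_skip(std::mt19937::result_type seed_value,
                                unsigned long long count)
{
  std::mt19937 skipped{ seed_value };
  std::mt19937 stepped{ seed_value };
  skipped.discard(count);
  for (unsigned long long i = 0; i < count; ++i)
    stepped();                    // the generated value is ignored
  return skipped() == stepped();
}
```

For std::mt19937, min() is 0 and max() is 4294967295, the largest 32-bit unsigned value.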

The following engines are available:

  • linear_congruential_engine: This is a linear congruential generator that produces numbers using the following formula:

    x(i) = (A * x(i – 1) + C) mod M

  • mersenne_twister_engine: This is a Mersenne twister generator whose state amounts to W * N – R bits (for std::mt19937, this is 32 * 624 – 31 = 19,937 bits). Each time a number needs to be generated, it extracts W bits. When all the bits have been used, it twists the large value by shifting and mixing the bits so that it has a new set of bits to extract from.
  • subtract_with_carry_engine: This is a generator that implements a subtract with carry algorithm based on the following formula:

    x(i) = (x(i – S) – x(i – R) – cy(i – 1)) mod M

    In the preceding formula, cy is the carry: cy(i) is 1 if x(i – S) – x(i – R) – cy(i – 1) < 0, and 0 otherwise.

In addition, the library provides engine adapters that are also engines wrapping another engine and producing numbers based on the output of the base engine. Engine adapters implement the same methods mentioned earlier for the base engines. The following engine adapters are available:

  • discard_block_engine: A generator that, from every block of P numbers generated by the base engine, keeps only R numbers, discarding the rest.
  • independent_bits_engine: A generator that produces numbers with a different number of bits than the base engine.
  • shuffle_order_engine: A generator that keeps a shuffled table of K numbers produced by the base engine and returns numbers from this table, replacing them with numbers generated by the base engine.
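
As an illustration of how an adapter relates to its base engine, the following sketch compares std::discard_block_engine with a manual equivalent. The alias and function names are hypothetical, and the choice of P = 5 and R = 2 is arbitrary: from every block of five numbers, the adapter returns the first two and skips the remaining three.

```cpp
#include <random>

using base_engine = std::minstd_rand;
// Keeps 2 numbers out of every block of 5 produced by the base engine.
using keep_two_of_five = std::discard_block_engine<base_engine, 5, 2>;

// Checks that the adapter output equals the base engine output with
// three numbers manually discarded after every two that are used.
bool adapter_matches_manual_discard(base_engine::result_type seed_value,
                                    int blocks)
{
  base_engine plain{ seed_value };
  keep_two_of_five adapted{ seed_value };
  for (int block = 0; block < blocks; ++block)
  {
    for (int kept = 0; kept < 2; ++kept)
    {
      if (adapted() != plain())
        return false;
    }
    plain.discard(3);             // the part of the block that is dropped
  }
  return true;
}
```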

Choosing a pseudo-random number generator should be done based on the specific requirements of your application. The linear congruential engine is medium fast but has very small storage requirements for its internal state. The subtract with carry engine is very fast, including on machines that don't have a processor with an advanced arithmetic instruction set. However, it requires larger storage for its internal state, and the sequence of generated numbers has less desirable characteristics. The Mersenne twister is the slowest of these engines and has the greatest storage requirements, but produces the longest non-repeating sequences of pseudo-random numbers.
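
The small state of the linear congruential engine follows directly from its formula, x(i) = (A * x(i – 1) + C) mod M: the only thing to remember is the previous value. The following sketch reproduces the formula by hand and compares it with std::minstd_rand, which is a linear_congruential_engine with A = 48271, C = 0, and M = 2147483647. The function names are hypothetical, and the comparison assumes a seed between 1 and M – 1 so that the seed is used as the initial state unchanged.

```cpp
#include <cstdint>
#include <random>

// One hand-written step of x(i) = (A * x(i - 1) + C) mod M.
constexpr std::uint64_t lcg_next(std::uint64_t previous, std::uint64_t a,
                                 std::uint64_t c, std::uint64_t m)
{
  return (a * previous + c) % m;
}

// Compares the hand-written formula with std::minstd_rand.
// Assumes 1 <= seed_value < modulus.
bool matches_minstd_rand(std::minstd_rand::result_type seed_value, int count)
{
  std::minstd_rand engine{ seed_value };
  std::uint64_t x = seed_value;   // the entire state is this one number
  for (int i = 0; i < count; ++i)
  {
    x = lcg_next(x, std::minstd_rand::multiplier,
                 std::minstd_rand::increment, std::minstd_rand::modulus);
    if (engine() != x)
      return false;
  }
  return true;
}
```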

All these engines and engine adaptors produce pseudo-random numbers. The library, however, provides another engine called random_device that is supposed to produce non-deterministic numbers, but this is not an actual constraint as physical sources of random entropy might not be available. Therefore, implementations of random_device could actually be based on a pseudo-random engine. The random_device class cannot be seeded like the other engines and has an additional method called entropy() that returns the random device entropy, which is 0 for a deterministic generator and nonzero for a non-deterministic generator.

However, this is not a reliable method for determining whether the device is actually deterministic or non-deterministic. For instance, both GNU libstdc++ and LLVM libc++ implement a non-deterministic device, but return 0 for entropy. On the other hand, VC++ and boost.random return 32 and 10, respectively, for entropy.

All these generators produce integers in a uniform distribution. This is, however, only one of the many statistical distributions that applications need their random numbers to follow. To be able to produce numbers (either integer or real) in other distributions, the library provides several classes called distributions. These convert the output of an engine according to the statistical distribution it implements. The following distributions are available:

Type       Class name                        Numbers  Statistical distribution
Uniform    uniform_int_distribution          Integer  Uniform
           uniform_real_distribution         Real     Uniform
Bernoulli  bernoulli_distribution            Boolean  Bernoulli
           binomial_distribution             Integer  Binomial
           negative_binomial_distribution    Integer  Negative binomial
           geometric_distribution            Integer  Geometric
Poisson    poisson_distribution              Integer  Poisson
           exponential_distribution          Real     Exponential
           gamma_distribution                Real     Gamma
           weibull_distribution              Real     Weibull
           extreme_value_distribution        Real     Extreme value
Normal     normal_distribution               Real     Standard normal (Gaussian)
           lognormal_distribution            Real     Lognormal
           chi_squared_distribution          Real     Chi-squared
           cauchy_distribution               Real     Cauchy
           fisher_f_distribution             Real     Fisher's F-distribution
           student_t_distribution            Real     Student's t-distribution
Sampling   discrete_distribution             Integer  Discrete
           piecewise_constant_distribution   Real     Values distributed on constant subintervals
           piecewise_linear_distribution     Real     Values distributed on defined subintervals
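
To give a feel for how distributions are paired with an engine, here is a minimal sketch that uses two of the classes from the table. The function names are hypothetical, and the engines are seeded with a caller-supplied value purely to keep the sketch reproducible.

```cpp
#include <random>

// Counts how many of `trials` draws from a Bernoulli distribution
// with the given probability come out true.
int count_successes(std::mt19937::result_type seed_value, int trials,
                    double probability)
{
  std::mt19937 engine{ seed_value };
  std::bernoulli_distribution coin{ probability };
  int successes = 0;
  for (int i = 0; i < trials; ++i)
  {
    if (coin(engine))
      ++successes;
  }
  return successes;
}

// Checks that draws from uniform_real_distribution stay in [low, high).
bool reals_stay_in_range(std::mt19937::result_type seed_value, int draws,
                         double low, double high)
{
  std::mt19937 engine{ seed_value };
  std::uniform_real_distribution<double> dist{ low, high };
  for (int i = 0; i < draws; ++i)
  {
    double const value = dist(engine);
    if (value < low || value >= high)
      return false;
  }
  return true;
}
```

With a probability of 0.5 and 10,000 trials, count_successes() is expected to return a value close to 5,000; the exact count depends on the seed and on the standard library implementation.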

Each of the engines provided by the library has advantages and disadvantages, as mentioned earlier. The Mersenne twister, although the slowest and the one with the largest internal state, can produce the longest non-repeating sequence of numbers when initialized appropriately. In the following examples, we will use std::mt19937, a 32-bit Mersenne twister with 19,937 bits of internal state.

The simplest way to generate random numbers looks like this:

auto mtgen = std::mt19937 {};
for (auto i = 0; i < 10; ++i)
  std::cout << mtgen() << '\n';

In this example, mtgen is an std::mt19937 Mersenne twister engine. To generate numbers, you only need to use the call operator, which advances the internal state and returns the next pseudo-random number. However, this code is flawed, as the engine is always constructed with the same default seed. As a result, it always produces the same sequence of numbers, which is probably not what you want in most cases.
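
The repeatability of a default-constructed engine is not an accident of a particular implementation: the C++ standard fixes the default seed of std::mt19937 to 5489 and requires that the 10000th consecutive invocation of a default-constructed object produces the value 4123659995. The following sketch checks this; the function name is hypothetical.

```cpp
#include <random>

// Returns the n-th value (counting from 1) produced by a
// default-constructed std::mt19937. Assumes n >= 1.
std::mt19937::result_type nth_value_of_default_mt19937(unsigned long long n)
{
  std::mt19937 engine;            // uses std::mt19937::default_seed
  engine.discard(n - 1);          // skip the first n - 1 values
  return engine();
}
```

Calling nth_value_of_default_mt19937(10000) returns 4123659995 on every conforming implementation and on every run, which is exactly why an unseeded engine is unsuitable when varying sequences are needed.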

There are different approaches for initializing the engine. One approach, common with the C random library, is to use the current time. In modern C++, it should look like this:

auto seed = std::chrono::high_resolution_clock::now()
            .time_since_epoch()
            .count();
auto mtgen = std::mt19937{ static_cast<unsigned int>(seed) };

In this example, seed is a number representing the number of ticks since the clock's epoch until the present moment. This number is then used to seed the engine. The problem with this approach is that the value of that seed is actually deterministic, and in some classes of applications, it could be prone to attacks. A more reliable approach is to seed the generator with actual random numbers.

The std::random_device class is an engine that is supposed to return true random numbers, though implementations could actually be based on a pseudo-random generator:

std::random_device rd;
auto mtgen = std::mt19937 {rd()};

Numbers produced by all engines follow a uniform distribution. To convert the result to another statistical distribution, we have to use a distribution class. To show how generated numbers are distributed according to the selected distribution, we will use the following function. This function generates a specified number of pseudo-random numbers and counts their repetition in a map. The values from the map are then used to produce a bar-like diagram showing how often each number occurred:

void generate_and_print(std::function<int(void)> gen,
                        int const iterations = 10000)
{
  // map to store the numbers and their repetition
  auto data = std::map<int, int>{};
  // generate random numbers
  for (auto n = 0; n < iterations; ++n)
    ++data[gen()];
  // find the element with the most repetitions
  auto max = std::max_element(
             std::begin(data), std::end(data),
             [](auto kvp1, auto kvp2) {
    return kvp1.second < kvp2.second; });
  // print the bars
  for (auto i = max->second / 200; i > 0; --i)
  {
    for (auto kvp : data)
    {
      std::cout
        << std::fixed << std::setprecision(1) << std::setw(3)
        << (kvp.second / 200 >= i ? (char)219 : ' ');
    }
    std::cout << '\n';
  }
  // print the numbers
  for (auto kvp : data)
  {
    std::cout
      << std::fixed << std::setprecision(1) << std::setw(3)
      << kvp.first;
  }
  std::cout << '\n';
}

The following code generates random numbers using the std::mt19937 engine with a uniform distribution in the range [1, 6]; this is basically what you get when you roll a die:

std::random_device rd{};
auto mtgen = std::mt19937{ rd() };
auto ud = std::uniform_int_distribution<>{ 1, 6 };
generate_and_print([&mtgen, &ud]() {return ud(mtgen); });

The output of the program looks like this:

Figure 2.1: Uniform distribution of the range [1,6]

In the next and final example, we're changing the distribution to a normal distribution with a mean of 5 and a standard deviation of 2. This distribution produces real numbers; therefore, in order to use the previous generate_and_print() function, the numbers must be rounded to integers:

std::random_device rd{};
auto mtgen = std::mt19937{ rd() };
auto nd = std::normal_distribution<>{ 5, 2 };
generate_and_print(
  [&mtgen, &nd]() {
    return static_cast<int>(std::round(nd(mtgen))); });

The following will be the output of the preceding code:

Figure 2.2: Normal distribution with mean 5 and standard deviation 2

Here, we can see that, based on the graphical representation, the distribution has changed from a uniform one to a normal one with the mean at value 5.
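Without the diagram, the effect of a distribution class can also be checked numerically. The sketch below is only an illustration: the helpers normal_sample_mean and dice_rolls_in_range are hypothetical names, and a fixed seed is used purely to make the check repeatable, which is not how an engine should be seeded in production code.

```cpp
#include <cassert>
#include <random>

// Hypothetical helper: average of `count` samples drawn from a normal
// distribution with mean 5 and standard deviation 2.
inline double normal_sample_mean(unsigned int seed, int count)
{
  std::mt19937 engine{ seed };  // fixed seed, for repeatability only
  std::normal_distribution<> nd{ 5.0, 2.0 };
  double sum = 0.0;
  for (int i = 0; i < count; ++i)
    sum += nd(engine);
  return sum / count;
}

// Hypothetical helper: checks that every simulated die roll falls in
// the closed range [1, 6].
inline bool dice_rolls_in_range(unsigned int seed, int count)
{
  std::mt19937 engine{ seed };
  std::uniform_int_distribution<> ud{ 1, 6 };
  for (int i = 0; i < count; ++i)
  {
    int const value = ud(engine);
    if (value < 1 || value > 6) return false;
  }
  return true;
}
```

With a large number of samples, the computed mean should be very close to 5, while the die rolls never leave the requested range.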

See also

  • Initializing all bits of internal state of a pseudo-random number generator to learn how to properly initialize random number engines

Initializing all bits of internal state of a pseudo-random number generator

In the previous recipe, we looked at the pseudo-random number library, along with its components, and how it can be used to produce numbers in different statistical distributions. One important factor that was overlooked in that recipe is the proper initialization of the pseudo-random number generators.

With careful analysis (that is beyond the purpose of this recipe or this book), it can be shown that the Mersenne twister engine, when not properly initialized, has a bias toward producing some values repeatedly and omitting others, thus generating numbers not in a uniform distribution, but rather in a binomial or Poisson distribution. In this recipe, you will learn how to initialize a generator in order to produce pseudo-random numbers with a true uniform distribution.

Getting ready

You should read the previous recipe, Generating pseudo-random numbers, to get an overview of what the pseudo-random number library offers.

How to do it...

To properly initialize a pseudo-random number generator to produce a uniformly distributed sequence of pseudo-random numbers, perform the following steps:

  1. Use an std::random_device to produce random numbers to be used as seeding values:
    std::random_device rd;
    
  2. Generate random data for all internal bits of the engine:
    std::array<int, std::mt19937::state_size> seed_data {};
    std::generate(std::begin(seed_data), std::end(seed_data),
                  std::ref(rd));
    
  3. Create an std::seed_seq object from the previously generated pseudo-random data:
    std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
    
  4. Create an engine object and initialize all the bits representing the internal state of the engine; for example, an mt19937 has 19,937 bits of internal state:
    auto eng = std::mt19937{ seq };
    
  5. Use the appropriate distribution based on the requirements of the application:
    auto dist = std::uniform_real_distribution<>{ 0, 1 };
    

How it works...

In all the examples shown in the previous recipe, we used the std::mt19937 engine to produce pseudo-random numbers. Though the Mersenne twister is slower than the other engines, it can produce the longest sequences of non-repeating numbers with the best spectral characteristics. However, initializing the engine in the manner shown in the previous recipe will not have this effect. The problem is that the internal state of mt19937 has 624 32-bit integers, and in the examples from the previous recipe, we have only initialized one of them.

When working with the pseudo-random number library, remember the following rule of thumb (shown in the information box):

In order to produce the best results, engines must have all their internal state properly initialized before generating numbers.

The pseudo-random number library provides a class for this particular purpose, called std::seed_seq. This is a generator that can be seeded with any number of 32-bit integers and produces the requested number of integers evenly distributed in the 32-bit space.

In the preceding code from the How to do it... section, we defined an array called seed_data with a number of 32-bit integers equal to the internal state of the mt19937 generator; that is, 624 integers. Then, we initialized the array with random numbers produced by std::random_device. The array was later used to seed std::seed_seq, which, in turn, was used to seed the mt19937 generator.
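The steps from the How to do it... section can be combined into a single function. This is a minimal sketch: the function name make_fully_seeded_mt19937 is hypothetical, and the seed array uses std::random_device::result_type instead of int as its element type, which is an assumption made here to avoid a signed conversion.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <functional>
#include <random>

// Hypothetical helper: creates an std::mt19937 whose whole internal
// state is initialized from std::random_device through std::seed_seq.
inline std::mt19937 make_fully_seeded_mt19937()
{
  static_assert(std::mt19937::state_size == 624,
                "mt19937 keeps 624 32-bit words of state");

  std::random_device rd;

  // One random value for each word of the engine's internal state.
  std::array<std::random_device::result_type,
             std::mt19937::state_size> seed_data{};
  std::generate(std::begin(seed_data), std::end(seed_data),
                std::ref(rd));

  // The seed sequence spreads the seed data over the engine state.
  std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
  return std::mt19937{ seq };
}
```

The returned engine can then be passed to any distribution, for example std::uniform_real_distribution<>{ 0, 1 }.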

See also

  • Generating pseudo-random numbers to familiarize yourself with the capabilities of the standard numerics library for generating pseudo-random numbers

Creating cooked user-defined literals

Literals are constants of built-in types (numerical, Boolean, character, character string, and pointer) that cannot be altered in a program. The language defines a series of prefixes and suffixes to specify literals (and the prefix/suffix is actually part of the literal). C++11 allows us to create user-defined literals by defining functions called literal operators, which introduce suffixes for specifying literals. These work only with numerical, character, and character string types.

This makes it possible both to define standard literals in future versions of the language and for developers to create their own literals. In this recipe, we will learn how to create our own cooked literals.

Getting ready

User-defined literals can have two forms: raw and cooked. Raw literals are not processed by the compiler, whereas cooked literals are values processed by the compiler (examples can include handling escape sequences in a character string or identifying numerical values such as integer 2989 from literal 0xBAD). Raw literals are only available for integral and floating-point types, whereas cooked literals are also available for character and character string literals.

How to do it...

To create cooked user-defined literals, you should follow these steps:

  1. Define your literals in a separate namespace to avoid name clashes.
  2. Always prefix the user-defined suffix with an underscore (_).
  3. Define a literal operator of one of the following forms for cooked literals:
    T operator "" _suffix(unsigned long long int);
    T operator "" _suffix(long double);
    T operator "" _suffix(char);
    T operator "" _suffix(wchar_t);
    T operator "" _suffix(char16_t);
    T operator "" _suffix(char32_t);
    T operator "" _suffix(char const *, std::size_t);
    T operator "" _suffix(wchar_t const *, std::size_t);
    T operator "" _suffix(char16_t const *, std::size_t);
    T operator "" _suffix(char32_t const *, std::size_t);
    

The following example creates a user-defined literal for specifying kilobytes:

namespace compunits
{
  constexpr size_t operator "" _KB(unsigned long long const size)
  {
    return static_cast<size_t>(size * 1024);
  }
}
using namespace compunits; // make the _KB suffix visible
auto size{ 4_KB };         // size_t size = 4096;
using byte = unsigned char;
auto buffer = std::array<byte, 1_KB>{};

How it works...

When the compiler encounters a user-defined literal with a user-defined suffix, S (it always has a leading underscore for third-party suffixes, as suffixes without a leading underscore are reserved for the standard library), it does an unqualified name lookup in order to identify a function with the name operator "" S. If it finds one, then it calls it according to the type of the literal and the type of the literal operator. Otherwise, the compiler will yield an error.

In the example shown in the How to do it... section, the literal operator is called operator "" _KB and has an argument of type unsigned long long int. This is the only integral type possible for literal operators for handling integral types. Similarly, for floating-point user-defined literals, the parameter type must be long double since for numeric types, the literal operators must be able to handle the largest possible values. This literal operator returns a constexpr value so that it can be used where compile-time values are expected, such as specifying the size of an array, as shown in the preceding example.

When the compiler identifies a user-defined literal and has to call the appropriate user-defined literal operator, it will pick the overload from the overload set according to the following rules:

  • For integral literals: It calls in the following order: the operator that takes an unsigned long long, the raw literal operator that takes a const char*, or the literal operator template.
  • For floating-point literals: It calls in the following order: the operator that takes a long double, the raw literal operator that takes a const char*, or the literal operator template.
  • For character literals: It calls the appropriate operator, depending on the character type (char, wchar_t, char16_t, and char32_t).
  • For string literals: It calls the appropriate operator, depending on the string type, that takes a pointer to the string of characters and the size.
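The selection rules for numeric literals can be observed with a small experiment. The following sketch is an illustration only; the namespace overload_demo and the suffix _pick are hypothetical. It defines a cooked operator for integral literals and a raw operator, but deliberately no operator taking a long double.

```cpp
#include <cassert>

namespace overload_demo
{
  // Cooked operator: receives the integer value already parsed by
  // the compiler.
  constexpr int operator""_pick(unsigned long long) { return 1; }

  // Raw operator: receives the characters of the literal as a
  // null-terminated string.
  constexpr int operator""_pick(const char*) { return 2; }

  // Integral literal: the unsigned long long overload is preferred.
  static_assert(42_pick == 1, "cooked operator expected");

  // Floating-point literal: there is no long double overload, so the
  // raw operator is used as the fallback.
  static_assert(4.2_pick == 2, "raw operator expected");
}
```

Because the cooked overload exists, 42_pick never reaches the raw operator; the raw operator only handles the floating-point literal.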

In the following example, we're defining a system of units and quantities. We want to operate with kilograms, pieces, liters, and other types of units. This could be useful in a system that can process orders and you need to specify the amount and unit for each article.

The following are defined in the namespace units:

  • A scoped enumeration for the possible types of units (kilogram, meter, liter, and pieces):
    enum class unit { kilogram, liter, meter, piece, };
    
  • A class template to specify quantities of a particular unit (such as 3.5 kilograms or 42 pieces):
    template <unit U>
    class quantity
    {
      const double amount;
    public:
      constexpr explicit quantity(double const a) : amount(a)
      {}
      constexpr explicit operator double() const { return amount; }
    };
    
  • The operator+ and operator- functions for the quantity class template in order to be able to add and subtract quantities:
    template <unit U>
    constexpr quantity<U> operator+(quantity<U> const &q1,
                                    quantity<U> const &q2)
    {
      return quantity<U>(static_cast<double>(q1) +
                         static_cast<double>(q2));
    }
    template <unit U>
    constexpr quantity<U> operator-(quantity<U> const &q1,
                                    quantity<U> const &q2)
    {
      return quantity<U>(static_cast<double>(q1) -
                         static_cast<double>(q2));
    }
    
  • Literal operators to create quantity literals, defined in an inner namespace called unit_literals. The purpose of this is to avoid possible name clashes with literals from other namespaces.

    If such collisions do happen, developers can select the literals they want by using the appropriate namespace in the scope where the literals need to be used:

    namespace unit_literals
    {
      constexpr quantity<unit::kilogram> operator "" _kg(
          long double const amount)
      {
        return quantity<unit::kilogram>
          { static_cast<double>(amount) };
      }
      constexpr quantity<unit::kilogram> operator "" _kg(
          unsigned long long const amount)
      {
        return quantity<unit::kilogram>
          { static_cast<double>(amount) };
      }
      constexpr quantity<unit::liter> operator "" _l(
          long double const amount)
      {
        return quantity<unit::liter>
          { static_cast<double>(amount) };
      }
      constexpr quantity<unit::meter> operator "" _m(
          long double const amount)
      {
        return quantity<unit::meter>
          { static_cast<double>(amount) };
      }
      constexpr quantity<unit::piece> operator "" _pcs(
          unsigned long long const amount)
      {
        return quantity<unit::piece>
          { static_cast<double>(amount) };
      }
    }
    

Looking carefully, you will notice that the literal operators defined earlier are not the same:

  • _kg is defined for both integral and floating-point literals; that enables us to create both integral and floating-point values such as 1_kg and 1.0_kg.
  • _l and _m are defined only for floating-point literals; this means we can only define quantity literals for these units with floating points, such as 4.5_l and 10.0_m.
  • _pcs is only defined for integral literals; this means we can only define quantities of an integer number of pieces, such as 42_pcs.

Having these literal operators available, we can operate with various quantities. The following examples show both valid and invalid operations:

using namespace units;
using namespace unit_literals;
auto q1{ 1_kg };    // OK
auto q2{ 4.5_kg };  // OK
auto q3{ q1 + q2 }; // OK
auto q4{ q2 - q1 }; // OK
// error, cannot add meters and pieces
auto q5{ 1.0_m + 1_pcs };
// error, cannot have an integer number of liters
auto q6{ 1_l };
// error, can only have an integer number of pieces
auto q7{ 2.0_pcs };

q1 is a quantity of 1 kg; this is an integer value. Since an overloaded operator "" _kg(unsigned long long const) exists, the literal can be correctly created from the integer 1. Similarly, q2 is a quantity of 4.5 kilograms; this is a real value. Since an overloaded operator "" _kg(long double) exists, the literal can be created from the double floating-point value 4.5.

On the other hand, q6 is a quantity of 1 liter. Creating it would require an overloaded operator "" _l(unsigned long long), but such an overload does not exist, so the literal cannot be created. Similarly, q7 is a quantity of 2.0 pieces, but piece literals can only be created from integer values and, therefore, this generates another compiler error.

There's more...

Though user-defined literals are available from C++11, standard literal operators have been available only from C++14. Further standard user-defined literals have been added in later versions of the standard. The following is a list of these standard literal operators:

  • operator""s for defining std::basic_string literals and operator""sv (in C++17) for defining std::basic_string_view literals:
    using namespace std::string_literals;
    auto s1{  "text"s }; // std::string
    auto s2{ L"text"s }; // std::wstring
    auto s3{ u"text"s }; // std::u16string
    auto s4{ U"text"s }; // std::u32string
    using namespace std::string_view_literals;
    auto s5{ "text"sv }; // std::string_view
    
  • operator""h, operator""min, operator""s, operator""ms, operator""us, and operator""ns for creating an std::chrono::duration value:
    using namespace std::chrono_literals;
    // std::chrono::duration<long long>
    auto timer {2h + 42min + 15s};
    
  • operator""y for creating an std::chrono::year literal and operator""d for creating an std::chrono::day literal that represents a day of a month, both added to C++20:
    using namespace std::chrono_literals;
    auto year { 2020y }; // std::chrono::year
    auto day { 15d };    // std::chrono::day
    
  • operator""if, operator""i, and operator""il for creating an std::complex value:
    using namespace std::complex_literals;
    auto c{ 12.0 + 4.5i }; // std::complex<double>
    

The standard user-defined literals are available in multiple namespaces. For instance, the ""s and ""sv literals for strings are defined in the namespace std::literals::string_literals.

However, both literals and string_literals are inline namespaces. Therefore, you can access the literals with using namespace std::literals, using namespace std::string_literals, or using namespace std::literals::string_literals. In the previous examples, the second form was preferred.
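The equivalence of the three using directives can be verified with a short sketch. The namespace literal_ns_demo and the function names in it are hypothetical; each function brings the ""s suffix into scope in a different way. The last function also evaluates the duration from the earlier chrono example in seconds.

```cpp
#include <cassert>
#include <chrono>
#include <string>

namespace literal_ns_demo
{
  inline std::string via_literals()
  {
    using namespace std::literals;  // all standard literal suffixes
    return "a"s;
  }

  inline std::string via_string_literals()
  {
    using namespace std::string_literals;  // reachable through std
    return "b"s;
  }

  inline std::string via_fully_qualified()
  {
    using namespace std::literals::string_literals;  // full path
    return "c"s;
  }

  inline long long timer_in_seconds()
  {
    using namespace std::chrono_literals;
    auto const timer = 2h + 42min + 15s;  // 7200 + 2520 + 15 seconds
    return std::chrono::duration_cast<std::chrono::seconds>(timer)
             .count();
  }
}
```

All three functions compile and return an std::string, and the timer evaluates to 9,735 seconds.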

See also

  • Using raw string literals to avoid escaping characters to learn how to define string literals without the need to escape special characters
  • Creating raw user-defined literals to understand how to provide a custom interpretation of an input sequence so that it changes the normal behavior of the compiler
  • Using inline namespaces for symbol versioning in Chapter 1, Learning Modern Core Language Features, to learn how to version your source code using inline namespaces and conditional compilation

Creating raw user-defined literals

In the previous recipe, we looked at the way C++11 allows library implementers and developers to create user-defined literals and the user-defined literals available in the C++14 standard. However, user-defined literals have two forms: a cooked form, where the literal value is processed by the compiler before being supplied to the literal operator, and a raw form, in which the literal is not processed by the compiler before being supplied to the literal operator. The latter is only available for integral and floating-point types. Raw literals are useful for altering the compiler's normal behavior. For instance, a sequence such as 3.1415926 is interpreted by the compiler as a floating-point value, but with the use of a raw user-defined literal, it could be interpreted as a user-defined decimal value. In this recipe, we will look at creating raw user-defined literals.

Getting ready

Before continuing with this recipe, it is strongly recommended that you go through the previous one, Creating cooked user-defined literals, as general details about user-defined literals will not be reiterated here.

To exemplify the way raw user-defined literals can be created, we will define binary literals. These binary literals can be of 8-bit, 16-bit, and 32-bit (unsigned) types. These types will be called byte8, byte16, and byte32, and the literals we will create will be called _b8, _b16, and _b32.

How to do it...

To create raw user-defined literals, you should follow these steps:

  1. Define your literals in a separate namespace to avoid name clashes.
  2. Always prefix the user-defined suffix with an underscore (_).
  3. Define a literal operator or literal operator template of the following form:
    T operator "" _suffix(const char*);
    template<char...> T operator "" _suffix();
    

The following example shows a possible implementation of 8-bit, 16-bit, and 32-bit binary literals:

namespace binary
{
  using byte8  = unsigned char;
  using byte16 = unsigned short;
  using byte32 = unsigned int;
  namespace binary_literals
  {
    namespace binary_literals_internals
    {
      template <typename CharT, char... bits>
      struct binary_struct;
      template <typename CharT, char... bits>
      struct binary_struct<CharT, '0', bits...>
      {
        static constexpr CharT value{
          binary_struct<CharT, bits...>::value };
      };
      template <typename CharT, char... bits>
      struct binary_struct<CharT, '1', bits...>
      {
        static constexpr CharT value{
          static_cast<CharT>(1 << sizeof...(bits)) |
          binary_struct<CharT, bits...>::value };
      };
      template <typename CharT>
      struct binary_struct<CharT>
      {
        static constexpr CharT value{ 0 };
      };
    }
    template<char... bits>
    constexpr byte8 operator""_b8()
    {
      static_assert(
        sizeof...(bits) <= 8,
        "binary literal b8 must be up to 8 digits long");
      return binary_literals_internals::
                binary_struct<byte8, bits...>::value;
    }
    template<char... bits>
    constexpr byte16 operator""_b16()
    {
      static_assert(
        sizeof...(bits) <= 16,
        "binary literal b16 must be up to 16 digits long");
      return binary_literals_internals::
                binary_struct<byte16, bits...>::value;
    }
    template<char... bits>
    constexpr byte32 operator""_b32()
    {
      static_assert(
        sizeof...(bits) <= 32,
        "binary literal b32 must be up to 32 digits long");
      return binary_literals_internals::
                binary_struct<byte32, bits...>::value;
    }
  }
}

How it works...

First of all, we define everything inside a namespace called binary and start with introducing several type aliases: byte8, byte16, and byte32. These represent integral types of 8 bits, 16 bits, and 32 bits, as the names imply.

The implementation in the previous section enables us to define binary literals of the form 1010_b8 (a byte8 value of decimal 10) or 000010101100_b16 (a byte16 value of decimal 172). However, we want to make sure that we do not exceed the number of digits for each type. In other words, values such as 111100001_b8 should be illegal and the compiler should yield an error.

The literal operator templates are defined in a nested namespace called binary_literals. This is a good practice in order to avoid name collisions with other literal operators from other namespaces. Should something like that happen, you can choose to use the appropriate namespace in the right scope (such as one namespace in a function or block and another namespace in another function or block).

The three literal operator templates are very similar. The only things that are different are their names (_b8, _b16, and _b32), return type (byte8, byte16, and byte32), and the condition in the static assert that checks the number of digits.

We will explore the details of variadic templates and template recursion in a later recipe; however, for a better understanding, this is how this particular implementation works: bits is a template parameter pack that is not a single value, but all the values the template could be instantiated with. For example, if we consider the literal 1010_b8, then the literal operator template would be instantiated as operator"" _b8<'1', '0', '1', '0'>(). Before proceeding with computing the binary value, we check the number of digits in the literal. For _b8, this must not exceed eight (including any leading zeros). Similarly, it should be up to 16 digits for _b16 and 32 for _b32. For this, we use the sizeof... operator, which returns the number of elements in a parameter pack (in this case, bits).

If the number of digits is correct, we can proceed to expand the parameter pack and recursively compute the decimal value represented by the binary literal. This is done with the help of an additional class template and its specializations. These templates are defined in yet another nested namespace, called binary_literals_internals. This is also a good practice because it hides (without proper qualification) the implementation details from the client (unless an explicit using namespace directive makes them available to the current namespace).

Even though this looks like recursion, it is not a true runtime recursion. The compiler instantiates a separate specialization of the helper class template for each shorter pack and computes each value at compile time, so no recursive calls remain in the generated code. This is similar to variadic function templates, where the expansion yields calls to overloaded functions with a different number of parameters, as explained later in the Writing a function template with a variable number of arguments recipe.

The binary_struct class template has a template type of CharT for the return type of the function (we need this because our literal operator templates should return either byte8, byte16, or byte32) and a parameter pack:

template <typename CharT, char... bits>
struct binary_struct;

Several specializations of this class template are available with parameter pack decomposition. When the first digit of the pack is '0', the computed value remains the same, and we continue expanding the rest of the pack. If the first digit of the pack is '1', then the new value is 1 shifted to the left by the number of digits in the remainder of the pack, combined by a bitwise OR with the value of the rest of the pack:

template <typename CharT, char... bits>
struct binary_struct<CharT, '0', bits...>
{
  static constexpr CharT value{
    binary_struct<CharT, bits...>::value };
};
template <typename CharT, char... bits>
struct binary_struct<CharT, '1', bits...>
{
  static constexpr CharT value{
    static_cast<CharT>(1 << sizeof...(bits)) |
    binary_struct<CharT, bits...>::value };
};

The last specialization covers the case where the pack is empty; in this case, we return 0:

template <typename CharT>
struct binary_struct<CharT>
{
  static constexpr CharT value{ 0 };
};
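For comparison, the same value can be computed without helper specializations by using a C++17 fold expression. This is an alternative sketch, not the approach taken in this recipe; the namespace binary_fold_sketch and the function binary_value are hypothetical names. Each digit shifts the accumulated value one position to the left and adds the new bit, so 1010 evaluates as ((1 * 2 + 0) * 2 + 1) * 2 + 0 = 10.

```cpp
#include <cassert>

namespace binary_fold_sketch
{
  // Hypothetical helper: computes the value of a pack of binary
  // digits with fold expressions (requires C++17).
  template <typename T, char... bits>
  constexpr T binary_value()
  {
    // An empty pack folds to true, so the assertion also accepts it.
    static_assert(((bits == '0' || bits == '1') && ...),
                  "only the digits 0 and 1 are allowed");
    T value = 0;
    // Comma fold: processes the digits from left to right.
    ((value = static_cast<T>((value << 1) | (bits - '0'))), ...);
    return value;
  }

  static_assert(binary_value<unsigned char, '1', '0', '1', '0'>() == 10,
                "1010 in binary is 10 in decimal");
}
```

The recursive specializations shown in this recipe have the advantage of also working with C++11 and C++14 compilers.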

After defining these helper classes, we could implement the byte8, byte16, and byte32 binary literals as intended. Note that we need to bring the content of the namespace binary_literals into the current namespace in order to use the literal operator templates:

using namespace binary;
using namespace binary_literals;
auto b1 = 1010_b8;
auto b2 = 101010101010_b16;
auto b3 = 101010101010101010101010_b32;

The following definitions trigger compiler errors:

// binary literal b8 must be up to 8 digits long
auto b4 = 0011111111_b8;
// binary literal b16 must be up to 16 digits long
auto b5 = 001111111111111111_b16;
// binary literal b32 must be up to 32 digits long
auto b6 = 0011111111111111111111111111111111_b32;

The reason for this is that the condition in static_assert is not met. The sequence of digits preceding the literal suffix is longer than allowed in all three cases.
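The recipe uses the literal operator template form. The other raw form, which takes a const char*, receives the same digits as a null-terminated string. The following is a minimal sketch of that form under stated assumptions: the namespace binary_raw_sketch and the suffix _rawbin are hypothetical, the function requires C++14 because of the loop in a constexpr function, and it does not check the number of digits.

```cpp
#include <cassert>
#include <stdexcept>

namespace binary_raw_sketch
{
  // Raw literal operator: `digits` points to the characters of the
  // literal exactly as written in the source code.
  constexpr unsigned int operator""_rawbin(const char* digits)
  {
    unsigned int value = 0;
    for (const char* p = digits; *p != '\0'; ++p)
    {
      // Reaching this throw during constant evaluation is a compile
      // error; at runtime it raises an exception.
      if (*p != '0' && *p != '1')
        throw std::invalid_argument("only 0 and 1 are allowed");
      value = (value << 1) | static_cast<unsigned int>(*p - '0');
    }
    return value;
  }

  static_assert(1010_rawbin == 10u, "1010 in binary is 10");
}
```

Unlike the template form, this version cannot use sizeof... to count the digits, so a length check would have to be written inside the loop.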

See also

  • Using raw string literals to avoid escaping characters to learn how to define string literals without the need to escape special characters
  • Creating cooked user-defined literals to learn how to create literals of user-defined types
  • Writing a function template with a variable number of arguments in Chapter 3 to see how variadic templates enable us to write functions that can take any number of arguments
  • Creating type aliases and alias templates in Chapter 1 to learn about aliases for types

Using raw string literals to avoid escaping characters

Strings may contain special characters, such as non-printable characters (newline, horizontal and vertical tab, and so on), string and character delimiters (double and single quotes), or arbitrary octal, hexadecimal, or Unicode values. These special characters are introduced with an escape sequence that starts with a backslash, followed by either the character (examples include \' and \"), its designated letter (examples include \n for a new line, \t for a horizontal tab), or its value (examples include octal \050, hexadecimal \xF7, or Unicode \u16F0). As a result, the backslash character itself has to be escaped with another backslash character. This leads to more complicated literal strings that can be hard to read.

To avoid escaping characters, C++11 introduced raw string literals that do not process escape sequences. In this recipe, you will learn how to use the various forms of raw string literals.

Getting ready

In this recipe, and throughout the rest of this book, I will use the s suffix to define basic_string literals. This was covered earlier in this chapter in the Creating cooked user-defined literals recipe.

How to do it...

To avoid escaping characters, define the string literals with one of the following forms:

  • R"( literal )" as the default form:
    auto filename {R"(C:\Users\Marius\Documents\)"s};
    auto pattern {R"((\w+)=(\d+)$)"s};
    auto sqlselect {
      R"(SELECT *
      FROM Books
      WHERE Publisher='Packtpub'
      ORDER BY PubDate DESC)"s};
    
  • R"delimiter( literal )delimiter", where delimiter is any sequence of at most 16 characters excluding parentheses, backslash, and spaces, and literal is any sequence of characters with the limitation that it cannot include the closing sequence )delimiter". Here is an example with !! as delimiter:
    auto text{ R"!!(This text contains both "( and )".)!!"s };
    std::cout << text << '\n';
    

How it works...

When raw string literals are used, escapes are not processed, and the actual content of the string is written between the delimiters (in other words, what you see is what you get). The following example shows what appears to be the same raw string literal twice; however, the second one still contains escaped backslashes. Since escape sequences are not processed in raw string literals, they are printed as they are in the output:

auto filename1 {R"(C:\Users\Marius\Documents\)"s};
auto filename2 {R"(C:\\Users\\Marius\\Documents\\)"s};
// prints C:\Users\Marius\Documents\
std::cout << filename1 << '\n';
// prints C:\\Users\\Marius\\Documents\\
std::cout << filename2 << '\n';

If the text has to contain the )" sequence, then a different delimiter must be used, in the R"delimiter( literal )delimiter" form. According to the standard, the possible characters in a delimiter can be as follows:

Any member of the basic source character set except: space, the left parenthesis (, the right parenthesis ), the backslash \, and the control characters representing horizontal tab, vertical tab, form feed, and newline.
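To make the behavior described above concrete, the following sketch compares a raw string literal with its escaped equivalent and uses a custom delimiter to embed the )" sequence. The namespace raw_string_demo and its function names are hypothetical.

```cpp
#include <cassert>
#include <string>

namespace raw_string_demo
{
  // Raw form: backslashes are taken literally.
  inline std::string raw_path()
  {
    return R"(C:\Users\Marius\Documents\)";
  }

  // Ordinary form: every backslash must be escaped.
  inline std::string escaped_path()
  {
    return "C:\\Users\\Marius\\Documents\\";
  }

  // The !! delimiter allows the )" sequence inside the text.
  inline std::string with_delimiter()
  {
    return R"!!(text with )" inside)!!";
  }
}
```

Both path functions return the same string, and the third one returns the text between the delimiters unchanged, including the embedded )" sequence.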

Raw string literals can be prefixed by one of L, u8, u, and U to indicate a wide, UTF-8, UTF-16, or UTF-32 string literal, respectively. The following are examples of such string literals:

auto t1{ LR"(text)"  };  // const wchar_t*
auto t2{ u8R"(text)" };  // const char* (const char8_t* in C++20)
auto t3{ uR"(text)"  };  // const char16_t*
auto t4{ UR"(text)"  };  // const char32_t*
auto t5{ LR"(text)"s  }; // wstring
auto t6{ u8R"(text)"s }; // string (u8string in C++20)
auto t7{ uR"(text)"s  }; // u16string
auto t8{ UR"(text)"s  }; // u32string

Note that the presence of the suffix ""s at the end of the string makes the compiler deduce the type as the corresponding string class and not as a pointer to a character array.

See also

  • Creating cooked user-defined literals to learn how to create literals of user-defined types

Creating a library of string helpers

The string types from the standard library are a general-purpose implementation that lacks many helpful methods, such as changing the case, trimming, splitting, and others that may address different developer needs. Third-party libraries that provide rich sets of string functionalities exist. However, in this recipe, we will look at implementing several simple, yet helpful, methods you may often need in practice. The purpose is rather to see how string methods and standard general algorithms can be used for manipulating strings, but also to have a reference to reusable code that can be used in your applications.

In this recipe, we will implement a small library of string utilities that will provide functions for the following:

  • Changing a string into lowercase or uppercase
  • Reversing a string
  • Trimming white spaces from the beginning and/or the end of the string
  • Trimming a specific set of characters from the beginning and/or the end of the string
  • Removing occurrences of a character anywhere in the string
  • Tokenizing a string using a specific delimiter

Before we start with the implementation, let's look at some prerequisites.

Getting ready

The string library we will be implementing should work with all the standard string types; that is, std::string, std::wstring, std::u16string, and std::u32string.

To avoid specifying long names such as std::basic_string<CharT, std::char_traits<CharT>, std::allocator<CharT>>, we will use the following alias templates for strings and string streams:

template <typename CharT>
using tstring =
  std::basic_string<CharT, std::char_traits<CharT>,
                    std::allocator<CharT>>;
template <typename CharT>
using tstringstream =
  std::basic_stringstream<CharT, std::char_traits<CharT>,
                          std::allocator<CharT>>;

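To implement these string helper functions, we need to include the header <string> for strings and <algorithm> for the general standard algorithms we will use. The split function shown later also needs <sstream> for the string stream and <vector> for the resulting collection of tokens.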

In all the examples in this recipe, we will use the standard user-defined literal operators for strings from C++14, for which we need to explicitly use the std::string_literals namespace.

How to do it...

  1. To convert a string to lowercase or uppercase, apply the tolower() or toupper() functions to the characters of a string using the general-purpose algorithm std::transform():
    template<typename CharT>
    inline tstring<CharT> to_upper(tstring<CharT> text)
    {
      std::transform(std::begin(text), std::end(text),
                     std::begin(text), toupper);
      return text;
    }
    template<typename CharT>
    inline tstring<CharT> to_lower(tstring<CharT> text)
    {
      std::transform(std::begin(text), std::end(text),
                     std::begin(text), tolower);
      return text;
    }
    
  2. To reverse a string, use the general-purpose algorithm std::reverse():
    template<typename CharT>
    inline tstring<CharT> reverse(tstring<CharT> text)
    {
      std::reverse(std::begin(text), std::end(text));
      return text;
    }
    
  3. To trim a string, at the beginning, end, or both, use the std::basic_string methods find_first_not_of() and find_last_not_of():
    template<typename CharT>
    inline tstring<CharT> trim(tstring<CharT> const & text)
    {
      auto first{ text.find_first_not_of(' ') };
      auto last{ text.find_last_not_of(' ') };
      return text.substr(first, (last - first + 1));
    }
    template<typename CharT>
    inline tstring<CharT> trimleft(tstring<CharT> const & text)
    {
      auto first{ text.find_first_not_of(' ') };
      return text.substr(first, text.size() - first);
    }
    template<typename CharT>
    inline tstring<CharT> trimright(tstring<CharT> const & text)
    {
      auto last{ text.find_last_not_of(' ') };
      return text.substr(0, last + 1);
    }
    
  4. To trim characters in a given set from a string, use overloads of the std::basic_string methods find_first_not_of() and find_last_not_of(), which take a string parameter that defines the set of characters to look for:
    template<typename CharT>
    inline tstring<CharT> trim(tstring<CharT> const & text,
                               tstring<CharT> const & chars)
    {
      auto first{ text.find_first_not_of(chars) };
      auto last{ text.find_last_not_of(chars) };
      return text.substr(first, (last - first + 1));
    }
    template<typename CharT>
    inline tstring<CharT> trimleft(tstring<CharT> const & text,
                                   tstring<CharT> const & chars)
    {
      auto first{ text.find_first_not_of(chars) };
      return text.substr(first, text.size() - first);
    }
    template<typename CharT>
    inline tstring<CharT> trimright(tstring<CharT> const &text,
                                    tstring<CharT> const &chars)
    {
      auto last{ text.find_last_not_of(chars) };
      return text.substr(0, last + 1);
    }
    
  5. To remove characters from a string, use std::remove_if() and std::basic_string::erase():
    template<typename CharT>
    inline tstring<CharT> remove(tstring<CharT> text,
                                 CharT const ch)
    {
      auto start = std::remove_if(
                      std::begin(text), std::end(text),
                      [=](CharT const c) {return c == ch; });
      text.erase(start, std::end(text));
      return text;
    }
    
  6. To split a string based on a specified delimiter, use std::getline() to read from an std::basic_stringstream initialized with the content of the string. The tokens extracted from the stream are pushed into a vector of strings:
    template<typename CharT>
    inline std::vector<tstring<CharT>> split
      (tstring<CharT> text, CharT const delimiter)
    {
      auto sstr = tstringstream<CharT>{ text };
      auto tokens = std::vector<tstring<CharT>>{};
      auto token = tstring<CharT>{};
      while (std::getline(sstr, token, delimiter))
      {
        if (!token.empty()) tokens.push_back(token);
      }
      return tokens;
    }
    

How it works...

To implement the utility functions from the library, we have two options:

  • Functions would modify a string passed by a reference
  • Functions would not alter the original string but return a new string

The second option has the advantage that it preserves the original string, which may be helpful in many cases. Otherwise, in those cases, you would first have to make a copy of the string and alter the copy. The implementation provided in this recipe takes the second approach.
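The difference between the two options can be sketched with a pair of functions. The names reverse_inplace and reversed are hypothetical and only serve this comparison; they are restricted to std::string to keep the example short.

```cpp
#include <algorithm>
#include <string>

// Option 1: alters the string owned by the caller.
inline void reverse_inplace(std::string & text)
{
  std::reverse(text.begin(), text.end());
}

// Option 2: takes the argument by value, so the caller's string
// is untouched, and returns the altered copy.
inline std::string reversed(std::string text)
{
  std::reverse(text.begin(), text.end());
  return text;
}
```

With the second form, a call such as reversed(name) leaves name unchanged, whereas reverse_inplace(name) overwrites it.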

The first functions we implemented in the How to do it... section were to_upper() and to_lower(). These functions change the content of a string either to uppercase or lowercase. The simplest way to implement this is using the std::transform() standard algorithm. This is a general-purpose algorithm that applies a function to every element of a range (defined by a begin and end iterator) and stores the result in another range for which only the begin iterator needs to be specified. The output range can be the same as the input range, which is exactly what we did to transform the string. The applied function is toupper() or tolower():

auto ut{ string_library::to_upper("this is not UPPERCASE"s) };
// ut = "THIS IS NOT UPPERCASE"
auto lt{ string_library::to_lower("THIS IS NOT lowercase"s) };
// lt = "this is not lowercase"
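One caveat applies to the <cctype> versions of these functions when they are used with char: they take an int, and the behavior is undefined if the value is neither representable as unsigned char nor equal to EOF. A possible way around this, shown here only as a sketch limited to std::string (the name to_upper_char is an assumption of this example), is to convert each character to unsigned char first:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Variant for std::string only. Each character reaches
// std::toupper() as an unsigned char value.
inline std::string to_upper_char(std::string text)
{
  std::transform(text.begin(), text.end(), text.begin(),
    [](unsigned char c) {
      return static_cast<char>(std::toupper(c));
    });
  return text;
}
```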

The next function we considered was reverse(), which, as the name implies, reverses the content of a string. For this, we used the std::reverse() standard algorithm. This general-purpose algorithm reverses the elements of a range defined by a begin and end iterator:

auto rt{string_library::reverse("cookbook"s)}; // rt = "koobkooc"

When it comes to trimming, a string can be trimmed at the beginning, end, or both sides. Because of that, we implemented three different functions: trim() for trimming at both ends, trimleft() for trimming at the beginning of a string, and trimright() for trimming at the end of a string. The first version of these functions trims only spaces. In order to find the part to keep, we use the find_first_not_of() and find_last_not_of() methods of std::basic_string. These return the positions of the first and last characters in the string that differ from the specified character. Subsequently, a call to the substr() method of std::basic_string returns a new string. The substr() method takes an index in the string and a number of elements to copy to the new string:

auto text1{"   this is an example   "s};
// t1 = "this is an example"
auto t1{ string_library::trim(text1) };
// t2 = "this is an example   "
auto t2{ string_library::trimleft(text1) };
// t3 = "   this is an example"
auto t3{ string_library::trimright(text1) };
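The trim() and trimleft() functions shown earlier assume that the text contains at least one character that is not a space. When it does not, find_first_not_of() returns npos, and substr() throws std::out_of_range for a start position greater than the size of the string. A guarded variant could look like the following sketch; the name trim_guarded is hypothetical, and the function is written for std::string only.

```cpp
#include <string>

// Like trim(), but returns an empty string when the text is empty
// or consists of spaces only.
inline std::string trim_guarded(std::string const & text)
{
  auto const first = text.find_first_not_of(' ');
  if (first == std::string::npos) return {};  // nothing to keep
  auto const last = text.find_last_not_of(' ');
  return text.substr(first, last - first + 1);
}
```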

Sometimes, it can be useful to trim characters other than spaces from a string. In order to do that, we provided overloads for the trimming functions that specify a set of characters to be removed. That set is also specified as a string. The implementation is very similar to the previous one because both find_first_not_of() and find_last_not_of() have overloads that take a string containing the characters to be excluded from the search:

auto chars1{" !%\n\r"s};
auto text3{"!!  this % needs a lot\rof trimming  !\n"s};
auto t7{ string_library::trim(text3, chars1) };
// t7 = "this % needs a lot\rof trimming"
auto t8{ string_library::trimleft(text3, chars1) };
// t8 = "this % needs a lot\rof trimming  !\n"
auto t9{ string_library::trimright(text3, chars1) };
// t9 = "!!  this % needs a lot\rof trimming"

If removing characters from any part of the string is necessary, the trimming functions are not helpful because they only treat a contiguous sequence of characters at the start and end of a string. For that, however, we implemented a simple remove() function. This uses the std::remove_if() standard algorithm.

Both std::remove() and std::remove_if() work in a way that may not be very intuitive at first. They do not change the size of the range defined by the first and last iterators. Instead, they rearrange the content of the range (using move assignment) so that the elements to be kept end up at the front, in their original order. The function returns an iterator to the first position after the kept elements; whatever lies between this iterator and the end of the range has unspecified values. This iterator basically defines the new end of the range that was modified. If no element was removed, the returned iterator is the end iterator of the original range. The value of this returned iterator is then used to call the std::basic_string::erase() method, which actually erases the content of the string defined by two iterators. The two iterators in our case are the iterator returned by std::remove_if() and the end of the string:

auto text4{"must remove all * from text**"s};
auto t10{ string_library::remove(text4, '*') };
// t10 = "must remove all  from text"
auto t11{ string_library::remove(text4, '!') };
// t11 = "must remove all * from text**"
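The two steps can be made visible by separating them. The following sketch works on a std::string in place and reports how many characters were dropped; the function name remove_char is an assumption of this example and is not part of the library from this recipe.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>

// Removes every occurrence of ch from text and returns the number
// of characters that were removed.
inline std::size_t remove_char(std::string & text, char ch)
{
  // Step 1: move the characters to keep toward the front.
  auto new_end = std::remove(text.begin(), text.end(), ch);
  // The size of the string has not changed yet, so the distance
  // to the old end is the number of removed characters.
  auto const removed = static_cast<std::size_t>(
    std::distance(new_end, text.end()));
  // Step 2: erase the leftover tail to shrink the string.
  text.erase(new_end, text.end());
  return removed;
}
```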

The last method we implemented, split(), splits the content of a string based on a specified delimiter. There are various ways to implement this. In this implementation, we used std::getline(). This function reads characters from an input stream until a specified delimiter is found and places the characters in a string. Before starting to read from the input buffer, it calls erase() on the output string to clear its content. Calling this method in a loop produces tokens that are placed in a vector. In our implementation, empty tokens were skipped from the result set:

auto text5{"this text will be split   "s};
auto tokens1{ string_library::split(text5, ' ') };
// tokens1 = {"this", "text", "will", "be", "split"}
auto tokens2{ string_library::split(""s, ' ') };
// tokens2 = {}

Two examples of text splitting are shown here. In the first example, the text from the text5 variable is split into words and, as mentioned earlier, empty tokens are ignored. In the second example, splitting an empty string produces an empty vector of tokens.
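Skipping empty tokens is a design choice and not a property of std::getline(). For data where an empty field is meaningful, such as comma-separated values, a variant that keeps them may be preferable. The following is a sketch for std::string only; the name split_keep_empty is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Same loop as split(), but every token read by std::getline()
// is stored, including the empty ones.
inline std::vector<std::string> split_keep_empty(
  std::string const & text, char delimiter)
{
  std::istringstream sstr{ text };
  std::vector<std::string> tokens;
  std::string token;
  while (std::getline(sstr, token, delimiter))
  {
    tokens.push_back(token);
  }
  return tokens;
}
```

For the input "a,,b" and the delimiter ',', this variant yields three tokens, the second of which is empty.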

See also

  • Creating cooked user-defined literals to learn how to create literals of user-defined types
  • Creating type aliases and alias templates in Chapter 1, Learning Modern Core Language Features, to learn about aliases for types

Verifying the format of a string using regular expressions

Regular expressions are a language intended for performing pattern matching and replacements in texts. C++11 provides support for regular expressions within the standard library through a set of classes, algorithms, and iterators available in the header <regex>. In this recipe, we will learn how regular expressions can be used to verify that a string matches a pattern (examples can include verifying an email or IP address formats).

Getting ready

Throughout this recipe, we will explain, whenever necessary, the details of the regular expressions that we use. However, you should have at least some basic knowledge of regular expressions in order to use the C++ standard library for regular expressions. A description of regular expressions syntax and standards is beyond the purpose of this book; if you are not familiar with regular expressions, it is recommended that you read more about them before continuing with the recipes that focus on regular expressions. Good online resources for learning, building, and debugging regular expressions can be found at https://regexr.com and https://regex101.com.

How to do it...

In order to verify that a string matches a regular expression, perform the following steps:

  1. Include the headers <regex> and <string> and the namespace std::string_literals for C++14 standard user-defined literals for strings:
    #include <regex>
    #include <string>
    using namespace std::string_literals;
    
  2. Use raw string literals to specify the regular expression to avoid escaping backslashes (which can occur frequently). The following regular expression validates most email formats:
    auto pattern {R"(^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$)"s};
    
  3. Create an std::regex/std::wregex object (depending on the character set that is used) to encapsulate the regular expression:
    auto rx = std::regex{pattern};
    
  4. To ignore casing or specify other parsing options, use an overloaded constructor that has an extra parameter for regular expression flags:
    auto rx = std::regex{pattern, std::regex_constants::icase};
    
  5. Use std::regex_match() to match the regular expression with an entire string:
    auto valid = std::regex_match("marius@domain.com"s, rx);
    

How it works...

Considering the problem of verifying the format of email addresses, even though this may look like a trivial problem, in practice, it is hard to find a simple regular expression that covers all the possible cases for valid email formats. In this recipe, we will not try to find that ultimate regular expression, but rather apply a regular expression that is good enough for most cases. The regular expression we will use for this purpose is this:

^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$

The following table explains the structure of the regular expression:

Part | Description
^ | Start of the string.
[A-Z0-9._%+-]+ | At least one character that is in the range A-Z or 0-9, or is one of ., _, %, +, or -; this represents the local part of the email address.
@ | The character @.
[A-Z0-9.-]+ | At least one character that is in the range A-Z or 0-9, or is one of . or -; this represents the hostname of the domain part.
\. | A dot that separates the domain hostname and label.
[A-Z]{2,} | The DNS label of a domain, which must have at least two characters. A DNS label can have at most 63 characters, but the pattern does not enforce this upper limit.
$ | End of the string.

Bear in mind that, in practice, a domain name is composed of a hostname followed by a dot-separated list of DNS labels. Examples include localhost, gmail.com, and yahoo.co.uk. The regular expression we are using does not match domains without DNS labels, such as localhost (an email such as root@localhost is a valid email). The domain name can also be an IP address specified in brackets, such as [192.168.100.11] (as in john.doe@[192.168.100.11]). Email addresses containing such domains will not match the regular expression defined previously. Even though these rather rare formats will not be matched, the regular expression can cover most email formats.

The regular expression for the example in this chapter is provided for didactical purposes only, and is not intended to be used as it is in production code. As explained earlier, this sample does not cover all possible email formats.

We began by including the necessary headers; that is, <regex> for regular expressions and <string> for strings. The is_valid_email_format() function, shown in the following code (which basically contains the samples from the How to do it... section), takes a string representing an email address and returns a Boolean indicating whether the email has a valid format or not.

We first construct an std::regex object to encapsulate the regular expression indicated with the raw string literal. Using raw string literals is helpful because it avoids escaping backslashes, which are used for escape characters in regular expressions too. The function then calls std::regex_match(), passing the input text and the regular expression:

bool is_valid_email_format(std::string const & email)
{
  auto pattern {R"(^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$)"s};
  auto rx = std::regex{pattern};
  return std::regex_match(email, rx);
}

The std::regex_match() method tries to match the regular expression against the entire string. If successful, it returns true; otherwise, it returns false:

auto ltest = [](std::string const & email)
{
  std::cout << std::setw(30) << std::left
            << email << " : "
            << (is_valid_email_format(email) ?
                "valid format" : "invalid format")
            << '\n';
};
ltest("JOHN.DOE@DOMAIN.COM"s);         // valid format
ltest("JOHNDOE@DOMAIL.CO.UK"s);        // valid format
ltest("JOHNDOE@DOMAIL.INFO"s);         // valid format
ltest("J.O.H.N_D.O.E@DOMAIN.INFO"s);   // valid format
ltest("ROOT@LOCALHOST"s);              // invalid format
ltest("john.doe@domain.com"s);         // invalid format

In this simple test, the only emails that do not match the regular expression are ROOT@LOCALHOST and john.doe@domain.com. The first contains a domain name without a dot-prefixed DNS label, and that case is not covered in the regular expression. The second contains only lowercase letters, and in the regular expression, the valid set of characters for both the local part and the domain name was uppercase letters, A to Z.

Instead of complicating the regular expression with additional valid characters (such as [A-Za-z0-9._%+-]), we can specify that the match should ignore the case. This can be done with an additional parameter to the constructor of the std::basic_regex class. The available constants for this purpose are defined in the regex_constants namespace. The following slight change to is_valid_email_format() will make it ignore the case and allow emails with both lowercase and uppercase letters to correctly match the regular expression:

bool is_valid_email_format(std::string const & email)
{
  auto rx = std::regex{
    R"(^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$)"s,
    std::regex_constants::icase};
  return std::regex_match(email, rx);
}

This is_valid_email_format() function is pretty simple, and if the regular expression was provided as a parameter, along with the text to match, it could be used for matching anything. However, it would be nice to be able to handle not only multi-byte strings (std::string), but also wide strings (std::wstring), with a single function. This can be achieved by creating a function template where the character type is provided as a template parameter:

template <typename CharT>
using tstring = std::basic_string<CharT, std::char_traits<CharT>,
                                  std::allocator<CharT>>;
template <typename CharT>
bool is_valid_format(tstring<CharT> const & pattern,
                     tstring<CharT> const & text)
{
  auto rx = std::basic_regex<CharT>{
    pattern, std::regex_constants::icase };
  return std::regex_match(text, rx);
}

We start by creating an alias template for std::basic_string in order to simplify its use. The new is_valid_format() function is a function template very similar to our implementation of is_valid_email_format(). However, we now use std::basic_regex<CharT> instead of the typedef std::regex, which is std::basic_regex<char>, and the pattern is provided as the first argument. We now implement a new function called is_valid_email_format_w() for wide strings that relies on this function template. The function template, however, can be reused for implementing other validations, such as checking whether a license plate has a particular format:

bool is_valid_email_format_w(std::wstring const & text)
{
  return is_valid_format(
    LR"(^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$)"s,
    text);
}
auto ltest2 = [](auto const & email)
{
  std::wcout << std::setw(30) << std::left
     << email << L" : "
     << (is_valid_email_format_w(email) ? L"valid" : L"invalid")
     << '\n';
};
ltest2(L"JOHN.DOE@DOMAIN.COM"s);       // valid
ltest2(L"JOHNDOE@DOMAIL.CO.UK"s);      // valid
ltest2(L"JOHNDOE@DOMAIL.INFO"s);       // valid
ltest2(L"J.O.H.N_D.O.E@DOMAIN.INFO"s); // valid
ltest2(L"ROOT@LOCALHOST"s);            // invalid
ltest2(L"john.doe@domain.com"s);       // valid

Of all the examples shown here, the only one that does not match is ROOT@LOCALHOST, as expected.
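To illustrate how the function template might be reused for another validation, the following sketch checks a license plate. The plate format (three letters, a hyphen, two letters, a space, and three or four digits) and the function name is_valid_license_plate are assumptions made for this example only; the alias and the function template are repeated so that the snippet is self-contained.

```cpp
#include <regex>
#include <string>

template <typename CharT>
using tstring = std::basic_string<CharT, std::char_traits<CharT>,
                                  std::allocator<CharT>>;

template <typename CharT>
bool is_valid_format(tstring<CharT> const & pattern,
                     tstring<CharT> const & text)
{
  auto rx = std::basic_regex<CharT>{
    pattern, std::regex_constants::icase };
  return std::regex_match(text, rx);
}

// Hypothetical format: LLL-LL DDD or LLL-LL DDDD
inline bool is_valid_license_plate(std::string const & text)
{
  return is_valid_format(
    std::string(R"([A-Z]{3}-[A-Z]{2} \d{3,4})"), text);
}
```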

The std::regex_match() method has, in fact, several overloads, and some of them have a parameter that is a reference to an std::match_results object to store the result of the match. If there is no match, then std::match_results is empty and its size is 0. Otherwise, if there is a match, the std::match_results object is not empty and its size is 1, plus the number of matched subexpressions.
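The relation between the number of capture groups and the size of the results can be observed with a small sketch. The time-like pattern with three capture groups and the function name submatch_count are only sample choices for this example.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Returns the size of the match results after trying to match
// the entire text.
inline std::size_t submatch_count(std::string const & text)
{
  // Sample pattern with three capture groups.
  static std::regex const rx(R"((\d{2}):(\d{2}):(\d{2}))");
  std::smatch result;
  std::regex_match(text, result, rx);
  return result.size();
}
```

A text such as "12:34:56" gives a size of 4 (the entire match plus three subexpressions), while a text that does not match gives 0.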

The following version of the function uses the mentioned overloads and returns the matched subexpressions in an std::smatch object. Note that the regular expression is changed as three capture groups are defined: one for the local part, one for the hostname part of the domain, and one for the DNS label. If the match is successful, then the std::smatch object will contain four submatch objects: the first to match the entire string, the second for the first capture group (the local part), the third for the second capture group (the hostname), and the fourth for the third and last capture group (the DNS label). The result is returned in a tuple, where the first item actually indicates success or failure:

std::tuple<bool, std::string, std::string, std::string>
is_valid_email_format_with_result(std::string const & email)
{
  auto rx = std::regex{
    R"(^([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,})$)"s,
    std::regex_constants::icase };
  auto result = std::smatch{};
  auto success = std::regex_match(email, result, rx);
  return std::make_tuple(
    success,
    success ? result[1].str() : ""s,
    success ? result[2].str() : ""s,
    success ? result[3].str() : ""s);
}

Following the preceding code, we use C++17 structured bindings to unpack the content of the tuple into named variables:

auto ltest3 = [](std::string const & email)
{
  auto [valid, localpart, hostname, dnslabel] =
    is_valid_email_format_with_result(email);
  std::cout << std::setw(30) << std::left
     << email << " : "
     << std::setw(10) << (valid ? "valid" : "invalid")
     << "local=" << localpart
     << ";domain=" << hostname
     << ";dns=" << dnslabel
     << '\n';
};
ltest3("JOHN.DOE@DOMAIN.COM"s);
ltest3("JOHNDOE@DOMAIL.CO.UK"s);
ltest3("JOHNDOE@DOMAIL.INFO"s);
ltest3("J.O.H.N_D.O.E@DOMAIN.INFO"s);
ltest3("ROOT@LOCALHOST"s);
ltest3("john.doe@domain.com"s);

The output of the program will be as follows:

Figure 2.3: Output of tests

There's more...

There are multiple versions of regular expressions, and the C++ standard library supports six of them: ECMAScript, basic POSIX, extended POSIX, awk, grep, and egrep (grep with the option -E). The default grammar used is ECMAScript, and in order to use another, you have to explicitly specify the grammar when defining the regular expression. In addition to specifying the grammar, you can also specify parsing options, such as matching by ignoring the case.
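The grammar and the parsing options are combined with the bitwise OR operator and passed as the second argument of the constructor. The following is a minimal sketch that selects the extended POSIX grammar together with case-insensitive matching; the pattern and the function name is_color_word are arbitrary choices for this example.

```cpp
#include <regex>
#include <string>

// Matches "color" or "colour", ignoring the case, using the
// extended POSIX grammar instead of the default ECMAScript.
inline bool is_color_word(std::string const & text)
{
  static std::regex const rx(
    "colou?r",
    std::regex_constants::extended | std::regex_constants::icase);
  return std::regex_match(text, rx);
}
```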

The standard library provides more classes and algorithms than what we have seen so far. The main classes available in the library are as follows (all of them are class templates and, for convenience, typedefs are provided for different character types):

  • The class template std::basic_regex defines the regular expression object:
    typedef basic_regex<char>    regex;
    typedef basic_regex<wchar_t> wregex;
    
  • The class template std::sub_match represents a sequence of characters that matches a capture group; this class is actually derived from std::pair, and its first and second members represent iterators to the first and the one-past-end characters in the match sequence. If there is no match sequence, the two iterators are equal:
    typedef sub_match<const char *>            csub_match;
    typedef sub_match<const wchar_t *>         wcsub_match;
    typedef sub_match<string::const_iterator>  ssub_match;
    typedef sub_match<wstring::const_iterator> wssub_match;
    
  • The class template std::match_results is a collection of matches; the first element is always a full match in the target, while the other elements are matches of subexpressions:
    typedef match_results<const char *>            cmatch;
    typedef match_results<const wchar_t *>         wcmatch;
    typedef match_results<string::const_iterator>  smatch;
    typedef match_results<wstring::const_iterator> wsmatch;
    

The algorithms available in the regular expressions standard library are as follows:

  • std::regex_match(): This tries to match a regular expression (represented by an std::basic_regex instance) to an entire string.
  • std::regex_search(): This tries to match a regular expression (represented by an std::basic_regex instance) to a part of a string (including the entire string).
  • std::regex_replace(): This replaces matches from a regular expression according to a specified format.
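The difference between the three algorithms can be seen by applying them to the same text and the same regular expression. This is only a sketch; the struct regex_demo, the function run_regex_demo, and the pattern that matches a sequence of digits are made up for this example.

```cpp
#include <regex>
#include <string>

struct regex_demo
{
  bool        whole;     // result of std::regex_match()
  bool        partial;   // result of std::regex_search()
  std::string replaced;  // result of std::regex_replace()
};

inline regex_demo run_regex_demo(std::string const & text)
{
  static std::regex const rx(R"(\d+)");
  return regex_demo{
    std::regex_match(text, rx),           // entire text is digits?
    std::regex_search(text, rx),          // digits anywhere?
    std::regex_replace(text, rx, "#") };  // each run of digits -> #
}
```

For the text "id: 42", the entire string is not a number, so whole is false, but partial is true and replaced is "id: #".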

The iterators available in the regular expressions standard library are as follows:

  • std::regex_iterator: A constant forward iterator used to iterate through the occurrences of a pattern in a string. It has a pointer to an std::basic_regex that must live until the iterator is destroyed. Upon creation and when incremented, the iterator calls std::regex_search() and stores a copy of the std::match_results object returned by the algorithm.
  • std::regex_token_iterator: A constant forward iterator used to iterate through the submatches of every match of a regular expression in a string. Internally, it uses a std::regex_iterator to step through the submatches. Since it stores a pointer to an std::basic_regex instance, the regular expression object must live until the iterator is destroyed.
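As a brief illustration of the token iterator, the following sketch uses it to split a text. It relies on the submatch index -1, which selects the parts of the text that lie between the matches rather than the matches themselves. The separator pattern and the function name split_on_commas are assumptions of this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Splits text on commas that may be surrounded by whitespace.
inline std::vector<std::string> split_on_commas(
  std::string const & text)
{
  // The regular expression object is static, so it outlives
  // the iterators that refer to it.
  static std::regex const rx(R"(\s*,\s*)");
  // Index -1: iterate over the text between the matches.
  std::sregex_token_iterator first(text.begin(), text.end(), rx, -1);
  std::sregex_token_iterator last;  // end-of-sequence iterator
  return std::vector<std::string>(first, last);
}
```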

It should be mentioned that the standard regex library has poorer performance compared to other implementations (such as Boost.Regex) and does not support Unicode. Moreover, it could be argued that the API itself is cumbersome to use.

See also

  • Parsing the content of a string using regular expressions to learn how to perform multiple matches of a pattern in a text
  • Replacing the content of a string using regular expressions to see how to perform text replacements with the help of regular expressions
  • Using structured bindings to handle multi-return values in Chapter 1, Learning Modern Core Language Features, to learn how to bind variables to subobjects or elements from the initializing expressions

Parsing the content of a string using regular expressions

In the previous recipe, we looked at how to use std::regex_match() to verify that the content of a string matches a particular format. The library provides another algorithm called std::regex_search() that matches a regular expression against any part of a string, and not only the entire string, as regex_match() does. This function, however, does not allow us to search through all the occurrences of a regular expression in an input string. For this purpose, we need to use one of the iterator classes available in the library.

In this recipe, you will learn how to parse the content of a string using regular expressions. For this purpose, we will consider the problem of parsing a text file containing name-value pairs. Each such pair is defined on a different line and has the format name = value, but lines starting with a # represent comments and must be ignored. The following is an example:

#remove # to uncomment a line
timeout=120
server = 127.0.0.1
#retrycount=3

Before looking at the implementation details, let's consider some prerequisites.

Getting ready

For general information about regular expression support in C++11, refer to the Verifying the format of a string using regular expressions recipe, earlier in this chapter. Basic knowledge of regular expressions is required to proceed with this recipe.

In the following examples, text is a variable that's defined as follows:

auto text {
  R"(
    #remove # to uncomment a line
    timeout=120
    server = 127.0.0.1
    #retrycount=3
  )"s};

The sole purpose of this is to simplify our snippets, although in a real-world example, you will probably be reading the text from a file or other source.

How to do it...

In order to search for occurrences of a regular expression through a string, you should do the following:

  1. Include the headers <regex> and <string> and the namespace std::string_literals for C++14 standard user-defined literals for strings:
    #include <regex>
    #include <string>
    using namespace std::string_literals;
    
  2. Use raw string literals to specify a regular expression in order to avoid escaping backslashes (which can occur frequently). The following regular expression validates the file format proposed earlier:
    auto pattern {R"(^(?!#)(\w+)\s*=\s*([\w\d]+[\w\d._,\-:]*)$)"s};
    
  3. Create an std::regex/std::wregex object (depending on the character set that is used) to encapsulate the regular expression:
    auto rx = std::regex{pattern};
    
  4. To search for the first occurrence of a regular expression in a given text, use the general-purpose algorithm std::regex_search() (example 1):
    auto match = std::smatch{};
    if (std::regex_search(text, match, rx))
    {
      std::cout << match[1] << '=' << match[2] << '\n';
    }
    
  5. To find all the occurrences of a regular expression in a given text, use the iterator std::regex_iterator (example 2):
    auto end = std::sregex_iterator{};
    for (auto it=std::sregex_iterator{ std::begin(text),
                                       std::end(text), rx };
         it != end; ++it)
    {
      std::cout << '\'' << (*it)[1] << "'='"
                << (*it)[2] << '\'' << '\n';
    }
    
  6. To iterate through all the subexpressions of a match, use the iterator std::regex_token_iterator (example 3):
    auto end = std::sregex_token_iterator{};
    for (auto it = std::sregex_token_iterator{
                      std::begin(text), std::end(text), rx };
         it != end; ++it)
    {
      std::cout << *it << '\n';
    }
    

How it works...

A simple regular expression that can parse the input file shown earlier may look like this:

^(?!#)(\w+)\s*=\s*([\w\d]+[\w\d._,\-:]*)$

This regular expression is supposed to ignore all lines that start with a #; for those that do not start with #, match a name followed by the equals sign and then a value that can be composed of alphanumeric characters and several other characters (underscore, dot, comma, and so on). The exact meaning of this regular expression is explained as follows:

Part | Description
^ | Start of line.
(?!#) | A negative lookahead that makes sure that the next character is not #.
(\w+) | A capturing group representing an identifier of at least one word character.
\s* | Zero or more whitespace characters.
= | Equals sign.
\s* | Zero or more whitespace characters.
([\w\d]+[\w\d._,\-:]*) | A capturing group representing a value that starts with a word character (alphanumeric or underscore), but can also contain a dot, comma, hyphen, or colon.
$ | End of line.
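To see the regular expression at work on individual lines, the following sketch reads the text line by line and matches each entire line with std::regex_match(). This is a different approach from the search-based examples of this recipe and is chosen here only because it keeps the line anchors simple to reason about. The function name parse_settings and the use of std::map are assumptions of this example, and the lines are expected to have no leading whitespace.

```cpp
#include <map>
#include <regex>
#include <sstream>
#include <string>

// Collects the name-value pairs from a text, one pair per line.
inline std::map<std::string, std::string>
parse_settings(std::string const & text)
{
  static std::regex const rx(
    R"(^(?!#)(\w+)\s*=\s*([\w\d]+[\w\d._,\-:]*)$)");
  std::map<std::string, std::string> settings;
  std::istringstream lines(text);
  std::string line;
  std::smatch match;
  while (std::getline(lines, line))
  {
    // Comment lines and malformed lines do not match
    // and are skipped.
    if (std::regex_match(line, match, rx))
    {
      settings[match[1].str()] = match[2].str();
    }
  }
  return settings;
}
```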

We can use std::regex_search() to search for a match anywhere in the input text. This algorithm has several overloads, but in general, they work in the same way. You must specify the range of characters to work through, an output std::match_results object that will contain the result of the match, and an std::basic_regex object representing the regular expression and matching flags (which define the way the search is done). The function returns true if a match was found or false otherwise.

In the first example from the previous section (see the fourth list item), match is an instance of std::smatch that is a typedef of std::match_results with string::const_iterator as the template type. If a match was found, this object will contain the matching information in a sequence of values for all matched subexpressions. The submatch at index 0 is always the entire match. The submatch at index 1 is the first subexpression that was matched, the submatch at index 2 is the second subexpression that was matched, and so on. Since we have two capturing groups (which are subexpressions) in our regular expression, the std::match_results will have three submatches in the event of success. The identifier representing the name is at index 1, and the value after the equals sign is at index 2. Therefore, this code only prints the following:

Figure 2.4: Output of first example

The std::regex_search() algorithm is not able to iterate through all the possible matches in a piece of text. To do that, we need to use an iterator. std::regex_iterator is intended for this purpose. It allows not only iterating through all the matches, but also accessing all the submatches of a match.

The iterator actually calls std::regex_search() upon construction and on each increment, and it remembers the resulting std::match_results from the call. The default constructor creates an iterator that represents the end of the sequence and can be used to test when the loop through the matches should stop.

In the second example from the previous section (see the fifth list item), we first create an end-of-sequence iterator, and then we start iterating through all the possible matches. When constructed, the iterator calls std::regex_search(), and if a match is found, we can access its results through the current iterator. This will continue until no match is found (the end of the sequence). This code will print the following output:

Figure 2.5: Output of second example
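
The same iteration pattern can be wrapped in a small function. The following is only a sketch: collect_pairs() is a hypothetical name, and the simplified pattern (without the ^ and $ anchors) is an assumption made so that the example does not depend on how line boundaries are handled:

```cpp
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: collects every name=value pair found in a text.
// The simplified pattern has no ^/$ anchors (an assumption for this sketch).
std::map<std::string, std::string> collect_pairs(std::string const& text)
{
   static std::regex const rx(R"((\w+)\s*=\s*(\w+))");

   std::map<std::string, std::string> result;
   auto const end = std::sregex_iterator{};   // end-of-sequence iterator
   for (auto it = std::sregex_iterator{ text.begin(), text.end(), rx };
        it != end; ++it)
   {
      // *it is an std::smatch; index 0 is the whole match,
      // indexes 1 and 2 are the two capturing groups
      result[(*it)[1].str()] = (*it)[2].str();
   }
   return result;
}
```

Note that the std::regex object must outlive the iterator, which is why it is kept in a static variable here rather than created as a temporary.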

An alternative to std::regex_iterator is std::regex_token_iterator. This works similar to the way std::regex_iterator works and, in fact, it contains such an iterator internally, except that it enables us to access a particular subexpression from a match. This is shown in the third example in the How to do it... section (see the sixth list item). We start by creating an end-of-sequence iterator and then loop through the matches until the end-of-sequence is reached. In the constructor we used, we did not specify the index of the subexpression to access through the iterator; therefore, the default value of 0 is used. This means this program will print all the matches:

Figure 2.6: Output of third example

If we wanted to access only the first subexpression (this means the names in our case), all we had to do was specify the index of the subexpression in the constructor of the token iterator, as shown here:

auto end = std::sregex_token_iterator{};
for (auto it = std::sregex_token_iterator{ std::begin(text),
               std::end(text), rx, 1 };
     it != end; ++it)
{
  std::cout << *it << '\n';
}

This time, the output that we get contains only the names. This is shown in the following image:

Figure 2.7: Output containing only the names

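An interesting thing about the token iterator is that it can return the unmatched parts of the string if the index of the subexpression is -1. In this case, each dereferenced value is an std::sub_match object that corresponds to the sequence of characters between the previous match (or the beginning of the sequence) and the current match; after the last match, the iterator returns the characters between that match and the end of the sequence: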

auto end = std::sregex_token_iterator{};
for (auto it = std::sregex_token_iterator{ std::begin(text),
               std::end(text), rx, -1 };
     it != end; ++it)
{
  std::cout << *it << '\n';
}

This program will output the following:

Figure 2.8: Output including empty lines

Please note that the empty lines in the output correspond to empty tokens.
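
A common use of the -1 index is splitting a string around a delimiter pattern. The following sketch illustrates this; the split() function and the delimiter used in the usage note are hypothetical and only serve as an illustration:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the parts of the text that are NOT matched
// by the delimiter expression, using the -1 subexpression index.
std::vector<std::string> split(std::string const& text,
                               std::regex const& delimiter)
{
   std::vector<std::string> tokens;
   auto const end = std::sregex_token_iterator{};
   for (auto it = std::sregex_token_iterator{
           text.begin(), text.end(), delimiter, -1 };
        it != end; ++it)
   {
      tokens.push_back(it->str());   // *it is an std::ssub_match
   }
   return tokens;
}
```

For instance, splitting the text a, b,c with a delimiter expression such as ,\s* produces the three tokens a, b, and c.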

See also

  • Verifying the format of a string using regular expressions to familiarize yourself with the C++ library support for working with regular expressions
  • Replacing the content of a string using regular expressions to learn how to perform multiple matches of a pattern in a text

Replacing the content of a string using regular expressions

In the previous two recipes, we looked at how to match a regular expression on a string or a part of a string and iterate through matches and submatches. The regular expression library also supports text replacement based on regular expressions. In this recipe, we will learn how to use std::regex_replace() to perform such text transformations.

Getting ready

For general information about regular expressions support in C++11, refer to the Verifying the format of a string using regular expressions recipe, earlier in this chapter.

How to do it...

In order to perform text transformations using regular expressions, you should perform the following:

  • Include <regex> and <string> and the namespace std::string_literals for C++14 standard user-defined literals for strings:
    #include <regex>
    #include <string>
    using namespace std::string_literals;
    
  • Use the std::regex_replace() algorithm with a replacement string as the third argument. Consider this example: replace all words composed of exactly three characters that are either a, b, or c with three hyphens:
    auto text{"abc aa bca ca bbbb"s};
    auto rx = std::regex{ R"(\b[abc]{3}\b)"s };
    auto newtext = std::regex_replace(text, rx, "---"s);
    
  • Use the std::regex_replace() algorithm with match identifiers prefixed with a $ for the third argument. For example, replace names in the format "lastname, firstname" with names in the format "firstname lastname", as follows:
    auto text{ "bancila, marius"s };
    auto rx = std::regex{ R"((\w+),\s*(\w+))"s };
    auto newtext = std::regex_replace(text, rx, "$2 $1"s);
    

How it works...

The std::regex_replace() algorithm has several overloads with different types of parameters, but the meaning of the parameters is as follows:

  • The input string on which the replacement is performed.
  • An std::basic_regex object that encapsulates the regular expression used to identify the parts of the strings to be replaced.
  • The string format used for replacement.
  • Optional matching flags.

The return value is, depending on the overload used, either a string or a copy of the output iterator provided as an argument. The string format used for replacement can either be a simple string or a match identifier, indicated with a $ prefix:

  • $& indicates the entire match.
  • $1, $2, $3, and so on indicate the first, second, and third submatches, and so on.
  • $` indicates the part of the string before the first match.
  • $' indicates the part of the string after the last match.
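
As a quick illustration of these identifiers, here is a minimal sketch with two hypothetical helpers (the names are not part of the recipe): one uses $& to keep the matched text and decorate it, and the other uses $1 and $2 to reorder two submatches:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: wraps every run of digits in square brackets.
std::string bracket_numbers(std::string const& text)
{
   static std::regex const rx(R"(\d+)");
   return std::regex_replace(text, rx, "[$&]");   // $& is the entire match
}

// Hypothetical helper: turns "lastname, firstname" into "firstname lastname".
std::string swap_name(std::string const& text)
{
   static std::regex const rx(R"((\w+),\s*(\w+))");
   return std::regex_replace(text, rx, "$2 $1");  // second, then first submatch
}
```

Calling bracket_numbers() on a1 b22 gives a[1] b[22], because every match is replaced by itself surrounded with brackets.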

In the first example shown in the How to do it... section, the initial text contains two words made of exactly three a, b, and c characters, abc and bca. The regular expression indicates a sequence of exactly three of these characters between word boundaries. This means a subtext, such as bbbb, will not match the expression. The result of the replacement is that the string newtext will be --- aa --- ca bbbb.

Additional flags for the match can be specified for the std::regex_replace() algorithm. By default, the matching flag is std::regex_constants::match_default, which, for replacements, means that the format string follows the ECMAScript rules. If we want, for instance, to replace only the first occurrence, then we can specify std::regex_constants::format_first_only. In the following example, the result is --- aa bca ca bbbb as the replacement stops after the first match is found:

auto text{ "abc aa bca ca bbbb"s };
auto rx = std::regex{ R"(\b[abc]{3}\b)"s };
auto newtext = std::regex_replace(text, rx, "---"s,
                 std::regex_constants::format_first_only);

The replacement string, however, can contain special indicators for the whole match, a particular submatch, or the parts that were not matched, as explained earlier. In the second example shown in the How to do it... section, the regular expression identifies a word of at least one character, followed by a comma and possible white spaces, and then another word of at least one character. The first word is supposed to be the last name, while the second word is supposed to be the first name. The replacement string is in the $2 $1 format. This is an instruction that's used to replace the matched expression (in this example, the entire original string) with another string formed of the second submatch, followed by a space and then the first submatch.

In this case, the entire string was a match. In the following example, there will be multiple matches inside the string, and they will all be replaced with the indicated string. In this example, we are replacing the indefinite article a when preceding a word that starts with a vowel (this, of course, does not cover words that start with a vowel sound) with the indefinite article an:

auto text{"this is a example with a error"s};
auto rx = std::regex{R"(\ba ((a|e|i|u|o)\w+))"s};
auto newtext = std::regex_replace(text, rx, "an $1");

The regular expression identifies the letter a as a single word (\b indicates a word boundary, so \ba means a word with a single letter, a), followed by a space and a word of at least two characters starting with a vowel. When such a match is identified, it is replaced with a string formed of the fixed string an, followed by a space and the first subexpression of the match, which is the word itself. In this example, the newtext string will be this is an example with an error.

Apart from the identifiers of the subexpressions ($1, $2, and so on), there are other identifiers for the entire match ($&), the part of the string before the first match ($`), and the part of the string after the last match ($'). In the last example, we change the format of a date from dd.mm.yyyy to yyyy.mm.dd, but also show the matched parts:

auto text{"today is 1.06.2016!!"s};
auto rx =
   std::regex{R"((\d{1,2})(\.|-|/)(\d{1,2})(\.|-|/)(\d{4}))"s};
// today is 2016.06.1!!
auto newtext1 = std::regex_replace(text, rx, R"($5$4$3$2$1)");
// today is [today is ][1.06.2016][!!]!!
auto newtext2 = std::regex_replace(text, rx, R"([$`][$&][$'])");

The regular expression matches a one- or two-digit number followed by a dot, hyphen, or slash; followed by another one- or two-digit number; then a dot, hyphen, or slash; and lastly a four-digit number.

For newtext1, the replacement string is $5$4$3$2$1; this means year, followed by the second separator, then month, the first separator, and finally day. Therefore, for the input string today is 1.06.2016!!, the result is today is 2016.06.1!!.

For newtext2, the replacement string is [$`][$&][$']; this means the part before the first match, followed by the entire match, and finally the part after the last match, are in square brackets. However, the result is not [!!][1.06.2016][today is ] as you perhaps might expect at first glance, but today is [today is ][1.06.2016][!!]!!. The reason for this is that what is replaced is the matched expression, and, in this case, that is only the date (1.06.2016). This substring is replaced with another string formed of all the parts of the initial string.

See also

  • Verifying the format of a string using regular expressions to familiarize yourself with the C++ library support for working with regular expressions
  • Parsing the content of a string using regular expressions to learn how to perform multiple matches of a pattern in a text

Using string_view instead of constant string references

When working with strings, temporary objects are created all the time, even if you might not be really aware of it. Many times, these temporary objects are irrelevant and only serve the purpose of copying data from one place to another (for example, from a function to its caller). This represents a performance issue because they require memory allocation and data copying, which should be avoided. For this purpose, the C++17 standard provides a new string class template called std::basic_string_view that represents a non-owning constant reference to a string (that is, a sequence of characters). In this recipe, you will learn when and how you should use this class.

Getting ready

The string_view class is available in the namespace std in the string_view header.

How to do it...

You should use std::string_view to pass a parameter to a function (or return a value from a function), instead of std::string const &, unless your code needs to call other functions that take std::string parameters (in which case, conversions would be necessary):

std::string_view get_filename(std::string_view str)
{
  auto const pos1 {str.find_last_of('\\')};
  auto const pos2 {str.find_last_of('.')};
  return str.substr(pos1 + 1, pos2 - pos1 - 1);
}
char const file1[] {R"(c:\test\example1.doc)"};
auto name1 = get_filename(file1);
std::string file2 {R"(c:\test\example2)"};
auto name2 = get_filename(file2);
auto name3 = get_filename(std::string_view{file1, 16});

How it works...

Before we look at how the new string type works, let's consider the following example of a function that is supposed to extract the name of a file without its extension. This is basically how you would write the function from the previous section before C++17:

std::string get_filename(std::string const & str)
{
  auto const pos1 {str.find_last_of('\\')};
  auto const pos2 {str.find_last_of('.')};
  return str.substr(pos1 + 1, pos2 - pos1 - 1);
}
auto name1 = get_filename(R"(c:\test\example1.doc)"); // example1
auto name2 = get_filename(R"(c:\test\example2)");     // example2
if(get_filename(R"(c:\test\_sample_.tmp)").front() == '_') {}

Note that in this example, the file separator is \ (backslash), as in Windows. For Linux-based systems, it has to be changed to / (slash).

The get_filename() function is relatively simple. It takes a constant reference to an std::string and identifies a substring bounded by the last file separator and the last dot, which basically represents a filename without an extension (and without folder names).

The problem with this code, however, is that it creates one, two, or possibly even more temporaries, depending on the compiler optimizations. The function parameter is a constant std::string reference, but the function is called with a string literal, which means std::string needs to be constructed from the literal. These temporaries need to allocate and copy data, which is both time- and resource-consuming. In the last example, all we want to do is check whether the first character of the filename is an underscore, but we create at least two temporary string objects for that purpose.

The std::basic_string_view class template is intended to solve this problem. This class template is very similar to std::basic_string, with the two having almost the same interface. The reason for this is that std::basic_string_view is intended to be used instead of a constant reference to an std::basic_string without further code changes. Just like with std::basic_string, there are specializations for all types of standard characters:

typedef basic_string_view<char>     string_view;
typedef basic_string_view<wchar_t>  wstring_view;
typedef basic_string_view<char16_t> u16string_view;
typedef basic_string_view<char32_t> u32string_view;

The std::basic_string_view class template defines a reference to a constant contiguous sequence of characters. As the name implies, it represents a view and cannot be used to modify the referenced sequence of characters. An std::basic_string_view object has a relatively small size because all that it needs is a pointer to the first character in the sequence and the length. It can be constructed not only from an std::basic_string object but also from a pointer and a length, or from a null-terminated sequence of characters (in which case, it will require an initial traversal of the string in order to find the length). Therefore, the std::basic_string_view class template can also be used as a common interface for multiple types of strings (as long as data only needs to be read). On the other hand, an std::basic_string_view cannot be implicitly converted to an std::basic_string.

You must explicitly construct an std::basic_string object from a std::basic_string_view, as shown in the following example:

std::string_view sv{ "demo" };
std::string s{ sv };

Passing std::basic_string_view to functions and returning std::basic_string_view still creates temporaries of this type, but these are small-sized objects on the stack (a pointer and a size could be 16 bytes for 64-bit platforms); therefore, they should incur fewer performance costs than allocating heap space and copying data.
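
To illustrate the idea of a common read-only interface, here is a minimal sketch. The count_char() function is hypothetical; the point is that the same function accepts a string literal, an std::string, or a pointer and a length, without constructing a temporary std::string in any of these cases:

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: counts the occurrences of a character in any
// character sequence that can be viewed as an std::string_view.
std::size_t count_char(std::string_view text, char ch)
{
   std::size_t count = 0;
   for (char const c : text)
   {
      if (c == ch) ++count;
   }
   return count;
}
```

The function can be called as count_char("banana", 'a'), with an std::string argument, or with an explicitly constructed view such as std::string_view{ptr, 3} that covers only part of a buffer.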

Note that all major compilers provide an implementation of std::basic_string, which includes a small string optimization. Although the implementation details are different, they typically rely on having a statically allocated buffer of a number of characters (16 for VC++ and GCC 5 or newer) that does not involve heap operations, which are only required when the size of the string exceeds that number of characters.

In addition to the methods that are identical to those available in std::basic_string, the std::basic_string_view has two more:

  • remove_prefix(): Shrinks the view by moving its start forward by N characters and decreasing the length by N characters.
  • remove_suffix(): Shrinks the view by decreasing the length by N characters.

The two member functions are used in the following example to trim an std::string_view from spaces, both at the beginning and the end. The implementation of the function first looks for the first element that is not a space and then for the last element that is not a space. Then, it removes from the end everything after the last non-space character, and from the beginning everything until the first non-space character. The function returns the new view, trimmed at both ends:

std::string_view trim_view(std::string_view str)
{
  auto const pos1{ str.find_first_not_of(" ") };
  auto const pos2{ str.find_last_not_of(" ") };
  str.remove_suffix(str.length() - pos2 - 1);
  str.remove_prefix(pos1);
  return str;
}
auto sv1{ trim_view("sample") };
auto sv2{ trim_view("  sample") };
auto sv3{ trim_view("sample  ") };
auto sv4{ trim_view("  sample  ") };
std::string s1{ sv1 };
std::string s2{ sv2 };
std::string s3{ sv3 };
std::string s4{ sv4 };

When using std::basic_string_view, you must be aware of two things: you cannot change the underlying data referred to by a view and you must manage the lifetime of the data, as the view is a non-owning reference.
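
One more detail is worth keeping in mind with trim_view(): when the input is empty or contains only spaces, find_first_not_of() returns npos, and calling remove_prefix() with a value larger than the size of the view is undefined behavior. The following variant is a sketch of one possible guard (the name trim_view_safe() is hypothetical):

```cpp
#include <string_view>

// Hypothetical variant of trim_view() that also handles inputs which are
// empty or consist only of spaces.
std::string_view trim_view_safe(std::string_view str)
{
   auto const first = str.find_first_not_of(' ');
   if (first == std::string_view::npos)
      return {};                       // nothing but spaces (or empty)

   auto const last = str.find_last_not_of(' ');
   str.remove_suffix(str.length() - last - 1);
   str.remove_prefix(first);
   return str;
}
```

For all the inputs used in the previous listing, this variant returns the same views as trim_view().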

See also

  • Creating a library of string helpers to see how to create useful text utilities that are not directly available in the standard library

Formatting text with std::format

The C++ language has two ways of formatting text: the printf family of functions and the I/O streams library. The printf functions are inherited from C and provide a separation of the formatting text and the arguments. The streams library provides safety and extensibility and is usually recommended over printf functions, but is, in general, slower. The C++20 standard introduces a new formatting library as an alternative for output formatting, which is similar in form to printf but safe and extensible and is intended to complement the existing streams library. In this recipe, we will learn how to use the new functionalities instead of the printf functions or the streams library.

Getting ready

The new formatting library is available in the header <format>. You must include this header for the following samples to work.

How to do it...

The std::format() function formats its arguments according to the provided formatting string. You can use it as follows:

  • Provide empty replacement fields, represented by {}, in the format string for each argument:
    auto text = std::format("{} is {}", "John", 42);
    
  • Specify the 0-based index of each argument in the argument list inside the replacement field, such as {0}, {1}, and so on. The order of the arguments is not important, but the index must be valid:
    auto text = std::format("{0} is {1}", "John", 42);
    
  • Control the output text with format specifiers provided in the replacement field after a colon (:). For basic and string types, this is a standard format specification. For chrono types, this is a chrono format specification:
    auto text1 = std::format("{0} hex is {0:08X}", 42);
    // chrono types, such as a system_clock time point, can be formatted directly
    auto now = std::chrono::system_clock::now();
    auto text2 = std::format("Today is {:%Y-%m-%d}", now);
    

You can also write the formatted output through an output iterator with either std::format_to() or std::format_to_n(), as follows:

  • Write to a buffer, such as an std::string or std::vector<char>, using std::format_to() and the std::back_inserter() helper function:
    std::vector<char> buf;
    std::format_to(std::back_inserter(buf), "{} is {}", "John", 42);
    
  • Use std::formatted_size() to retrieve the number of characters necessary to store the formatted representation of the arguments:
    auto size = std::formatted_size("{} is {}", "John", 42);
    std::vector<char> buf(size);
    std::format_to(buf.data(), "{} is {}", "John", 42);
    
  • To limit the number of characters written to the output buffer, you can use std::format_to_n(), which is similar to std::format_to() but writes, at most, n characters:
    char buf[100];
    auto result = std::format_to_n(buf, sizeof(buf), "{} is {}", "John", 42);
    

How it works...

The std::format() function has multiple overloads. You can specify the format string either as a string view or a wide string view, with the function returning either an std::string or an std::wstring. You can also specify, as the first argument, an std::locale, which is used for locale-specific formatting. The function overloads are all variadic function templates, which means you can specify any number of arguments after the format.

The format string consists of ordinary characters, replacement fields, and escape sequences. The escape sequences are {{ and }} and are replaced with { and } in the output. A replacement field is provided within curly brackets {}. It can optionally contain a non-negative number, representing the 0-based index of the argument to be formatted, and a colon (:), followed by a format specifier. If the format specifier is invalid, an exception of the type std::format_error is thrown.

In a similar manner, std::format_to() has multiple overloads, just like std::format(). The difference between these two is that std::format_to() always takes an iterator to the output buffer as the first argument and returns an iterator past the end of the output range (and not a string as std::format() does). On the other hand, std::format_to_n() has one more parameter than std::format_to(). Its second parameter is a number representing the maximum number of characters to be written to the buffer.

The following listing shows the signature of the simplest overload of each of these three function templates:

template<class... Args>
std::string format(std::string_view fmt, const Args&... args);
template<class OutputIt, class... Args>
OutputIt format_to(OutputIt out,
                   std::string_view fmt, const Args&... args);
template<class OutputIt, class... Args>
std::format_to_n_result<OutputIt>
format_to_n(OutputIt out, std::iter_difference_t<OutputIt> n,
            std::string_view fmt, const Args&... args);

When you provide the format string, you can supply argument identifiers (their 0-based index) or omit them. However, it is illegal to use both. If the indexes are omitted in the replacement fields, the arguments are processed in the provided order, and the number of replacement fields must not be greater than the number of supplied arguments. If indexes are provided, they must be valid for the format string to be valid.

When a format specification is used, then:

  • For basic types and string types, it is considered to be a standard format specification.
  • For chrono types, it is considered to be a chrono format specification.
  • For user-defined types, it is defined by a user-defined specialization of the std::formatter class for the desired type.

The standard format specification is based on the format specification in Python and has the following syntax:

fill-and-align(optional) sign(optional) #(optional) 0(optional) width(optional) precision(optional) L(optional) type(optional)

These syntax parts are briefly described here.

fill-and-align is an optional fill character, followed by one of the align options:

  • <: Forces the field to be left-aligned with the available space.
  • >: Forces the field to be right-aligned with the available space.
  • ^: Forces the field to be centered with the available space. To do so, for n fill characters in total, it will insert n/2 characters (rounded down) to the left and n/2 characters (rounded up) to the right:
    auto t1 = std::format("{:5}", 42);    // "   42"
    auto t2 = std::format("{:5}", 'x');   // "x    "
    auto t3 = std::format("{:*<5}", 'x'); // "x****"
    auto t4 = std::format("{:*>5}", 'x'); // "****x"
    auto t5 = std::format("{:*^5}", 'x'); // "**x**"
    auto t6 = std::format("{:5}", true);  // "true "
    

sign, #, and 0 are only valid when a number (either an integer or a floating-point) is used. The sign can be one of:

  • +: Specifies that the sign must be used for both negative and positive numbers.
  • -: Specifies that the sign must be used only for negative numbers (which is the implicit behavior).
  • A space: Specifies that the sign must be used for negative numbers and that a leading space must be used for non-negative numbers:
    auto t7 = std::format("{0:},{0:+},{0:-},{0: }", 42);
    // "42,+42,42, 42"
    auto t8 = std::format("{0:},{0:+},{0:-},{0: }", -42);
    // "-42,-42,-42,-42"
    

The symbol # causes the alternate form to be used. This can be one of the following:

  • For integral types, when binary, octal, or hexadecimal representation is specified, the alternate form adds the prefix 0b, 0, or 0x to the output.
  • For floating-point types, the alternate form causes a decimal-point character to always be present in the formatted value, even if no digits follow it. In addition, when g or G are used, the trailing zeros are not removed from the output.

The digit 0 specifies that leading zeros should be outputted to the field width, except when the value of a floating-point type is infinity or NaN. When present alongside an align option, the specifier 0 is ignored:

auto t9  = std::format("{:+05d}", 42); // "+0042"
auto t10 = std::format("{:#05x}", 42); // "0x02a"
auto t11 = std::format("{:<05}", -42); // "-42  "

width specifies the minimum field width and can be either a positive decimal number or a nested replacement field. The precision field indicates the precision for floating-point types or, for string types, how many characters will be used from the string. It is specified with a dot (.), followed by a non-negative decimal number or a nested replacement field.

Locale-specific formatting is specified with the uppercase L and causes the locale-specific form to be used. This option is only available for arithmetic types.

The optional type determines how the data will be presented in the output. The available string presentation types are shown in the following table:

| Type | Presentation type | Description |
| --- | --- | --- |
| Strings | none, s | Copies the string to the output. |
| Integral types | b | Binary format. The prefix used with the alternate form (#) is 0b. |
| | B | Binary format. The prefix used with the alternate form (#) is 0B. |
| | c | Character format. Copies the value to the output as if it were a character type. |
| | none or d | Decimal format. |
| | o | Octal format. The prefix used with the alternate form (#) is 0 (unless the value is 0). |
| | x | Hexadecimal format. The prefix used with the alternate form (#) is 0x. |
| | X | Hexadecimal format with uppercase letters. The prefix used with the alternate form (#) is 0X. |
| char and wchar_t | none or c | Copies the character to the output. |
| | b, B, c, d, o, x, X | Integer presentation types. |
| bool | none or s | Copies true or false as a textual representation (or their locale-specific form) to the output. |
| | b, B, c, d, o, x, X | Integer presentation types. |
| Floating-point | a | Hexadecimal representation. Same as if calling std::to_chars(first, last, value, std::chars_format::hex, precision) or std::to_chars(first, last, value, std::chars_format::hex), depending on whether precision is specified or not. |
| | A | Same as a except that it uses uppercase letters for digits above 9 and uses P to indicate the exponent. |
| | e | Scientific representation. Produces the output as if calling std::to_chars(first, last, value, std::chars_format::scientific, precision). |
| | E | Similar to e except that it uses E to indicate the exponent. |
| | f, F | Fixed representation. Produces the output as if by calling std::to_chars(first, last, value, std::chars_format::fixed, precision). When no precision is specified, the default is 6. |
| | g | General floating-point representation. Produces the output as if by calling std::to_chars(first, last, value, std::chars_format::general, precision). When no precision is specified, the default is 6. |
| | G | Same as g except that it uses E to indicate the exponent. |
| Pointer | none or p | Pointer representation. Produces the output as if by calling std::to_chars(first, last, reinterpret_cast<std::uintptr_t>(value), 16) with the prefix 0x added to the output. This is available only when std::uintptr_t is defined; otherwise, the output is implementation-defined. |

The chrono format specification has the following form:

fill-and-align(optional) width(optional) precision(optional) chrono-spec(optional)

The fill-and-align, width, and precision fields have the same meaning as in the standard format specification, described previously. The precision is only valid for std::chrono::duration types when the representation type is a floating-point type. Using it in other cases throws an std::format_error exception.

The chrono specification can be empty, in which case the argument is formatted as if by streaming it to an std::stringstream and copying the result string. Alternatively, it can consist of a series of conversion specifiers and ordinary characters. Some of these format specifiers are presented in the following table:

| Conversion specifier | Description |
| --- | --- |
| %% | Writes a literal % character. |
| %n | Writes a newline character. |
| %t | Writes a horizontal tab character. |
| %Y | Writes the year as a decimal number. If the result is less than four digits, it is left-padded with 0 to four digits. |
| %m | Writes the month as a decimal number (January is 01). If the result is a single digit, it is prefixed with 0. |
| %d | Writes the day of month as a decimal number. If the result is a single decimal digit, it is prefixed with 0. |
| %w | Writes the weekday as a decimal number (0-6), where Sunday is 0. |
| %D | Equivalent to %m/%d/%y. |
| %F | Equivalent to %Y-%m-%d. |
| %H | Writes the hour (24-hour clock) as a decimal number. If the result is a single digit, it is prefixed with 0. |
| %I | Writes the hour (12-hour clock) as a decimal number. If the result is a single digit, it is prefixed with 0. |
| %M | Writes the minute as a decimal number. If the result is a single digit, it is prefixed with 0. |
| %S | Writes the second as a decimal number. If the number of seconds is less than 10, the result is prefixed with 0. |
| %R | Equivalent to %H:%M. |
| %T | Equivalent to %H:%M:%S. |
| %X | Writes the locale's time representation. |

The complete list of format specifiers for the chrono library can be consulted at https://en.cppreference.com/w/cpp/chrono/system_clock/formatter.

See also

  • Using std::format with user-defined types to learn how to create custom formatting specialization for user-defined types
  • Converting between numeric and string types to learn how to convert between numbers and strings

Using std::format with user-defined types

The C++20 formatting library is a modern alternative to using printf-like functions or the I/O streams library, which it actually complements. Although the standard provides default formatting for basic types, such as integral and floating-point types, bool, character types, strings, and chrono types, the user can create custom specialization for user-defined types. In this recipe, we will learn how to do that.

Getting ready

You should read the previous recipe, Formatting text with std::format, to familiarize yourself with the formatting library.

In the examples that we'll be showing here, we will use the following class:

struct employee
{
   int         id;
   std::string firstName;
   std::string lastName;
};

In the next section, we'll introduce the steps required to enable text formatting with std::format() for user-defined types.

How to do it...

To enable formatting using the new formatting library for user-defined types, you must do the following:

  • Define a specialization of the std::formatter<T, CharT> class in the std namespace.
  • Implement the parse() method to parse the portion of the format string corresponding to the current argument. If the class inherits from another formatter, then this method can be omitted.
  • Implement the format() method to format the argument and write the output via format_context.

For the employee class listed here, a formatter that formats employee to the form [42] John Doe (that is [id] firstName lastName) can be implemented as follows:

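template <>
struct std::formatter<employee>
{
   // No format specifiers are supported: the specification is left unparsed.
   constexpr auto parse(std::format_parse_context& ctx)
   {
      return ctx.begin();
   }
   // format() is const so that it can be called on a const formatter.
   auto format(employee const & e, std::format_context& ctx) const
   {
      return std::format_to(ctx.out(),
                            "[{}] {} {}",
                            e.id, e.firstName, e.lastName);
   }
};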

How it works...

The formatting library uses the std::formatter<T, CharT> class template to define formatting rules for a given type. Built-in types, string types, and chrono types have formatters provided by the library. These are implemented as specializations of the std::formatter<T, CharT> class template.

This class has two methods:

  • parse(), which takes a single argument of the type std::basic_format_parse_context<CharT> and parses the format's specification for the type T, provided by the parse context. The result of the parsing is supposed to be stored in member fields of the class. If the parsing succeeds, this function should return a value of the type std::basic_format_parse_context<CharT>::iterator, which represents the end of the format specification. If the parsing fails, the function should throw an exception of the type std::format_error to provide details about the error.
  • format(), which takes two arguments, the first being the object of the type T to format and the second being a formatting context object of the type std::basic_format_context<OutputIt, CharT>. This function should write the output to ctx.out() according to the desired specifiers (which could be something implicit or the result of parsing the format specification). The function must return a value of the type std::basic_format_context<OutputIt, CharT>::iterator, representing the end of the output.

In the implementation shown here, the parse() function does nothing other than return an iterator representing the beginning of the format specification. The formatting is always done by printing the employee identifier between square brackets, followed by the first name and the last name, such as in [42] John Doe. An attempt to use a format specifier results in an error: a std::format_error exception at runtime or, on implementations that validate format strings at compile time, a compilation error:

employee e{ 42, "John", "Doe" };
auto s1 = std::format("{}", e);   // [42] John Doe
auto s2 = std::format("{:L}", e); // error

If you want your user-defined types to support format specifiers, then you must properly implement the parse() method. To show how this can be done, we will support the L specifier for the employee class. When this specifier is used, the employee is formatted with the identifier in square brackets, followed by the last name, a comma, and then the first name, such as in [42] Doe, John:

template<>
struct std::formatter<employee>
{
   bool lexicographic_order = false;
   template <typename ParseContext>
   constexpr auto parse(ParseContext& ctx)
   {
      // The parse context starts right after the ':' (if any),
      // so there is no colon to skip here.
      auto iter = ctx.begin();
      auto get_char = [&]() { return iter != ctx.end() ? *iter : 0; };
      switch (get_char())
      {
      case 0:                 // empty specification
      case '}': return iter;  // must return an iterator to '}' (or the end)
      case 'L': lexicographic_order = true; return ++iter;
      default: throw std::format_error("invalid format");
      }
   }
   template <typename FormatContext>
   auto format(employee const& e, FormatContext& ctx) const
   {
      if(lexicographic_order)
         return std::format_to(ctx.out(), "[{}] {}, {}",
                               e.id, e.lastName, e.firstName);
      return std::format_to(ctx.out(), "[{}] {} {}",
                            e.id, e.firstName, e.lastName);
   }
};

With this defined, the preceding sample code would work. However, using other format specifiers, such as A, for example, would still throw an exception:

auto s1 = std::format("{}", e);   // [42] John Doe
auto s2 = std::format("{:L}", e); // [42] Doe, John
auto s3 = std::format("{:A}", e); // error (invalid format)

If you do not need to parse the format specifier in order to support various options, you could entirely omit the parse() method. However, in order to do so, your std::formatter specialization must derive from another std::formatter class. An implementation is shown here:

template<>
struct std::formatter<employee> : std::formatter<char const*>
{
   // parse() is inherited from std::formatter<char const*>
   template <typename FormatContext>
   auto format(employee const& e, FormatContext& ctx) const
   {
      return std::format_to(ctx.out(), "[{}] {} {}",
                            e.id, e.firstName, e.lastName);
   }
};

This specialization for the employee class is equivalent to the first implementation shown earlier in the How to do it... section.

See also

  • Formatting text with std::format to get a good introduction to the new C++20 text formatting library

Key benefits

  • Explore the latest language and library features of C++20 such as modules, coroutines, concepts, and ranges
  • Shed new light on the core concepts in C++ programming, including functions, algorithms, threading, and concurrency, through practical self-contained recipes
  • Leverage C++ features like smart pointers, move semantics, constexpr, and more for increased robustness and performance

Description

C++ has come a long way and is now one of the most widely used general-purpose languages, with speed and efficiency at its core. The updated second edition of Modern C++ Programming Cookbook addresses the latest features of C++20, such as modules, concepts, coroutines, and the many additions to the standard library, including ranges and text formatting. The book is organized in the form of practical recipes covering a wide range of problems faced by modern developers. The book also delves into the details of all the core concepts in modern C++ programming, such as functions and classes, iterators and algorithms, streams and the file system, threading and concurrency, smart pointers and move semantics, and many others. It goes into the performance aspects of programming in depth, teaching developers how to write fast and lean code with the help of best practices. Furthermore, the book explores useful patterns and delves into the implementation of many idioms, including pimpl, named parameter, and attorney-client, teaching techniques such as avoiding repetition with the factory pattern. There is also a chapter dedicated to unit testing, where you are introduced to three of the most widely used libraries for C++: Boost.Test, Google Test, and Catch2. By the end of the book, you will be able to effectively leverage the features and techniques of C++11/14/17/20 programming to enhance the performance, scalability, and efficiency of your applications.

What you will learn

  • Understand the new C++20 language and library features and the problems they solve
  • Become skilled at using the standard support for threading and concurrency for daily tasks
  • Leverage the standard library and work with containers, algorithms, and iterators
  • Solve text searching and replacement problems using regular expressions
  • Work with different types of strings and learn the various aspects of compilation
  • Take advantage of the file system library to work with files and directories
  • Implement various useful patterns and idioms
  • Explore the widely used testing frameworks for C++

Product Details

Publication date : Sep 11, 2020
Edition : 2nd Edition
Language : English
ISBN-13 : 9781800208988