You're reading from Protocol Buffers Handbook

Product type: Book
Published in: Apr 2024
Publisher: Packt
ISBN-13: 9781805124672
Edition: 1st Edition
Author (1)
Clément Jean

Clément Jean is the CTO of Education for Ethiopia, a start-up focusing on educating K-12 students in Ethiopia. On top of that, he is also an online instructor (on Udemy, Linux Foundation, and others) teaching people about different kinds of technologies. In both his occupations, he deals with technologies such as Protobuf and gRPC and how to apply them to real-life use cases. His overall goal is to empower people through education and technology.

Serialization Internals

Now that we know how to describe data in Protobuf text format and encode it into binary, we have all the tools we need to learn about the serialization internals. These internals are important to learn because there are a lot of trade-offs between the different types and we need to be aware of them to define efficient schemas.

In this chapter, we’re going to cover the following main topics:

  • Variable-length integers
  • ZigZag encoding
  • Fixed-size integers
  • How to choose between integer types
  • Length-delimited encoding
  • Packed versus unpacked repeated fields
  • Maps

By the end of this chapter, you will know how Protobuf encodes data to binary and decodes it back, and you will be able to read the binary output by yourself.

Technical requirements

All the code examples that you will see in this chapter can be found in the chapter5 directory in this book’s GitHub repository (https://github.com/PacktPublishing/Protocol-Buffers-Handbook).

Variable-length integers (varints)

As you are now aware, the payloads that are created by Protobuf are significantly smaller than those of other popular data formats. One of the biggest factors behind such small payloads is the use of variable-length integers (varints). Now, let’s not get ahead of ourselves. Before explaining how all of this works in Protobuf itself, let’s understand the idea of varints; then, we will see where they’re used in Protobuf.

As its name suggests, a varint encodes an integer into a number of bytes that varies with the value. What is not clear from the name is how the length of the encoding is decided. So, we are going to use an example to understand how that works.

First, we can see the result of encoding by using the skills we learned in previous chapters. We can write the following proto file (varint/encoding.proto):

syntax = "proto3";
message Encoding {
  int32 i32 = 1;
}

Then, we can define the data in a txtpb file...
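To make the idea concrete, here is a minimal Python sketch of base-128 varint encoding as it is commonly described for Protobuf: each byte carries 7 bits of the value, least significant group first, and the high bit signals that more bytes follow. The function names encode_varint and encode_int32 are hypothetical helpers written for this illustration; they are not part of protoc or any Protobuf library.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the 7 least significant bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(low7)         # last byte: continuation bit stays clear
            return bytes(out)


def encode_int32(value: int) -> bytes:
    """Encode an int32 value; negatives are sign-extended to 64 bits first."""
    return encode_varint(value & 0xFFFFFFFFFFFFFFFF)


print(encode_varint(1).hex())    # 01
print(encode_varint(128).hex())  # 8001
print(len(encode_int32(-1)))     # 10
```

Small values fit in a single byte, 128 needs two, and a negative int32 takes the full 10 bytes because its sign-extended form uses all 64 bits.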

ZigZag encoding

As we saw in the previous section, int32 and int64 are not efficient at storing negative numbers: a negative value is always encoded as a 10-byte varint. To solve this specific use case of negative numbers, Protobuf introduces two other types: sint32 and sint64. The “s” stands for signed, and these types are designed to handle negative numbers efficiently.

The reason why they handle negative numbers more efficiently is that they add an extra step on top of varint encoding. This extra step, called ZigZag encoding, consists of turning all negative numbers into positive ones, and because varint encoding is very good at encoding small positive numbers, this solves the problem.

Now, as usual, let’s see an example of ZigZag encoding. Let’s take our cherished 128. We have the following binary:

00000000 10000000

Now, let’s left shift by one:

00000001 00000000

We will then take the original binary and apply a right shift of 31 in the case of sint32 and 63 in the case of sint64...
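The two shifts described here can be sketched in Python as follows. This is a minimal illustration for the 32-bit case, assuming the commonly documented formula (n << 1) ^ (n >> 31); the names zigzag32 and unzigzag32 are hypothetical helpers, not library functions. Note that the right shift must be arithmetic (sign-preserving), which is how Python's >> behaves on negative integers.

```python
def zigzag32(n: int) -> int:
    """Map a signed 32-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    # n >> 31 is 0 for non-negative n and -1 (all ones) for negative n,
    # so the XOR flips every bit of the shifted value only for negatives.
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def unzigzag32(z: int) -> int:
    """Reverse the mapping: even values are non-negative, odd values are negative."""
    return (z >> 1) ^ -(z & 1)


print(zigzag32(128))    # 256
print(zigzag32(-128))   # 255
print(zigzag32(-1))     # 1
print(unzigzag32(255))  # -128
```

Numbers with a small magnitude map to small unsigned values, whatever their sign, so the varint step that follows stays short. The flip side is that positive numbers are doubled, which can cost an extra byte compared with a plain varint.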

Fixed-size integers

The last kind of integer type we’ll look at is the fixed-size integer type. These are encoded pretty much the way integers and floating-point numbers are stored in your computer’s memory. In this case, the 32 and 64 suffixes of the type names correspond to the number of bits the value will be encoded in.

The types that use fixed-size encoding are fixed32, fixed64, sfixed32, sfixed64, float, and double. The main thing to talk about here is the difference between sfixed and fixed. The former is signed, meaning that it can contain positive and negative numbers. The latter is unsigned, which means it can only contain zero and positive numbers.

Let’s look at an example, just to ensure that we’re on the same page about encoding fixed-size numbers. If we have the following message (fixed/encoding.proto):

syntax = "proto3";
message Encoding {
  fixed32 f32 = 1;
}

We set f32 to 128 (fixed32.txtpb):

f32: 128

We run...
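For comparison with varints, the following Python sketch builds fixed-size values by hand with the standard struct module. It assumes the commonly documented layout: values are written in little-endian byte order, and a 32-bit fixed field uses wire type 5 in its key byte. The helper name encode_fixed32_field is hypothetical.

```python
import struct


def encode_fixed32_field(field_number: int, value: int) -> bytes:
    """Key byte (field number and wire type 5) followed by 4 little-endian bytes."""
    key = (field_number << 3) | 5      # assumes field_number <= 15, so the key fits in one byte
    return bytes([key]) + struct.pack("<I", value)


print(struct.pack("<I", 128).hex())          # 80000000
print(struct.pack("<i", -1).hex())           # ffffffff
print(encode_fixed32_field(1, 128).hex())    # 0d80000000
```

Whatever the value, the payload part is always 4 bytes for the 32-bit types (and 8 bytes for the 64-bit ones), which is the property that distinguishes them from varint-encoded types.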

How to choose between integer types

Now that we know the two major algorithms behind integer encoding, we can reflect a little bit on how to choose between them. We will cover the three considerations that we need to think about when we decide between the integer types: number range, sign, and data distribution.

Number range

As we saw, Protobuf’s 32 and 64 suffixes on integer type names do not always represent the number of bits it takes to encode a value. We saw that it is better to think about them as the range of values that can be encoded.

This means that, when choosing an integer type, we need to be aware of the range of values needed for a specific use case. Let’s consider three examples:

  • Number of employees in a company
  • Request per second metric
  • Non-reusable IDs

For the first one, we can assume that our company will have fewer than 2 billion employees. The biggest companies in terms of employees, at the time of writing this book,...
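When weighing range, sign, and data distribution, it can help to compare how many bytes a few representative values would take under each encoding. The sketch below is a rough Python estimate of the value size only (the field key is not counted); the helper names are hypothetical, and the sample values are arbitrary illustrations rather than data from this chapter.

```python
def varint_size(unsigned_value: int) -> int:
    """Number of bytes a non-negative integer takes as a varint."""
    size = 1
    while unsigned_value >= 0x80:
        unsigned_value >>= 7
        size += 1
    return size


def int32_size(n: int) -> int:
    # Negative values are sign-extended to 64 bits before varint encoding.
    return varint_size(n & 0xFFFFFFFFFFFFFFFF)


def sint32_size(n: int) -> int:
    # ZigZag first, then varint.
    return varint_size(((n << 1) ^ (n >> 31)) & 0xFFFFFFFF)


FIXED32_SIZE = 4  # always 4 bytes, whatever the value

for n in (1, 64, -1, 2**28, 2**31 - 1):
    print(n, int32_size(n), sint32_size(n), FIXED32_SIZE)
```

Running it on your own expected values gives a quick feel for the trade-off: int32 is the smallest for small positive numbers, sint32 wins as soon as negatives are common, and fixed32 becomes competitive once most values need 5 varint bytes.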

Field metadata

So far, we haven’t talked too much about field tags. In this section, we’ll dive into how they are encoded and why they are encoded as such.

First, let’s get a small refresher on what field tags are. They are identifiers for fields that will help Protobuf know into which field to deserialize some data. So, let’s say we have the following field:

uint64 id = 1;

When Protobuf decodes some data that carries a tag of 1, it knows that this data is meant to be deserialized into the id field. All of this is an abstract explanation of what’s happening, so let’s understand concretely how the field for deserialization is selected.

First, we need to understand that Protobuf only serializes a combination of type, tag, and value. The name of a field is not serialized. We already know how integer values get serialized; later, we will see how it works for other types (string, repeated, and so on). For now, we can focus on...
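As a rough illustration of how type and tag can travel together, the Python sketch below assumes the commonly documented layout of a field key: the tag shifted left by 3 bits, with the wire type stored in the 3 lowest bits, and the resulting number itself written as a varint. The constant and function names are hypothetical and chosen for this example only.

```python
# Wire type numbers as commonly documented for the Protobuf encoding.
WIRE_VARINT, WIRE_I64, WIRE_LEN, WIRE_I32 = 0, 1, 2, 5


def make_key(field_number: int, wire_type: int) -> int:
    """Combine a field tag and a wire type into a single key value."""
    return (field_number << 3) | wire_type


def split_key(key: int) -> tuple[int, int]:
    """Recover (field_number, wire_type) from a decoded key value."""
    return key >> 3, key & 0x07


print(hex(make_key(1, WIRE_VARINT)))  # 0x8  -> uint64 id = 1
print(split_key(0x0A))                # (1, 2)
print(make_key(16, WIRE_VARINT))      # 128
```

Because the key is a varint, tags 1 through 15 fit in a single byte, while tag 16 already produces 128, which needs two bytes. This is one reason to reserve the lowest tags for the most frequently populated fields.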

Length-delimited encoding

So far, we’ve seen how to encode values that have static sizes. For example, when dealing with the encoding of an int32, Protobuf deals with 4 bytes and turns them into a variable number of bytes. The same is true with other number types. In this section, we are going to learn how to encode a value that has a dynamic size. In other words, a size that can only be determined at runtime.

The types with such a dynamic size are strings and bytes. However, some other parts of Protobuf are encoded with length-delimited encoding: embedded messages and packed repeated fields. We are going to talk about the latter in the next section, but we are going to see strings and embedded messages here.

Let’s look at an example of encoding strings in Protobuf. Once again, we are going to create a message (length_delimited/encoding.proto):

syntax = "proto3";
message Encoding {
  string s = 1;
}

We’re also going to describe the...
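The general shape of a length-delimited field can be sketched in Python as a key, then the payload length as a varint, then the raw bytes. This is a minimal illustration, assuming wire type 2 for length-delimited values; the helper names and the sample string "hello" are hypothetical and are not the values used in this chapter's files.

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint for non-negative integers."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # 7 bits of payload plus continuation bit
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    payload = text.encode("utf-8")         # the length counts bytes, not characters
    key = (field_number << 3) | 2          # wire type 2: length-delimited
    return encode_varint(key) + encode_varint(len(payload)) + payload


print(encode_string_field(1, "hello").hex())  # 0a0568656c6c6f
```

The decoder reads the key, then the length, and then knows exactly how many bytes belong to the value. The same framing applies to bytes fields and embedded messages, with the serialized message taking the place of the UTF-8 payload.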

Packed versus unpacked repeated fields

One last important concept to know is that of packed and unpacked repeated fields. As we know, repeated is the way we describe lists in Protobuf. A repeated modifier can be applied to a scalar numeric type (int32, uint64, and so on) but can also be applied to more complex types (user-defined types, strings, and so on). The former will be encoded as a packed repeated field, and the latter will be unpacked.

Before going into more detail, let’s visualize the difference between both encodings. Let’s start with a packed repeated field. We will have a list of integers (repeated/encoding.proto):

syntax = "proto3";
message Encoding {
  repeated uint64 us = 1;
}

We can now set some values for it by describing the data in text format (repeated/packed.txtpb):

us: [1, 2, 3, 4, 5]

Now, let’s run the following command:

$ cat packed.txtpb | protoc --encode=Encoding encoding.proto | hexdump...
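To see where the difference in size comes from, here is a small Python sketch that builds both layouts by hand for the list [1, 2, 3, 4, 5] on field 1. It assumes the commonly documented encodings: a packed field is one length-delimited record holding all the varints back to back, while an unpacked field repeats the key before every element. The function names are hypothetical helpers.

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint for non-negative integers."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_packed_uint64(field_number: int, values: list[int]) -> bytes:
    body = b"".join(encode_varint(v) for v in values)
    key = encode_varint((field_number << 3) | 2)   # one key, wire type 2
    return key + encode_varint(len(body)) + body


def encode_unpacked_uint64(field_number: int, values: list[int]) -> bytes:
    key = encode_varint((field_number << 3) | 0)   # key repeated per element, wire type 0
    return b"".join(key + encode_varint(v) for v in values)


values = [1, 2, 3, 4, 5]
print(encode_packed_uint64(1, values).hex())    # 0a050102030405
print(encode_unpacked_uint64(1, values).hex())  # 08010802080308040805
```

In this sketch, the packed form takes 7 bytes and the unpacked form takes 10. The gap grows with the number of elements, since the unpacked layout pays for the key once per element.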

Maps

Finally, we can talk about how maps are encoded in Protobuf. In Chapter 3 on Protobuf text format, I briefly mentioned that a map is a list of objects that contains the key and value fields. In this section, we are going to dive deeper into this and see how maps are encoded.

First, let’s not take for granted that a map is a list of objects. Let’s investigate that. We can define a message containing a map field (map/encoding.proto):

syntax = "proto3";
message Encoding {
  map<string, int32> m = 1;
}

Now, to see how this translates internally, we can turn that proto file into a descriptor file. Protoc has a flag called --descriptor_set_out for doing that. Let’s create a descriptor file called encoding.desc:

$ protoc --descriptor_set_out=encoding.desc encoding.proto

This file contains a binary-encoded FileDescriptorSet, which is a message defined in the descriptor.proto file provided with protoc. Now, we can decode this descriptor...
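The "list of objects" view of a map can also be sketched on the wire. The Python example below assumes the commonly documented equivalence: each map entry is encoded like an embedded message with the key in field 1 and the value in field 2, and the map field itself is a repeated, length-delimited field of such entries. The helper names and the sample entries are hypothetical, and the sketch only handles non-negative int32 values.

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint for non-negative integers."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_len(field_number: int, payload: bytes) -> bytes:
    """Length-delimited field: key (wire type 2), length, payload."""
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


def encode_map_entry(key: str, value: int) -> bytes:
    key_part = encode_len(1, key.encode("utf-8"))                    # key lives in field 1
    value_part = encode_varint((2 << 3) | 0) + encode_varint(value)  # value lives in field 2
    return key_part + value_part


def encode_map_field(field_number: int, mapping: dict[str, int]) -> bytes:
    # One length-delimited entry per key/value pair, like an unpacked repeated field.
    return b"".join(encode_len(field_number, encode_map_entry(k, v)) for k, v in mapping.items())


print(encode_map_field(1, {"a": 1}).hex())  # 0a050a01611001
```

Each entry carries its own outer key and length plus the inner keys for the key and value fields, so a map has the same per-element overhead as a repeated field of small messages.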

Summary

In this chapter, we learned about all the internal serialization/deserialization algorithms. We saw that there are multiple ways to encode integers, which is why Protobuf has so many integer types. After that, we covered length-delimited encoding and how it relates to types such as strings, packed repeated fields, and embedded messages. Finally, we talked about unpacked repeated fields and their overhead.

In the next chapter, we will talk about schema evolution over time. We will lean on all the knowledge that we have right now to understand the problems that we might have when updating schemas and how we can overcome them.

Quiz

Answer the following questions to test your knowledge of this chapter:

  1. Which encoding algorithm outputs a variable number of bytes depending on the value encoded?
    A. ZigZag
    B. Varint
    C. Length-delimited
  2. Which encoding algorithm turns negative numbers into positive ones?
    A. Length-delimited
    B. Varint
    C. ZigZag
  3. What might be a problem with using varint?
    A. It can use more bytes than the original 32- and 64-bit integers
    B. It will encode negative numbers into 10 bytes
    C. All the above
  4. What might be a problem with using ZigZag?
    A. It is less efficient at encoding positive numbers than varint
    B. It will encode negative numbers into 10 bytes
    C. It can use more bytes than the original 32- and 64-bit integers
  5. When should you consider using fixed-size integers?
    A. Never, always prefer using varints
    B. When dealing with larger numbers that would be encoded in more than 4 or 8 bytes
    C. When dealing with negative numbers
  6. What is the difference between unpacked and packed repeated fields?
    A. Unpacked has overhead in terms of...

Answers

Here are the answers to this chapter’s questions:

  1. B
  2. C
  3. C
  4. A
  5. B
  6. A