You're reading from Protocol Buffers Handbook

Product type: Book

Published in: Apr 2024

Publisher: Packt

ISBN-13: 9781805124672

Edition: 1st Edition

Concepts

Application Development

Author (1)

Clément Jean

Serialization Internals

Now that we know how to describe data in Protobuf text format and encode it into binary, we have all the tools we need to learn about the serialization internals. These internals are important to learn because there are a lot of trade-offs between the different types and we need to be aware of them to define efficient schemas.

In this chapter, we’re going to cover the following main topics:

Variable-length integers
ZigZag encoding
Fixed-size integers
How to choose between integer types
Length-delimited encoding
Packed versus unpacked repeated fields
Maps

By the end of this chapter, you will know how Protobuf encodes data to binary and decodes it back, and you will be able to read the output binary by yourself.

Technical requirements

All the code examples that you will see in this chapter can be found in the chapter5 directory in this book’s GitHub repository (https://github.com/PacktPublishing/Protocol-Buffers-Handbook).

Variable-length integers (varints)

As you are now aware, the payloads created by Protobuf are significantly smaller than those of other popular data formats. One of the biggest factors behind such small payloads is the use of variable-length integers (varints). Now, let’s not get ahead of ourselves. Before explaining how all of this works in Protobuf itself, let’s understand the idea of varints; then, we will see where they’re used in Protobuf.

As the name suggests, a varint encodes an integer into a number of bytes that varies with its value. What is not clear from the name is how the length of the encoding is decided. So, we are going to use an example to understand how that works.

First, we can see the result of encoding by using the skills we learned in previous chapters. We can write the following proto file (varint/encoding.proto):

syntax = "proto3";
message Encoding {
  int32 i32 = 1;
}

Then, we can define the data in a txtpb file...
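To make the idea of varints concrete, here is a minimal Python sketch of base-128 varint encoding. It is not taken from the book’s repository; the function names encode_varint and encode_int32 are hypothetical and only serve the illustration. Each output byte carries 7 bits of the value, and the most significant bit of a byte signals whether more bytes follow.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("varints encode non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the 7 least significant bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(low7)         # last byte: continuation bit stays clear
            return bytes(out)


def encode_int32(value: int) -> bytes:
    """Encode an int32 value the way the varint wire format sees it.

    Negative values are reinterpreted as 64-bit two's complement,
    which is why they always take 10 bytes.
    """
    return encode_varint(value & 0xFFFFFFFFFFFFFFFF)


print(encode_varint(1).hex())    # 01
print(encode_varint(128).hex())  # 8001
print(len(encode_int32(-1)))     # 10
```

Note that this sketch only covers the value itself; the field tag that precedes it in a real payload is left out.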

ZigZag encoding

As we saw in the previous section, int32 and int64 are not efficient at storing negative numbers: a negative value is always encoded into 10 bytes. To solve this specific use case, Protobuf introduces two other types: sint32 and sint64. The “s” stands for signed, and these types handle negative numbers more efficiently.

They do so by adding an extra step on top of varint encoding. This extra step, called ZigZag encoding, consists of turning all negative numbers into positive ones, and because varint encoding is very good at encoding small positive numbers, this solves the problem.

Now, as usual, let’s see an example of ZigZag encoding. Let’s take our cherished 128. We have the following binary:

00000000 10000000

Now, let’s left shift by one:

00000001 00000000

We will then take the original binary and apply an arithmetic right shift of 31 in the case of sint32 and 63 in the case of sint64...
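The shift-and-combine steps can be sketched in Python as follows. This is an illustration under assumptions, not the book’s code: the names zigzag32 and unzigzag32 are hypothetical, and the sketch relies on the commonly documented formula (n << 1) ^ (n >> 31) for 32-bit values.

```python
def zigzag32(n: int) -> int:
    """Map a signed 32-bit integer to an unsigned one (ZigZag)."""
    if not -2**31 <= n < 2**31:
        raise ValueError("value does not fit in 32 bits")
    # Python's >> on a negative int is an arithmetic shift, so n >> 31
    # is 0 for non-negative numbers and -1 (all ones) for negative ones.
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def unzigzag32(z: int) -> int:
    """Inverse of zigzag32."""
    return (z >> 1) ^ -(z & 1)


print(zigzag32(128))   # 256
print(zigzag32(-128))  # 255
print(zigzag32(-1))    # 1
```

Small negative numbers become small positive numbers, so the varint step that follows needs only a byte or two instead of 10.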

Fixed-size integers

The last kind of integer type we’ll look at is the fixed-size integer type. These are encoded pretty much the way integers and floating-point numbers are laid out in your computer’s memory. In this case, the 32 and 64 suffixes of the type names correspond to the number of bits the value will be encoded in.

The types that use fixed-size encoding are fixed32, fixed64, sfixed32, sfixed64, float, and double. The main thing to talk about here is the difference between sfixed and fixed. The former is signed, meaning that it can contain positive and negative numbers. The latter is unsigned, which means it can only contain zero and positive numbers.

Let’s look at an example, just to ensure that we’re on the same page about encoding fixed-size numbers. If we have the following message (fixed/encoding.proto):

syntax = "proto3";
message Encoding {
  fixed32 f32 = 1;
}

We set f32 to 128 (fixed32.txtpb):

f32: 128

We run...
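As a rough way to preview what fixed-size encoding produces for the value alone, we can use Python’s standard struct module. This is only a sketch: it shows the value bytes without the field tag, and it assumes the little-endian byte order that the Protobuf wire format uses for fixed-size types.

```python
import struct

# "<" selects little-endian; I/i are unsigned/signed 32-bit,
# Q is unsigned 64-bit, and f is a 32-bit float.
fixed32_bytes = struct.pack("<I", 128)
sfixed32_bytes = struct.pack("<i", -1)
fixed64_bytes = struct.pack("<Q", 128)
float_bytes = struct.pack("<f", 1.0)

print(fixed32_bytes.hex())   # 80000000
print(sfixed32_bytes.hex())  # ffffffff
print(fixed64_bytes.hex())   # 8000000000000000
print(float_bytes.hex())     # 0000803f
```

Whatever the value, the 32-bit types always take 4 bytes and the 64-bit types always take 8.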

How to choose between integer types

Now that we know the two major algorithms behind integer encoding, we can reflect a little bit on how to choose between them. We will cover the three considerations that we need to think about when we decide between the integer types: number range, sign, and data distribution.

Number range

As we saw, Protobuf’s 32 and 64 suffixes on integer type names do not always represent the number of bits it takes to encode a value. We saw that it is better to think about them as the range of values that can be encoded.

This means that, when choosing an integer type, we need to be aware of the range of values needed for a specific use case. Let’s consider three examples:

Number of employees in a company
Requests-per-second metric
Non-reusable IDs

For the first one, we can assume that our company will have fewer than 2 billion employees. The biggest companies in terms of employees, at the time of writing this book,...
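When weighing range and data distribution, it can help to see how many bytes a varint needs for a given value and compare that with the constant cost of a fixed-size type. The following Python sketch does that; varint_size and the two constants are hypothetical names introduced for this illustration.

```python
def varint_size(value: int) -> int:
    """Number of bytes a non-negative integer occupies as a varint."""
    if value < 0:
        raise ValueError("expected a non-negative integer")
    # Each varint byte carries 7 bits of payload; zero still takes one byte.
    return max(1, (value.bit_length() + 6) // 7)


FIXED32_SIZE = 4  # fixed32, sfixed32, and float always take 4 bytes
FIXED64_SIZE = 8  # fixed64, sfixed64, and double always take 8 bytes

for value in (1, 127, 128, 2**28 - 1, 2**28, 2**32 - 1):
    print(f"{value}: varint={varint_size(value)} bytes, fixed32={FIXED32_SIZE} bytes")
```

Under these assumptions, a varint is at least as compact as fixed32 up to 2^28 - 1 and one byte larger from 2^28 onward, which is the kind of threshold to keep in mind when most of the data is expected to be large.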

Field metadata

So far, we haven’t talked too much about field tags. In this section, we’ll dive into how they are encoded and why they are encoded as such.

First, let’s get a small refresher on what field tags are. They are identifiers for fields that will help Protobuf know into which field to deserialize some data. So, let’s say we have the following field:

uint64 id = 1;

When Protobuf decodes data that carries an ID (tag) of 1, it knows that this data is meant to be deserialized into the id field. This is an abstract explanation of what’s happening, so let’s understand concretely how the field for deserialization is selected.

First, we need to understand that Protobuf only serializes a combination of type, tag, and value. The name of a field is not serialized. We already know how integer values get serialized; later, we will see how it works for other types (string, repeated, and so on). For now, we can focus on...
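As a sketch of how the type and tag are commonly combined on the wire, the Python snippet below packs a field number and a wire type into a single key, which is then itself written as a varint. The names make_tag and split_tag are hypothetical; the wire type numbers are the ones documented for the Protobuf wire format.

```python
WIRE_VARINT = 0  # int32, int64, uint32, uint64, sint32, sint64, bool, enum
WIRE_I64 = 1     # fixed64, sfixed64, double
WIRE_LEN = 2     # string, bytes, embedded messages, packed repeated fields
WIRE_I32 = 5     # fixed32, sfixed32, float


def make_tag(field_number: int, wire_type: int) -> int:
    """Combine a field number and a wire type into one key."""
    return (field_number << 3) | wire_type


def split_tag(key: int) -> tuple[int, int]:
    """Recover (field_number, wire_type) from a decoded key."""
    return key >> 3, key & 0x07


print(hex(make_tag(1, WIRE_VARINT)))  # 0x8, the key for `uint64 id = 1;`
print(split_tag(0x0A))                # (1, 2)
print(make_tag(15, WIRE_VARINT))      # 120, still fits in one varint byte
print(make_tag(16, WIRE_VARINT))      # 128, needs two varint bytes
```

Because the low 3 bits are reserved for the wire type, only field numbers 1 to 15 produce a key that fits in a single byte.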

Length-delimited encoding

So far, we’ve seen how to encode values that have static sizes. For example, when encoding an int32, Protobuf takes 4 bytes and turns them into a variable number of bytes. The same is true of the other number types. In this section, we are going to learn how to encode a value that has a dynamic size, that is, a size that can only be determined at runtime.

The types with such a dynamic size are strings and bytes. However, some other parts of Protobuf are encoded with length-delimited encoding: embedded messages and packed repeated fields. We are going to talk about the latter in the next section, but we are going to see strings and embedded messages here.

Let’s look at an example of encoding strings in Protobuf. Once again, we are going to create a message (length_delimited/encoding.proto):

syntax = "proto3";
message Encoding {
  string s = 1;
}

We’re also going to describe the...
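To illustrate the general shape of a length-delimited field, here is a small Python sketch that encodes a string field as key, length, and raw UTF-8 bytes. The helper names and the sample values ("hi" and "é") are assumptions made for this illustration, not the data used in the book.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # 7 bits of data + continuation bit
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode a string as: key (wire type 2), byte length, UTF-8 bytes."""
    payload = text.encode("utf-8")
    key = (field_number << 3) | 2  # 2 is the length-delimited wire type
    return encode_varint(key) + encode_varint(len(payload)) + payload


print(encode_string_field(1, "hi").hex())  # 0a026869
print(encode_string_field(1, "é").hex())   # 0a02c3a9
```

The second call shows that the length prefix counts bytes, not characters: "é" is a single character but takes 2 bytes in UTF-8.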

Packed versus unpacked repeated fields

One last important concept to know is that of packed and unpacked repeated fields. As we know, repeated is the way we describe lists in Protobuf. A repeated modifier can be applied to a scalar numeric type (int32, uint64, and so on) but can also be applied to more complex types (user-defined types, strings, and so on). In proto3, the former will be encoded as a packed repeated field by default, and the latter will be unpacked.

Before going into more detail, let’s visualize the difference between both encodings. Let’s start with a packed repeated field. We will have a list of integers (repeated/encoding.proto):

syntax = "proto3";
message Encoding {
  repeated uint64 us = 1;
}

We can now set some values for it by describing the data in text format (repeated/packed.txtpb):

us: [1, 2, 3, 4, 5]

Now, let’s run the following command:

$ cat packed.txtpb | protoc --encode=Encoding encoding.proto | hexdump...
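To compare both layouts side by side, the Python sketch below encodes the same list of values once as a packed field and once as an unpacked field. The helper names are hypothetical, and the sketch assumes varint-encoded elements, as is the case for uint64.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # 7 bits of data + continuation bit
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_packed(field_number: int, values: list[int]) -> bytes:
    """One key, one length prefix, then all the values back to back."""
    body = b"".join(encode_varint(v) for v in values)
    key = (field_number << 3) | 2  # length-delimited wire type
    return encode_varint(key) + encode_varint(len(body)) + body


def encode_unpacked(field_number: int, values: list[int]) -> bytes:
    """The key is repeated in front of every single value."""
    key = encode_varint((field_number << 3) | 0)  # varint wire type
    return b"".join(key + encode_varint(v) for v in values)


values = [1, 2, 3, 4, 5]
print(encode_packed(1, values).hex())    # 0a050102030405
print(encode_unpacked(1, values).hex())  # 08010802080308040805
```

For these five values, the packed form takes 7 bytes and the unpacked form takes 10, because the unpacked form pays for the key once per element.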

Maps

Finally, we can talk about how maps are encoded in Protobuf. In Chapter 3 on Protobuf text format, I briefly mentioned that a map is a list of objects that contains the key and value fields. In this section, we are going to dive deeper into this and see how maps are encoded.

First, let’s not take for granted that a map is a list of objects. Let’s investigate that. We can define a message containing a map field (map/encoding.proto):

syntax = "proto3";
message Encoding {
  map<string, int32> m = 1;
}

Now, to see how this translates internally, we can turn that proto file into a descriptor file. Protoc has a flag called --descriptor_set_out for doing that. Let’s create a descriptor file called encoding.desc:

$ protoc --descriptor_set_out=encoding.desc encoding.proto

This file contains a binary-encoded FileDescriptorSet, which is a message defined in the descriptor.proto file provided with protoc. Now, we can decode this descriptor...
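If a map is indeed a list of key/value objects, then each entry should look on the wire like a small embedded message with the key in field 1 and the value in field 2. The Python sketch below encodes a single entry under that assumption; the function names and the sample entry ("a" mapped to 1) are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # 7 bits of data + continuation bit
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_map_entry(field_number: int, key: str, value: int) -> bytes:
    """Encode one map<string, int32> entry as an embedded message."""
    key_bytes = key.encode("utf-8")
    entry = (
        encode_varint((1 << 3) | 2)            # entry field 1 (key), length-delimited
        + encode_varint(len(key_bytes))
        + key_bytes
        + encode_varint((2 << 3) | 0)          # entry field 2 (value), varint
        + encode_varint(value & 0xFFFFFFFFFFFFFFFF)
    )
    outer_key = (field_number << 3) | 2        # the map field itself is length-delimited
    return encode_varint(outer_key) + encode_varint(len(entry)) + entry


print(encode_map_entry(1, "a", 1).hex())  # 0a050a01611001
```

A map with several entries would simply repeat this structure once per entry, just like an unpacked repeated field of messages.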

Summary

In this chapter, we learned about the internal serialization/deserialization algorithms. We saw that there are multiple ways to encode integers, which is why Protobuf has so many integer types. After that, we covered length-delimited encoding and how it relates to types such as strings, packed repeated fields, and embedded messages. Finally, we talked about unpacked repeated fields and their overhead.

In the next chapter, we will talk about schema evolution over time. We will lean on all the knowledge that we have right now to understand the problems that we might have when updating schemas and how we can overcome them.

Quiz

Answer the following questions to test your knowledge of this chapter:

Which encoding algorithm outputs a variable number of bytes depending on the value encoded?
1. ZigZag
2. Varint
3. Length-delimited
Which encoding algorithm turns negative numbers into positive ones?
1. Length-delimited
2. Varint
3. ZigZag
What might be a problem with using varint?
1. It can use more bytes than the original 32- and 64-bit integers
2. It will encode negative numbers into 10 bytes
3. All the above
What might be a problem with using ZigZag?
1. It is less efficient at encoding positive numbers than varint
2. It will encode negative numbers into 10 bytes
3. It can use more bytes than the original 32- and 64-bit integers
When should you consider using fixed-sized integers?
1. Never, always prefer using varints
2. When dealing with larger numbers, which would be encoded as more than 4 or 8 bytes
3. When dealing with negative numbers
What is the difference between unpacked and packed repeated fields?
1. Unpacked has overhead in terms of...

Answers

Here are the answers to this chapter’s questions:


Author (1)

Clément Jean

Clément Jean is the CTO of Education for Ethiopia, a start-up focusing on educating K-12 students in Ethiopia. On top of that, he is also an online instructor (on Udemy, Linux Foundation, and others) teaching people about different kinds of technologies. In both his occupations, he deals with technologies such as Protobuf and gRPC and how to apply them to real-life use cases. His overall goal is to empower people through education and technology.