In-Memory Analytics with Apache Arrow

By Matthew Topol
    What do you get with a Packt Subscription?

  • Instant access to this title and 7,500+ eBooks & Videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Chapter 1: Getting Started with Apache Arrow

About this book

Apache Arrow is designed to accelerate analytics and allow the exchange of data across big data systems easily.

In-Memory Analytics with Apache Arrow begins with a quick overview of the Apache Arrow format, before moving on to helping you to understand Arrow’s versatility and benefits as you walk through a variety of real-world use cases. You'll cover key tasks such as enhancing data science workflows with Arrow, using Arrow and Apache Parquet with Apache Spark and Jupyter for better performance and hassle-free data translation, as well as working with Perspective, an open source interactive graphical and tabular analysis tool for browsers. As you advance, you'll explore the different data interchange and storage formats and become well-versed with the relationships between Arrow, Parquet, Feather, Protobuf, Flatbuffers, JSON, and CSV. In addition to understanding the basic structure of the Arrow Flight and Flight SQL protocols, you'll learn about Dremio’s usage of Apache Arrow to enhance SQL analytics and discover how Arrow can be used in web-based browser apps. Finally, you'll get to grips with the upcoming features of Arrow to help you stay ahead of the curve.

By the end of this book, you will have all the building blocks to create useful, efficient, and powerful analytical services and utilities with Apache Arrow.

Publication date:
June 2022
Publisher
Packt
Pages
392
ISBN
9781801071031

 

Chapter 1: Getting Started with Apache Arrow

Regardless of whether you are a data scientist/engineer, a machine learning (ML) specialist, or a software engineer trying to build something to perform data analytics, you've probably heard or read about something called Apache Arrow and either looked for more information or wondered what it was. Hopefully, this book can serve as a springboard both in understanding what Apache Arrow is and isn't, and also as a reference book to be continuously utilized in order to supercharge your analytical capabilities.

For now, let's just start off by explaining what Apache Arrow is and what you will use it for. Following that, we will walk through the Arrow specifications, set up a development environment where you can play around with the Apache Arrow libraries, and walk through a few simple exercises to get a feel for how to use them.

In this chapter, we're going to cover the following topics:

  • Understanding the Arrow format and specifications
  • Why does Arrow use a columnar in-memory format?
  • Learning the terminology and the physical memory layout
  • Arrow format versioning and stability
  • Setting up your shooting range
 

Technical requirements

For the portion of the chapter describing how to set up a development environment for working with the Arrow libraries, you'll need the following:

  • Your preferred Integrated Development Environment (IDE): For example, VSCode, Sublime, Emacs, and Vim
  • Plugins for your desired language (optional but highly recommended)
  • Interpreter or toolchain for your desired language(s):
    • Python 3+: pip and venv and/or pipenv
    • Go 1.16+
    • C++ Compiler (capable of compiling C++11 or newer)
 

Understanding the Arrow format and specifications

According to the Apache Arrow documentation [1]:

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

Well, that's a lot of technical jargon! Let's start from the top. Apache Arrow (just Arrow for brevity) is an open source project from the Apache Software Foundation that is released under the Apache License, Version 2.0 [2]. It was co-created by Dremio and Wes McKinney, the creator of pandas, and first released in 2016. To simplify, Arrow is a collection of libraries and specifications that make it easy to build high-performance software utilities for processing and transporting large datasets. It consists of a collection of libraries related to in-memory data processing, including specifications for memory layouts and protocols for sharing and efficiently transporting data between systems and processes. When we're talking about in-memory data processing, we are talking exclusively about the processing of data in RAM and eliminating slow data accesses wherever possible to improve performance. This is where Arrow excels and provides libraries to support this with utilities for streaming and transportation in order to speed up data access.

When working with data, there are two primary situations to consider, and each has different needs: the in-memory format and the on-disk format. When data is stored on disk, the biggest concerns are the size of the data and the input/output (I/O) cost to read it into the main memory before you can operate on it. As a result, formats for data on disk tend to be focused much more on increasing I/O throughput, such as compressing the data to make it smaller and faster to read into memory. One example of this might be the Apache Parquet format, which is a columnar on-disk file format. Instead of being an on-disk format, Arrow's focus is the in-memory format case, which targets CPU efficiency as the goal, with numerous tactics such as cache locality and vectorization of computation.

The primary goal of Arrow is to essentially become the lingua franca of data analytics and processing, the One Framework to Rule Them All, so to speak. Different databases, programming languages, and libraries tend to implement and use their own separate internal formats for managing data, which means that any time you are moving data between these components for different uses, you're paying a cost to serialize and deserialize that data every time. Not only that, but lots of time and resources get spent reimplementing common algorithms and processing in those different data formats over and over. If instead, we can standardize on an efficient, feature-rich internal data format that can be widely adopted and used, this excess computation and development time is no longer necessary. Figure 1.1 shows a simplified diagram of multiple systems, each with its own data format, having to be copied and/or converted in order for the different components to work with each other:

Figure 1.1 – Copy and convert components

Figure 1.1 – Copy and convert components

In many cases, the serialization and deserialization can end up taking nearly 90% of the processing time in such a system rather than being able to spend that CPU on the analytics. Alternatively, if every component is using Arrow's in-memory format, you end up with a system as in Figure 1.2, where the data can be transferred between components at little-to-no cost. All of the components can either share memory directly or send the data as-is without having to convert between different formats.

Figure 1.2 – Sharing Arrow memory between components

Figure 1.2 – Sharing Arrow memory between components

At this point, there's no need for the different components and systems to implement custom connectors or re-implement common algorithms and utilities. The same libraries and connectors can be utilized, even across programming languages and process barriers, by sharing memory directly to refer to the same data rather than copying multiple times between them.

Most data processing systems now use distributed processing by breaking the data up into chunks and sending those chunks across the network to various workers, so even if we can share memory across processes on a box, there's still the cost to send it across the network. This brings us to the final piece of the puzzle: the format of raw Arrow data on the wire is the same as it is in memory. You can directly reference the memory buffers used for the network protocols without having to deserialize that data before you can use it, or reference the memory buffers you were operating on to send it across the network without having to serialize it first. Just a bit of metadata sent along with the raw data buffers and interfaces that perform zero-copies can be created in order to achieve performance benefits, by reducing memory usage and improving CPU throughput.

Let's quickly recap the features of the Arrow format we were just describing before moving on:

  • Using the same high-performance internal format across components allows much more code reuse in libraries instead of reimplementing common workflows.
  • The Arrow libraries provide mechanisms to directly share memory buffers to reduce copying between processes by using the same internal representation regardless of the language. This is what is being referred to whenever you see the term zero-copy.
  • The wire format is the same as the in-memory format to eliminate serialization and deserialization costs when sending data across networks between components of a system.

Now, you might be thinking well this sounds too good to be true! and of course, being skeptical of promises like this is always a good idea. The community around Arrow has done a ton of work over the years to bring these ideas and concepts to fruition. The project itself provides and distributes libraries in a variety of different programming languages so that projects that want to incorporate and/or support the Arrow format don't need to implement it themselves. Above and beyond the interaction with Arrow-formatted data, the libraries provide a significant amount of utility in assisting with common processes such as data access and I/O-related optimizations. As a result, the Arrow libraries can be useful for projects, even if they don't actually utilize the Arrow format themselves.

Here's just a quick sample of use cases where using Arrow as the internal/intermediate data format can be very beneficial:

  • SQL execution engines (such as Dremio Sonar, Apache Drill, or Impala)
  • Data analysis utilities and pipelines (such as pandas or Spark)
  • Streaming and message queue systems (such as Apache Kafka or Storm)
  • Storage systems and formats (such as Apache Parquet, Cassandra, and Kudu)

As for how Arrow can help you, it depends on which piece of the data puzzle you personally work with. The following are a few different roles that work with data and show how using Arrow could potentially be beneficial; it's by no means a complete list though:

  • If you're a data scientist:
    • You can utilize Arrow via pandas and NumPy integration to significantly improve the performance of your data manipulations.
    • If the tools you use integrate Arrow support, you can gain significant speed-ups to your queries and computations by using Arrow directly yourself to reduce copies and/or serialization costs.
  • If you are a data engineer specializing in extract, transform, and load (ETL):
    • The higher adoption of Arrow as an internal and externally-facing format can make it easier to integrate with many different utilities.
    • By using Arrow, data can be shared between processes and tools with shared memory increasing the tools available to you for building pipelines, regardless of the language you're operating in. You could take data from Python and use it in Spark and then pass it directly to the Java Virtual Machine (JVM) without paying the cost of copying between them.
  • If you are a software engineer or ML specialist building computation tools and utilities for data analysis:
    • Arrow as an internal format can be used to improve your memory usage and performance by reducing serialization and deserialization between components.
    • Understanding how to best utilize the data transfer protocols can improve the ability to parallelize queries and access your data, wherever it might be.
    • Because Arrow can be used for any sort of tabular data, it can be integrated into many different areas of data analysis and computation pipelines, and is versatile enough to be beneficial as an internal and data transfer format, regardless of the shape of your data.

Now that you know what Arrow is, let's dig into its design and how it delivers on the aforementioned promises of high-performance analytics, zero-copy sharing, and network communication without serialization costs. First, you'll see why a column-oriented memory representation was chosen for Arrow's internal format. Afterward, in later chapters, we'll cover specific integration points, explicit examples, and transfer protocols.

 

Why does Arrow use a columnar in-memory format?

Most traditional data processing of tabular data will have its own custom data structures for representing and managing those datasets in memory while processing them, such as query engines and data services, for example. Of course, if there are custom data structures, this means it requires developing custom serialization protocols between file formats, network protocols, libraries, and any other interface you could think of. I can vouch from experience that the result is a huge amount of developer time and CPU cycles being wasted dealing with these various serialization schemes, rather than being able to spend it all on the analytical workloads. One goal of the Arrow project is for fewer systems to have to create their own data structures and utilize Arrow as their internal format. Doing so would allow those components to expose Arrow directly as a wire format and benefit from not having to pay a serialization or deserialization cost to pass the data around.

There is often a lot of debate surrounding whether a database should be row-oriented or column-oriented, but this primarily refers to the on-disk format of the underlying storage files. Arrow's data format is different from most cases discussed so far since it uses a columnar organization of data structures in memory directly. If you're not familiar with columnar as a term, let's take a look at what exactly it means. First, imagine the following table of data:

Figure 1.3 – Sample data table

Figure 1.3 – Sample data table

Traditionally, if you were to read this table into memory, you'd likely have some structure to represent a row and then read the data in one row at a time. Maybe something like struct { string archer; string location; int year }. The result is that you have the memory grouped closely together for each row, which is great if you always want to read all the columns for every row. But, if this were a much bigger table, and you just wanted to find out the minimum and maximum years or any other column-wise analytics such as the unique locations, you would have to read the whole table into memory and then jump around in memory, skipping the fields you didn't care about so that you could read the value for each row of one column.

Most operating systems, while reading data into main memory and CPU caches, will attempt to make predictions about what memory it is going to need next. In our example table of archers, consider how many memory pages of data would have to be accessible and traversed to get a list of unique locations if the data were organized in row or column orientations:

Figure 1.4 – Row versus columnar memory buffers

Figure 1.4 – Row versus columnar memory buffers

A columnar format keeps the data organized by column instead of by row, as shown in the preceding figure. As a result, operations such as grouping, filtering, or aggregations based on column values become much more efficient to perform since the entire column is already contiguous in memory. Considering memory pages again, it's plain to see that for a large table, there would be significantly more pages that need to be traversed to get a list of unique locations from a row-oriented buffer than a columnar one. Fewer page faults and more cache hits mean increased performance and a happier CPU. Computational routines and query engines tend to operate on subsets of the columns for a dataset rather than needing every column for a given computation, making it significantly more efficient to operate on columnar data.

If you look closely at the construction of the column-oriented data buffer on the right side of Figure 1.4, you can see how it benefits the queries I mentioned earlier. If we wanted all the archers that are in Europe, we can easily scan through just the location column and discover which rows are the ones we want, and then spin through just the archer block and grab only the rows that correspond to the row indexes we found. This will come into play again when we start looking at the physical memory layout of Arrow arrays; since the data is column-oriented, it makes it easier for the CPU to predict instructions to execute and maintains this memory locality between instructions.

By keeping the column data contiguous in memory, it enables vectorization of the computations. Most modern processors have single instruction, multiple data (SIMD) instructions available that can be taken advantage of for speeding up computations and require having the data in a contiguous block of memory to operate on it. This concept can be found heavily utilized by graphics cards, and in fact, Arrow provides libraries to take advantage of Graphics Processing Units (GPUs) precisely because of this. Consider the example where you might want to multiply every element of a list by a static value, such as performing a currency conversion on a column of prices with an exchange rate:

Figure 1.5 – SIMD/vectorized versus non-vectorized

Figure 1.5 – SIMD/vectorized versus non-vectorized

From the figure, you can see the following:

  • The left side of the figure shows that an ordinary CPU performing the computation in a non-vectorized fashion requires loading each value into a register, multiplying it with the exchange rate, and then saving the result back into RAM.
  • On the right side of the figure, we see that vectorized computation, such as using SIMD, performs the same operation on multiple different inputs at the same time, enabling a single load to multiply and save to get the result for the entire group of prices. Being able to vectorize a computation has various constraints; often, one of those constraints is requiring the data being operated on to be in a contiguous chunk of memory, which is why columnar data is much easier to do this with.

    SIMD versus Multithreading

    If you're not familiar with SIMD, you may wonder how it differs from another parallelization technique: multithreading. Multithreading operates at a higher conceptual level than SIMD. Each thread has its own set of registers and memory space representing its execution context. These contexts could be spread across separate CPU cores or possibly interleaved by a single CPU core switching whenever it needs to wait for I/O. SIMD is a processor-level concept that refers to the specific instructions being executed. Put simply, multithreading is multitasking and SIMD is doing less work to achieve the same result.

Another benefit of utilizing column-oriented data comes into play when considering compression techniques. At some point, your data will become large enough that sending it across the network could become a bottleneck, purely due to size and bandwidth. With the data being grouped together in columns that are all the same type as contiguous memory, we end up with significantly better compression ratios than we would get with the same data in a row-oriented configuration, simply because data of the same type is easier to compress together than data of different types.

 

Learning the terminology and physical memory layout

As mentioned before, the Arrow columnar format specification includes definitions of the in-memory data structures, metadata serialization, and protocols for data transportation. The format itself has a few key promises, as follows:

  • Data adjacency for sequential access
  • O(1) (constant time) random access
  • SIMD and vectorization friendly
  • Relocatable, allowing for zero-copy access in shared-memory

To ensure we're all on the same page, here's a quick glossary of terms that are used throughout the format specification and the rest of the book:

  • Array: A list of values with a known length of the same type.
  • Slot: The value in an array identified by a specific index.
  • Buffer/contiguous memory region: A single contiguous block of memory with a given length.
  • Physical layout: The underlying memory layout for an array without accounting for the interpretation of the logical value. For example, a 32-bit signed integer array and a 32-bit floating-point array are both laid out as contiguous chunks of memory where each value is made up of four contiguous bytes in the buffer.
  • Parent/child arrays: Terms used for the relationship between physical arrays when describing the structure of a nested type. For example, a struct parent array has a child array for each of its fields.
  • Primitive type: A type that has no child types and so consists of a single array, such as fixed-bit-width arrays (for example, int32) or variable-size types (for example, string arrays).
  • Nested type: A type that depends on one or more other child types. Nested types are only equal if their child types are also equal (for example, List<T> and List<U> are equal if T and U are equal).
  • Logical type: A particular type of interpreting the values in an array that is implemented using a specific physical layout. For example, the decimal logical type stores values as 16 bytes per value in a fixed-size binary layout. Similarly, a timestamp logical type stores values using a 64-bit fixed-size layout.

Now that we've got the fancy words out of the way, let's have a look at how we actually lay out these arrays in memory. An array or vector is defined by the following information:

  • A logical data type (typically identified by an enum value and metadata)
  • A group of buffers
  • A length as a 64-bit signed integer
  • A null count as a 64-bit signed integer
  • Optionally, a dictionary for dictionary-encoded arrays (more on these later in the chapter)

To define a nested array type, there would additionally be one or more sets of this information that would then be the child arrays. Arrow defines a series of logical types and each one has a well-defined physical layout in the specification. For the most part, the physical layout just affects the sequence of buffers that make up the raw data. Since there is a null count in the metadata, it comes as a given that any value in an array may be considered to be null data rather than having a value, regardless of the type. Apart from the union data type, all the arrays have a validity bitmap as one of their buffers, which can optionally be left out if there are no nulls in the array. As might be expected, 1 in the corresponding bit means it is a valid value in that index, and 0 means it's null.

Quick summary of physical layouts, or TL;DR

When working with Arrow formatted data, it's important to understand how it is physically laid out in memory. Understanding the physical layouts can provide ideas for efficiently constructing (or deconstructing) Arrow data when developing applications. Here's a quick summary:

Figure 1.6 – Table of physical layouts

Figure 1.6 – Table of physical layouts

The following is a walkthrough of the physical memory layouts that are used by the Arrow format. This is primarily useful for either implementing the Arrow specification yourself (or contributing to the libraries) or if you simply want to know what's going on under the hood and how it all works.

Primitive fixed-length value arrays

Let's look at an example of a 32-bit integer array that looks like this: [1, null, 2, 4, 8]. What would the physical layout look like in memory based on the information so far (Figure 1.7)? Something to keep in mind is that all of the buffers should be padded to a multiple of 64 bytes for alignment, which matches the largest SIMD instructions available on widely deployed x86 architecture processors (Intel AVX-512), and that the values for null slots are marked UNF or undefined. Implementations are free to zero out the data in null slots if they desire, and many do. But, since the format specification does not define anything, the data in a null slot could technically be anything.

Figure 1.7 – Layout of primitive int32 array

Figure 1.7 – Layout of primitive int32 array

This same conceptual layout is the case for any fixed-size primitive type, with the only exception being that the validity buffer can be left out entirely if there are no nulls in the array. For any data type that is physically represented as simple fixed-bit-width values, such as integers, floating-point values, fixed-size binary arrays, or even timestamps, it will use this layout in memory. The padding for the buffers in the subsequent diagrams will be left out just to avoid cluttering them.

Variable-length binary arrays

Things get slightly trickier when dealing with variable-length value arrays, generally used for variable size binary or string data. In this layout, every value can consist of 0 or more bytes and, in addition to the data buffer, there will also be an offsets buffer. Using an offsets buffer allows the entirety of the data of the array to be held in a single contiguous memory buffer. The only lookup cost for finding the value of a given index is to look up the indexes in the offsets buffer to find the correct slice of the data. The offsets buffer will always contain length + 1 signed integers (either 32-bit or 64-bit, based on the logical type being used) that indicate the starting position of each corresponding slot of the array. Consider the array of two strings: [ "Water", "Rising" ].

Figure 1.8 – Arrow string versus traditional string vector

Figure 1.8 – Arrow string versus traditional string vector

This differs from a lot of standard ways of representing a list of strings in memory in most library models. Generally, a string is represented as a pointer to a memory location and an integer for the length, so a vector of strings is really a vector of these pointers and lengths (Figure 1.8). For many use cases, this is very efficient since, typically, a single memory address is going to be much smaller than the size of the string data, so passing around this address and length is efficient for referencing individual strings.

Figure 1.9 – Viewing string index 1

Figure 1.9 – Viewing string index 1

If your goal is operating on a large number of strings though, it's much more efficient to have a single buffer to scan through in memory. As you operate on each string, you can maintain the memory locality we mentioned before, keeping the memory we need to look at physically close to the next chunk of memory we're likely going to need. This way, we spend less time jumping around different pages of memory and can spend more CPU cycles performing the computations. It's also extremely efficient to get a single string, as you can simply take a view of the buffer by using the address indicated by the offset to create a string object without copying the data.

List and fixed-size list arrays

What about nested formats? Well, they work in a similar way to the variable-length binary format. First up is the variable-size list layout. It's defined by two buffers, a validity bitmap and an offsets buffer, along with a child array. The difference between this and the variable-length binary format is that instead of the offsets referencing a buffer, they are instead indexes into the child array (which could itself potentially be a nested type). The common denotation of list types is to specify them as List<T>, where T is any type at all. When using 64-bit offsets instead of 32-bit, it is denoted as LargeList<T>. Let's represent the following List<Int8> array: [[12, -7, 25], null, [0, -127, 127, 50], []].

Figure 1.10 – Layout of list array

Figure 1.10 – Layout of list array

The first thing to notice in the preceding diagram is that the offsets buffer has exactly one more element than the List array it belongs to since there are four elements to our List<Int8> array and we have five elements in the offsets buffer. Each value in the offsets buffer represents the starting slot of the corresponding list index i. Looking closer at the offsets buffer, we notice that 3 and 7 are repeating, indicating that those lists are either null or empty (have a length of 0). To discover the length of a list at a given slot, you simply take the difference between the offset for that slot and the offset after it: , and the same holds true for the previous variable-length binary format; the number of bytes for a given slot is the difference in the offsets. Knowing this, what is the length of the list at index 2 of Figure 1.10? (remember, 0-based indexes!). With this, we can tell that the list at index 3 is empty because the bitmap has a 1, but the length is 0 (7 – 7). This also explains why we need that extra element in the offsets buffer! We need it to be able to calculate the length of the last element in the array.

Given that example, what would a List<List<Int8>> array look like? I'll leave that as an exercise for you to figure out.

There's also a FixedSizeList<T>[N] type, which works nearly the same as the variable-sized list, except there's no need for an offsets buffer. The child array of a fixed-size list type is the values array, complete with its own validity buffer. The value in slot of a fixed-size list array is stored in an -long slice of the values array, starting at offset . Figure 1.11 shows what this looks like:

Figure 1.11 – Layout of fixed-size list array

Figure 1.11 – Layout of fixed-size list array

What's the benefit of FixedSizeList versus List? Look back at the two diagrams again! Determining the values for a given slot of FixedSizeList doesn't require any lookups into a separate offsets buffer, making it more efficient if you know that your lists will always be a specific size. As a result, you also save space by not needing the extra memory for an offsets buffer at all!

Important Note

One thing to keep in mind is the semantic difference between a null value and an empty list. Using JSON notation, the difference is equivalent to the difference between null and []. The meaning of such a difference would be up to a particular application to decide, but it's important to note that a null list is not identical to an empty list, even though the only difference in the physical representation is the bit in the validity bitmap.

Phew! That was a lot. We're almost done!

Struct arrays

The next type on our tour of the Arrow format is the struct type's layout. A struct is a nested type that has an ordered sequence of fields that can all have their own distinct types. It's semantically very similar to a simple object with attributes that you might find in a variety of programming languages. Each field must have its own UTF-8 encoded name, and these field names are part of the metadata for defining a struct type. Instead of having any physical storage allocated for its values, a struct array has one child array for each of its fields. All of these children arrays are independent and don't need to be adjacent to each other in memory; remember our goal is column- (or field-) oriented, not row-oriented. A struct array must, however, have a validity bitmap if it contains one or more null struct values. It can still contain a validity bitmap if there are no null values, it's just optional in that case.

Let's use the example of a struct with the following structure: Struct<name: VarBinary, age: Int32>. An array of this type would have two child arrays, one VarBinary array (a variable-sized binary layout), and one 4-byte primitive value array having a logical type of Int32. With this definition, we can map out a representation of the array: [{"joe", 1}, {null, 2}, null, {"mark", 4}].

Figure 1.12 – Layout of struct array

Figure 1.12 – Layout of struct array

When an entire slot of the struct array is set to null, the null is represented in the parent's validity bitmap, which is different from a particular value in a child array being null. In Figure 1.12, the child arrays each have a slot for the null struct in which they could have any value at all, but would be hidden by the struct array's validity bitmap marking the corresponding struct slot as null and taking priority over the children.

Union arrays – sparse and dense

For the case when a single column could have multiple types, there exists the Union type array. Whereas the struct array is an ordered sequence of fields, a union type is an ordered sequence of types. The value in each slot of the array could be of any of these types, which are named like struct fields and included in the metadata of the type. Unlike other layouts, the union type does not have its own validity bitmap. Instead, each slot's validity is determined by the children, which are composed to create the union array itself. There are two distinct union layouts that can be used when creating an array: dense and sparse, each optimized for a different use case.

A dense union represents a mixed-type array with 5 bytes of overhead for each value, and contains the following structures:

  • One child array for each type
  • A types buffer: A buffer of 8-bit signed integers, with each value representing the type ID for the corresponding slot, indicating which child vector to read from for that slot
  • An offsets buffer: A buffer of signed 32-bit integers, indicating the offset into the corresponding child's array for the type in each slot

The dense union allows for the common use case of a union of structs with non-overlapping fields: Union<s1: Struct1, s2: Struct2, s3: Struct3……>. Here's an example of the layout for a union of type Union<f: float, i: int32> with the values [{f=1.2}, null, {f=3.4}, {i=5}]:

Figure 1.13 – Layout of dense union array

Figure 1.13 – Layout of dense union array

A sparse union has the same structure as the dense, except without an offsets array, as each child array is equal in length to the union itself. Figure 1.14 shows the same union array from Figure 1.13 as a sparse union array. There's no offsets buffer; both children are the same length of 4 as opposed to being different lengths:

Figure 1.14 – Layout of sparse union array

Figure 1.14 – Layout of sparse union array

Even though a sparse union takes up significantly more space compared to a dense union, it has some advantages for specific use cases. In particular, a sparse union is much more easily used with vectorized expression evaluation in many cases, and a group of equal length arrays can be interpreted as a union by only having to define the types buffer. When interpreting a sparse union, only the slot in a child indicated by the types array is considered; the rest of the unselected values are ignored and could be anything.

Dictionary-encoded arrays

Next, we arrive at the layout for dictionary-encoded arrays. If you have data that has many repeated values, then significant space can potentially be saved by using dictionary encoding to represent the data values as integers referencing indexes into a dictionary that usually consists of unique values. Since a dictionary is an optional property on any array, any array can be dictionary-encoded. The layout of a dictionary-encoded array is that of a primitive integer array of non-negative integers, which each represent the index in the dictionary. The dictionary itself is a separate array with its own respective layout of the appropriate type.

For example, let's say you have the following array: ["foo", "bar", "foo", "bar", null, "baz"]. Without dictionary encoding, we'd have an array that looks like this:

Figure 1.15 – String array without dictionary encoding

Figure 1.15 – String array without dictionary encoding

If we add dictionary encoding, we just need to get the unique values and create an array of indexes that references a dictionary array. The common case is to use int32, but any integral type would work:

Figure 1.16 – Dictionary-encoded string array

Figure 1.16 – Dictionary-encoded string array

For this trivial example, it's not particularly enticing, but it's very clear how, in the case of an array with a lot of repeated values, this could be a significant memory usage improvement. You can even perform operations directly on a dictionary array, updating the dictionary if needed or even swapping out the dictionary and replacing it.

As written in the specification, a dictionary is allowed to contain duplicates and even null values. However, the null count of a dictionary-encoded array is dictated by the validity bitmap of the indices, regardless of any nulls that might be in the dictionary itself.

Null arrays

Finally, there is only one more layout, but it's simple: a null array. A null array is an optimized layout for an array of all null values, with the type set to null; the only thing it contains is a length, no validity bitmap, and no data buffer.

How to speak Arrow

We've mentioned a few of the logical types already when describing the physical layouts, but let's get a full description of the current available logical types, as of Arrow release version 7.0.0, for your reading pleasure. In general, the logical types are what is referred to as the data type of an array in the libraries rather than the physical layouts. These types are what you will generally see when working with Arrow arrays in code:

  • Null logical type: Null physical type
  • Boolean: Primitive array with data represented as a bitmap
  • Primitive integer types: Primitive, fixed-size array layout:
    • Int8, Uint8, Int16, Uint16, Int32, Uint32, Int64, and Uint64
  • Floating-point types: Primitive fixed-size array layout:
    • Float16, Float32 (float), and Float64 (double)
  • VarBinary types: Variable length binary physical layout:
    • Binary and String (UTF-8)
    • LargeBinary and LargeString (variable length binary with 64-bit offsets)
  • Decimal128 and Decimal256: 128-bit and 256-bit fixed-size primitive arrays with metadata to specify the precision and scale of the values
  • Fixed-size binary: Fixed-size binary physical layout
  • Temporal types: Primitive fixed-size array physical layout
    • Date types: Dates with no time information:
      • Date32: 32-bit integers representing the number of days since the Unix epoch (1970-01-01)
      • Date64: 64-bit integers representing milliseconds since the Unix epoch (1970-01-01)
    • Time types: Time information with no date attached:
      • Time32: 32-bit integers representing elapsed time since midnight as seconds or milliseconds. A unit specified by metadata.
      • Time64: 64-bit integers representing elapsed time since midnight as microseconds or nanoseconds. A unit specified by metadata.
    • Timestamp: 64-bit integer representing the time since the Unix epoch, not including leap seconds. Metadata defines the unit (seconds, milliseconds, microseconds, or nanoseconds) and, optionally, a time zone as a string.
    • Interval types: An absolute length of time in terms of calendar artifacts:
      • YearMonth: Number of elapsed whole months as a 32-bit signed integer.
      • DayTime: Number of elapsed days and milliseconds as two consecutive 4-byte signed integers (8-bytes total per value).
      • MonthDayNano: Elapsed months, days, and nanoseconds stored as contiguous 16-byte blocks. Months and days as two 32-bit integers and nanoseconds since midnight as a 64-bit integer.
      • Duration: An absolute length of time not related to calendars as a 64-bit integer and a unit specified by metadata indicating seconds, milliseconds, microseconds, or nanoseconds.
  • List and FixedSizeList: Their respective physical layouts:
    • LargeList: A list type with 64-bit offsets
  • Struct, DenseUnion, and SparseUnion types: Their respective physical layouts
  • Map: A logical type that is physically represented as List<entries: Struct<key: K, value: V>>, where K and V are the respective types of the keys and values in the map:
    • Metadata is included indicating whether or not the keys are sorted.

Whenever speaking about the types of an array from an application or semantic standpoint, we will always be using the types indicated in the preceding list to describe them. As you can see, the logical types make it very easy to represent both flat and hierarchical types of data. Now that we've covered the physical memory layouts, let's have a quick word about the versioning and stability of the Arrow format and libraries.

 

Arrow format versioning and stability

In order to ensure confidence that updating the version of the Arrow library in use won't break applications and the long-term stability of the Arrow project, there are two versions used to describe each release of the project: The format version and the library version. Different library implementations and releases can have different versions, but will always be implementing a specific format version. From version 1.0.0 onward, semantic versioning is used with releases.

Provided the major version of the format is the same between two libraries, any new library is backward-compatible with any older library with regards to being able to read data and metadata produced by an older library. Increases in the minor version of the format, such as an increase from version 1.0.0 to version 1.1.0, indicate new features that were added. As long as these new features are not used (such as new logical types or physical layouts), older libraries will be able to read data and metadata produced by newer versions of the libraries.

As far as the long-term stability of the format and libraries, only increases in the major version of the format would indicate any issue with the previous guarantees about compatibility. The Arrow project says that they do not expect this to be a frequent occurrence, rather it would be an exceptional event, in which case such a release would exercise caution for deployment. As a result of these compatibility guarantees, it ends up being safe and simple to ensure backward and forward compatibility when using the Arrow libraries and format.

 

Would you download a library? Of course!

As mentioned before, the Arrow project contains a variety of libraries for multiple programming languages. These official libraries enable anyone to work with Arrow data without having to implement the Arrow format themselves, regardless of the platform and programming language they are utilizing. There are two primary types of libraries that exist so far: ones that are distinct implementations of the Arrow specification, and ones that are built on other implementations. As of the time of writing this book, there are currently implementations for Arrow in C++ [3], C# [4], Go [5], Java [6], JavaScript [7], Julia [8], and Rust [9], which are all distinct implementations.

On top of those, there are libraries for C (Glib) [10], MATLAB [11], Python [12], R [13], and Ruby[14], which are all built on top of the C++ library, which happens to have the most active development. As you might expect, the various implementations all have different stages as far as what features and aspects of the specification are implemented, and the documentation helpfully provides an implementation matrix showing what features are implemented in which libraries. The implementation matrix [15] is then updated as these aspects of the specification and features are implemented in a given library.

With so many different implementations, you might be concerned about interoperability between them. As a result, the various library versions are integration tested via automated continuous integration (CI) jobs in order to ensure this interoperability among them. Depending on the language and development, these libraries are tested on a very large variety of platforms, including but not limited to the following:

  • x86/x86-64
  • arm64
  • s390x (IBM Mainframes)
  • macOS
  • Windows 32 and 64 bit
  • Debian/Ubuntu/Red Hat/CentOS

These libraries are deployed with their various respective package managing methods to attempt to make it as easy as possible to acquire and download the libraries. As a result, there's been significant adoption of Arrow, whether you're a data scientist using pandas, numpy, or Dask, or you're performing calculations and analytics using Apache Spark or AirFlow. And, if you're looking to get the libraries so you can try them out for yourself, the Apache Software Foundation hosts various ways to download and acquire the libraries.

Some of the channels where the libraries are made available are as follows:

When developing something that will utilize the Arrow libraries, keep the terms that were mentioned a few pages ago in mind, as most of the libraries utilize similar terminology and naming for describing their Application Programming Interfaces (APIs).

 

Setting up your shooting range

By now, you should have a pretty solid understanding of what Arrow is, the basics of how it's laid out in memory, and the basic terminology. So now, let's set up a development environment where you can test out and play with Arrow. For the purposes of this book, I'm going to primarily focus on the three libraries that I'm most familiar with: the C++ library, the Python library, and the Go library. While the basic concepts will apply to all of the implementations, the precise APIs may differ between them so, armed with the knowledge gained so far, you should be able to make sense of the documentation for your preferred language, even without precise examples for that language being printed here.

For each of C++, Python, and Go, after the instructions for installing the Arrow library, I'll go through a few exercises to get you acquainted with the basics of using the Arrow library in that language.

Using pyarrow For Python

With data science being a primary target of Arrow, it's no surprise that the Python library tends to be the most commonly used and interacted with by developers. Let's start with a quick introduction to setting up and using the pyarrow library for development.

Most modern IDEs provide plugins with exceptional Python support so you can fire up your preferred Python development IDE. I highly recommend using one of the methods for creating virtual environments with Python, such as pipenv, venv, or virtualenv, for setting up your environment. After creating that virtual environment, in most cases, installing pyarrow is as simple as using pip to install it:

$ pipenv install pyarrow # this or
$ python3 -m venv arrow_playground && pip3 install pyarrow # this

It's also possible that, depending on your settings and platform, pip may attempt to build pyarrow locally. You can use the –-prefer-binary or –-only-binary arguments to tell pip to install the pre-build binary package rather than build from source:

$ pip3 install pyarrow --only-binary pyarrow

Alternately to using pip, Conda [16] is a common toolset utilized by data scientists and engineers, and the Arrow project provides binary Conda packages on conda-forge [17] for Linux, macOS, and Windows for Python 3.6+. You can install it with Conda and conda-forge as follows:

$ conda install pyarrow=6.0.* -c conda-forge

Understanding the basics of pyarrow

With the package installed, first let's confirm that the package installed successfully by opening up the Python interpreter and trying to import the package:

>>> import pyarrow as pa
>>> arr = pa.array([1,2,3,4])
>>> arr
<pyarrow.lib.Int64Array object at 0x0000019C4EC153A8>
[
  1,
  2,
  3,
  4
]

The important piece here to note is the highlighted lines where we import the library and create a simple array, letting the library determine the type for us, which it decides on using Int64 as the logical type.

Now that we've got a working installation of the pyarrow library, we can create a small example script to generate some random data and create a record batch:

import pyarrow as pa
import numpy as np
NROWS = 8192
NCOLS = 16
data = [pa.array(np.random.randn(NROWS)) for i in range(NCOLS)]
cols = ['c' + str(i) for i in range(NCOLS)] 
rb = pa.RecordBatch.from_arrays(data, cols)
print(rb.schema)
print(rb.num_rows)

Going through this trivial example, this is what happens:

  1. First, the numpy library is used to generate a bunch of data to use for our arrays. Calling pa.array(values), where values is a list of values for the array, will construct an array with the library inferring the logical type to use.
  2. Next, a list of strings in the style of 'c0', 'c1', 'c2'… is created as names for the columns.
  3. Finally, the highlighted line is where we construct a record batch from this random data, and then the subsequent two lines print out the schema and the number of rows.

We have got a new term here, record batch! A record batch is a common concept used when interacting with Arrow that we'll see show up in many places and refers to a group of equal length arrays and a schema. Often, a record batch will be a subset of rows of a larger dataset with the same schema. Record batches are a useful unit of parallelization for operating on data, as we'll see more in-depth in later chapters. That said, a record batch is actually very similar to a struct array when you think about it. Each field in a struct array can correspond to a column of the record batch. Let's use our archer example from earlier:

Figure 1.17 – Archer struct array

Figure 1.17 – Archer struct array

Since we're talking about a struct array, it will use the struct physical layout: an array with one child array for each field of our struct. This means that to refer to the entire struct at index i, you simply get the value at index i from each of the child arrays in the same way that if you were looking at a record batch; you do the same thing to get the semantic row at index i (Figure 1.17).

When constructing such an array, there are a couple of ways to think about how it would look in code. You could build up your struct array by building all three children simultaneously in a row-based fashion, or you could build up the individual child arrays completely separately and then just semantically group them together as a struct array with the column names. This shows another benefit of using columnar-based in-memory handling of this type of structure: each column could potentially be built in parallel and then brought back together at the end without the need for any extraneous copies. Parallelizing in a row-oriented fashion would typically be done by grouping batches of these records together and operating on the batches in parallel, which can still be done with the column-oriented approach, providing extra avenues of parallelization that wouldn't have existed in a row-oriented solution.

Building a struct array

The following steps describe how to construct a struct array from your data using a Python dictionary, but the data itself could come from anywhere, such as a JSON or CSV file:

  1. First, let's create a dictionary of our archers from previously to represent our data:
    archer_list = [{
          'archer': 'Legolas', 
          'location': 'Mirkwood', 
          'year': 1954,
        },{
           'archer': 'Oliver',
           'location': 'Star City',
           'year': 1941,
        }, ……]

The rest of the values in this list are just the values from all the way back in Figure 1.3!

  1. Then, we define a data type for our struct array:
    archer_type = pa.struct([('archer', pa.utf8()),
                             ('location', pa.utf8()),
                             ('year', pa.int16())])
  2. Now, we can construct the struct array itself:
    archers = pa.array(archer_list, type=archer_type)
    print(archers.type)
    print(archers)

    Data Types

    See the usage of pa.utf8() and pa.int16()? These usages are creating data type instances with the data types API. Specifying a list would be pa.list_(t1), where t1 is some other type, just as we're doing here with pa.struct; check the documentation [18] for the full listing.

The output is as follows (assuming you pulled the data from Figure 1.3 as I said):

struct<archer: string, location: string, year: int16>
-- is_valid: all not null
-- child 0 type: string
  [
    "Legolas",
    "Oliver",
    "Merida",
    "Lara",
    "Artemis"
  ]
-- child 1 type: string
  [
    "Mirkwood",
    "Star City",
    "Scotland",
    "London",
    "Greece"
  ]
-- child 2 type: int16
  [
    1954,
    1941,
    2012,
    1996,
    -600
  ]

Do you recognize the similarity between the printed struct data and our earlier example of columnar data?

Using record batches and zero-copy manipulation

Often, after ingesting some data, there is still a need to further clean or reorganize it before running whatever processing or analytics you need to do. Being able to rearrange and move around the structure of your data like this with Arrow without having to make copies also results in some significant performance improvements over other approaches. To exemplify how we can optimize memory usage when utilizing Arrow, we can take the arrays from the struct array we created and easily flatten them into a record batch without any copies being made. Let's take the struct array of archers and flatten it into a record batch:

# archers is the struct array created earlier, flatten() returns
# the fields of the struct array as a python list of array objects
# remember 'pa' is from import pyarrow as pa
rb = pa.RecordBatch.from_arrays(archers.flatten(), 
                                ['archer', 'location', 'year'])
print(rb)
print(rb.num_rows) # prints 5
print(rb.num_columns) # prints 3

Since our struct array was 3 fields and had a length of 5, our record batch will have five rows and three columns. Record batches require having a schema defined, which is similar to defining a struct type; it's a list of fields, each with a name, a logical type, and metadata. The highlighted print statement in the preceding code to print out the record batch will just print the schema of the record batch:

pyarrow.RecordBatch
archer: string
location: string
year: int16

The record batch we created holds references to the exact same arrays we created for the struct array, not copies, which makes this a very efficient operation, even for very large data sets. Cleaning, restructuring, and manipulating raw data into a more understandable or easier to work with format is a common task for data scientists and engineers. One of the strengths of using Arrow is that this can be done efficiently and without making copies of the data.

Another common situation when working with data is when you only need a particular slice of your dataset to work on, rather than the entire thing. As before, the library provides a slice function for slicing record batches or arrays without copying memory. Think back to the structure of the arrays; because any array has a length, null count, and sequence of buffers, the buffers that are used for a given array can be slices of the buffers from a larger array. This allows working with subsets of data without having to copy it around.

Figure 1.18 – Making a slice

Figure 1.18 – Making a slice

A slice of a record batch is just slicing each of the constituent arrays which make it up; the same goes for any array of a nested type. Using our previous example, we use the following:

slice = rb.slice(1,3) # (start, length)
print(slice.num_rows) # prints 3 not 5
print(rb.column(0)[0]) # <pyarrow.StringScalar: 'Legolas'>
print(slice.column(0)[0]) # <pyarrow.StringScalar: 'Oliver'>

There's also a shortcut syntax for slicing an array, which should be comfortable for Python developers since it matches the same syntax for slicing a Python list:

archerslice = archers[1:3] # slice of length 2 viewing indexes 1
# and 2 from the struct array, so it slices all three arrays

One thing that does make a copy though is to convert Arrow arrays back to native Python objects for use with any other Python code that isn't using Arrow. Just like I mentioned back at the beginning of this chapter, shifting between different formats instead of the libraries all using the same one has costs to copy and convert the data:

print(rb.to_pydict()) # prints dictionary {column: list<values>}
print(archers.to_pylist()) # prints the same list of                              dictionaries
                           # we started with

Both of the preceding calls, to_pylist and to_pydict, perform copies of the data in order to put them into native Python object formats and should be used sparingly with large datasets.

Handling none values

The last thing to mention is the handling of null values. The None Python object is always converted to an Arrow null element when converting to an array, and vice versa when converting back to native Python objects.

An exercise for you

To get a feel for what real usage of the library might look like, here's an exercise to try out. You can find the solution along with the full code for any examples in the book in a GitHub repository located at https://github.com/PacktPublishing/In-Memory-Analytics-with-Apache-Arrow-:

  • Take a (row-wise) list of objects with the following structure and convert them to a column-oriented record batch:
    { id: int, cost: double, cost_components: list<double> }

An example might be { "id": 4, "cost": 241.21, "cost_components": [ 100.00, 140.10, 1.11] } for a single object.

  • Now that you've converted the row-based data in the list to a column-oriented Arrow record batch, do the reverse and convert the record batch back into the row-oriented list representation.

Now, let's take a look at the C++ library.

C++ for the 1337 coders

Due to the nature of C++, the setup potentially isn't as straightforward as Python or Go. There are a few different routes you can use to install the development headers and libraries, along with the necessary dependencies depending on your desired platform.

Technical requirements for using C++

Before we can develop, you need to first install the Arrow library on your system. The process is obviously going to differ based on the operating system you're using:

  • If you are using Windows, you will need one of the following, along with either Visual Studio, C++, or Mingw gcc/g++ as your compiler:
    • Conda: Replace 7.0.0 with the version you wish to install:
      conda install arrow-cpp=7.0.0 -c conda-forge
    • MSYS2 [19]: After installing MSYS2, you can use pacman to install the libraries:
      • For 64-bit:
      pacman -S --noconfirm mingw-w64-x86_64-arrow
      • For 32-bit:
      pacman -S -–noconfirm mingw-w64-i686-arrow
    • vcpkg: This is kept up to date by Microsoft team members and community contributors:
      git clone https://github.com/Microsoft/vcpkg.git
      cd vcpkg
      ./bootstrap-vcpkg.sh
      ./vcpkg integrate install
      ./vcpkg install arrow

Build it from source yourself: https://arrow.apache.org/docs/developers/cpp/windows.html.

Whichever way you decide to install the libraries, you need to add the path to where it installs the libraries to your environment path in order for them to be found at runtime.

  • If using macOS and you don't want to build it yourself from source, you can use Homebrew [20] to install the library:
    brew install apache-arrow
  • If using Linux and you don't want to build it from source:

Packages for Debian GNU/Linux, Ubuntu, CentOS, Red Hat Enterprise Linux, and Amazon Linux are provided via APT and Yum repositories.

Rather than cover all of the instructions here, all of the installation instructions for the C++ library can be found at https://arrow.apache.org/install/. Only libarrow-dev is needed for the exercises in this chapter.

Once you've got your environment all set up and configured, let's take a look at the code. When compiling it, the easiest route is to use pkg-config if it's available on your system, otherwise, make sure you've added the correct include path and link against the Arrow library with the appropriate options (-I<path to arrow headers> -L<path to arrow libraries> -larrow).

Just like with the Python examples, let's start with a very simple example to walk through the API of the library.

Understanding the basics of the C++ library

Let's do the same first example in C++ that we did in Python:

#include <arrow/api.h>
#include <arrow/array.h>
#include <iostream>
int main(int argc, char** argv) {
    std::vector<int64_t> data{1,2,3,4};
    auto arr = std::make_shared<arrow::Int64Array>(data.size(), arrow::Buffer::Wrap(data));
    std::cout << arr->ToString() << std::endl;
}

Just like the Python example previously, this outputs the following:

[
  1,
  2,
  3,
  4,
]

Let's break down the highlighted line in the source code and explain what we did.

After creating std::vector of int64_ts to use as an example, we initialize std::shared_ptr to Int64Array by specifying the array length, or the number of values, and then wrapping the raw contiguous memory of the vector in a buffer for the array to use as its value buffer. It's important to note that using Buffer::Wrap does not copy the data, instead we're just referencing the memory that is used for the vector and using that same block of memory for the array. Finally, we use the ToString method of our array to create a string representation that we then output. Pretty straightforward, but also very useful in terms of getting used to the library and confirming your environment is set up properly.

When working with the C++ library, the Builder Pattern is commonly used for efficient construction of arrays. We can do the same random data example in C++ that we did earlier using Python, although it's a bit more verbose. Instead of numpy, we can just use the std library's normal distribution generator:

#include <random>
// later on
std::random_device rd{};
std::mt19937 gen{rd()};
std::normal_distribution<> d{5, 2};

Once we've done this setup, we can use d(gen) to produce random 64-bit float (or double) values. All that's left is to feed them into a builder and generate the arrays and a schema since, in order to create a record batch, you need to provide a schema.

First, we create our builder:

#include <arrow/builder.h>
auto pool = arrow::default_memory_pool();
arrow::DoubleBuilder builder{arrow::float64(), pool};

Just like how in Python we had pa.utf8() and pa.int16(), arrow::float64() returns a DataType object that is used to denote the logical type to use for this array. There's also the usage of the default_memory_pool() function, which returns the current global memory pool that this instance of Arrow has. The memory pool will get cleaned up at the process exit, and different pools can be created if needed, but in the majority of cases, just using the default one will be sufficient.

Now that we have our random number generator and our builder, let's create those arrays with random data:

#include <arrow/record_batch.h>
// previous code sections go here
constexpr auto ncols = 16;
constexpr auto nrows = 8192;
arrow::ArrayVector columns(ncols);
arrow::FieldVector fields;
for (int i = 0; i < ncols; ++i) {
     for (int j = 0; j < nrows; ++j) {
          builder.Append(d(gen));
     }
     auto status = builder.Finish(&columns[i]);
     if (!status.ok()) {
          std::cerr << status.message() << std::endl;
        // handle the error
    }
      fields.push_back(arrow::field("c" + std::to_string(i),
                       arrow::float64()));
}
auto rb = arrow::RecordBatch::Make(arrow::schema(fields), 
            columns[0]->length(), columns);
std::cout << rb->ToString() << std::endl;

The most important lines are highlighted showing the population of the arrays and the record batch creation. Calling Builder::Finish also resets the builder so that it can be re-used to build more arrays of the same type. We also use a vector of fields to construct a schema that we use to create the record batch. After this, we can perform whatever operations we wish on the record batch, such as rearranging, flattening, or unflattening columns, performing aggregations or calculations on the data, or maybe just calling ToString to write out the data to the terminal.

Building a struct array, again

When building nested type arrays in C++, it's a little more complex when working with the builders. We can do the same struct example for our archers that we did in Python! If you remember, a struct array is essentially just a collection of children arrays that are the same size and a validity bitmap. This means that one way to build a struct array would be to simply build each constituent array as previously, and construct the struct array using them:

  1. Let's first mention include and some using statements for convenience, along with our initial data:
    #include <arrow/api.h>
    using arrow::field; 
    using arrow::utf8; 
    using arrow::int16;
    // vectors of archer data to start with
    std::vector<std::string> archers{"Legolas", "Oliver", "Merida", "Lara", "Artemis"};
    std::vector<std::string> locations{"Mirkwood", "Star City", "Scotland", "London", "Greece"};
    std::vector<int16_t> years{1954, 1941, 2012, 1996, -600};
  2. Now, we construct the constituent Arrow arrays that will make up our final struct array:
    arrow::ArrayVector children;
    children.resize(3);
    arrow::StringBuilder str_bldr;
    str_bldr.AppendValues(archers); 
    str_bldr.Finish(&children[0]); // resets the builder
    str_bldr.AppendValues(locations); // re-use it!
    str_bldr.Finish(&children[1]);
    arrow::Int16Builder year_bldr;
    year_bldr.AppendValues(years);
    year_bldr.Finish(&children[2]);
  3. Finally, with our children arrays constructed, we can define the struct array:
    arrow::StructArray arr{arrow::struct_({
         field("archer", utf8()), 
         field("location", utf8()), 
         field("year", int16())}), 
         children[0]->length(), children};
    std::cout << arr.ToString() << std::endl;

You can see the similarities to the Python version. We create our struct array by creating the struct type and defining the fields and types for each field, and then just hand it references to the child arrays that it needs. Being able to do this makes building up or splitting apart struct arrays extremely efficient and easy to do, regardless of the complexity of the types. Also, remember that it's not copying the data; the resulting StructArray just references the children arrays instead.

Rather, if you have your data and want to build out the struct array from scratch, we can use StructBuilder. It's very similar to our previous builder example, except the builders for the individual fields are owned by StructBuilder itself and we can build them all up together at one time. This is pretty straightforward and easy if there are no null structs since the validity bitmap can be left out, but if there are any nulls, we need to make sure that the builder is aware of them in order to build the bitmap (see Figure 1.12 for a reminder of the layout of a struct array in memory):

  1. First, we create our data type:
    using arrow::field;
    std::shared_ptr<arrow::DataType> st_type = 
         arrow::struct_({field("archer", arrow::utf8()),
                         field("location", arrow::utf8()),
                         field("year", arrow::int16())});
  2. Now, we create our builder:
    std::unique_ptr<arrow::ArrayBuilder> tmp;
    // returns a status, handle the error case
    arrow::MakeBuilder(arrow::default_memory_pool(), 
                       st_type, &tmp); 
    std::shared_ptr<arrow::StructBuilder> builder;
    builder.reset(
    static_cast<arrow::StructBuilder*>(tmp.release()));

Some notes to keep in mind with the highlighted lines are as follows:

  • By using the MakeBuilder call as seen in the highlighted line, the builders for our fields will be automatically created for us. It will use the data type that is passed in to determine the correct builder type to construct.
  • Then, in the second highlighted line, we cast our pointer to ArrayBuilder to a StructBuilder pointer.
  1. Now we can append the data we need to, and since we know the types of the fields, we can just use the same technique of casting pointers in order to be able to use the field builders. Since they are all owned by the struct builder itself, we can just use raw pointers:
    using namespace arrow;
    StringBuilder* archer_builder = 
        static_cast< StringBuilder*>(builder->field_builder(0));
    StringBuilder* location_builder = 
        static_cast<StringBuilder*>(builder->field_builder(1));
    Int16Builder* year_builder = 
        static_cast<Int16Builder*>(builder->field_builder(2));
  2. Finally, now that we've got our individual builders, we can append whatever values we need to them as long as we make sure that when we call Finish on the struct builder, all of the field builders must have the same number of values. If there are any null structs, you can call the Append, AppendNull, or AppendValues functions on the struct builder to indicate which indexes are valid and which are null. Just as with the field builders, this must either be left out entirely (if there are no nulls) or equal to the same number of values in each of the fields.
  3. And, of course, the last step, just like before, is to call Finish on the struct builder:
    std::shared_ptr<arrow::Array> out;
    builder->Finish(&out);
    std::cout << out->ToString() << std::endl;

Now that we've covered building arrays in C++, here's an exercise for you to try out!

An exercise for you

Try doing the same exercise from the Python section but with C++, converting std::vector<row> to an Arrow record batch where row is defined as the following:

struct row {
    int64_t id;
    double cost;
    std::vector<double> cost_components;
};

Then, write a function to convert the record batch back into the row-oriented representation of std::vector<row>.

Go Arrow go!

The Golang Arrow library is the one I've been most directly involved in the development of and is also very easy to install and use, just like the pyarrow library. Most IDEs will have a plugin for developing in Go, so you can set up your preferred IDE and environment for writing code, and then the following commands will set you up with downloading the Arrow library for import:

$ mkdir arrow_chapter1 && cd arrow_chapter1
$ go mod init arrow_chapter1
$ go get -u github.com/apache/arrow/go/v7/[email protected]

Tip

If you're not familiar with Go, the Tour of Go is an excellent introduction to the language and can be found here: https://tour.golang.org/.

By this point, I think you can guess what our first example is going to be; just create a file with the .go extension in the directory you created:

package main
import (
    "fmt"
    "github.com/apache/arrow/go/v7/arrow/array"
    "github.com/apache/arrow/go/v7/arrow/memory"
)
func main() {
    bldr := array.NewInt64Builder(memory.DefaultAllocator)
    defer bldr.Release()
    bldr.AppendValues([]int64{1, 2, 3, 4}, nil)
    arr := bldr.NewArray()
    defer arr.Release()
    fmt.Println(arr)
}

Just as we started with the C++ and Python libraries, this is a minimal Go file that creates an Int64 array with the values [1, 2, 3, 4] and prints it out to the terminal. The builder pattern that we saw in the C++ library is also the same pattern that the Go library utilizes; the big difference between them is the highlighted lines. You can run the example with the go run command:

$ go run .
[1, 2, 3, 4]

Because Go is a garbage-collected language, there's less direct control over exactly when a value is cleaned up or memory is deallocated. While in C++ we have shared_ptr and unique_ptr objects, there is not an equivalent construct in Go. To allow that more granular control, the library adds function calls for Retain and Release on most of the constructs such as arrays. These present a way to perform reference counting on your values using Retain to ensure that the underlying data stays alive, particularly when passing through channels or other cases where internal memory might get undesirably garbage-collected, and Release to free up the internal references to the memory so it can get garbage-collected earlier than the array object itself. If you're unfamiliar with it, the defer keyword marks a function to be called just before the enclosing function, not necessarily the scope, ends. Calls that are deferred will execute in the reverse order that they appear in code, similar to C++ destructors.

Let's create the same second example, populating arrays with random data and creating a record batch:

  1. We can import the standard rand library for generating our random values. Technically, it generates a pseudo-random value between 0 and 1.0 (not including 1) that we could combine with some math to increase the range of values, but for the purposes of this example, that's not necessary:
    import (
            "math/rand"
            …
          "github.com/apache/arrow/go/v7/arrow"
          …
    )
  2. Next, we just create our builder that we can use and re-use to append values to, just like before:
    fltBldr := array.NewFloat64Builder(memory.DefaultAllocator)
    defer fltBldr.Release()

The usage of memory.DefaultAllocator is equivalent to the call of arrow::default_memory_pool in the C++ library, referring to a default allocator that is initialized for the process. Alternately, you could call memory.NewGoAllocator or otherwise.

  1. As in the C++ example, we need a list of column arrays and a list of fields to build a schema from, so let's create the slices:
    const ncols = 16
    columns := make([]arrow.Array, ncols)
    fields := make([]arrow.Field, ncols)
  2. Then, we can add our random data to the builder and create our columns:
    const nrows = 8192
    for i := range columns {
            for j := 0; j < nrows; j++ {
            fltBldr.Append(rand.Float64())
        }
        columns[i] = fltBlder.NewArray()
        defer columns[i].Release()
        fields[i] = arrow.Field{
                      Name: "c" + strconv.Itoa(i),
                      Type: arrow.PrimitiveTypes.Float64}
    }

As with the other two libraries, we need to specify the type for our field.

  1. Finally, we create our record batch and print it out:
    record := array.NewRecord(arrow.NewSchema(fields, nil),
                              columns, -1)
    defer record.Release()
    fmt.Println(record)

When creating the new record, we have to create a schema from the list of fields. The nil that is passed in there represents that we're not providing any schema-level metadata for this record. Schemas can contain metadata at the top level and each individual field can also contain metadata.

The -1 value that we pass is the numRows argument that also existed in the other libraries. We could have used columns[0].Len() to know the length, but by passing -1, we can have the number of rows lazily determined by the record itself rather than us having to pass it in.

We can see all the same conceptual trappings across different libraries:

  • Record batches are made up of a group of same length arrays with a schema.
  • A schema is a list of fields, where each field contains a name, type, and some metadata.
  • A single array knows its type and has the raw data, but a name and metadata must be tied to a Field object.

I bet you can guess the next example we're going to code up!

Building a struct array, yet again!

Building nested type arrays in Go is closer to the way it is done in the C++ library than in the Python library, but there are still similar steps of creating your struct type, populating each of the constituent arrays, and then finalizing it.

First, we create our type:

archerType := arrow.StructOf(
    arrow.Field{Name: "archer", Type: arrow.BinaryTypes.String},
    arrow.Field{Name: "location", Type: arrow.BinaryTypes.String},
    arrow.Field{Name: "year", Type: arrow.PrimitiveTypes.Int16})

Just like before, there's two ways to go about it:

  • Build each constituent array separately and then join references to them together into a single struct array:
    mem := memory.DefaultAllocator
    namesBldr := array.NewStringBuilder(mem)
    defer namesBldr.Release()
    locationsBldr := array.NewStringBuilder(mem)
    defer locationsBldr.Release()
    yearsBldr := array.NewInt16Builder(mem)
    defer yearsBldr.Release()
    // populate the builders and create the arrays named names,
    // locations, and years
    data := array.NewData(archerType, names.Len(), 
                          []*memory.Buffer{nil},
                          []arrow.ArrayData{names.Data(), 
                          locations.Data(), years.Data()}, 
                          0, 0)
    defer data.Release()
    archers := array.NewStructData(data)
    defer archers.Release()
    fmt.Println(archers)

Breaking down the highlighted line, which is something new, in both the C++ and Go libraries, there is the concept of ArrayData, which is contained within each array. It contains the pieces mentioned before that make up the array: the type, the buffers, the length, the null count, any children arrays, and the optional dictionary. In the highlighted line, we create a new Data object, which has its own reference count, and initialize it with the struct type we created, the length of our struct, and a slice made up of the pointers to the Data objects of each of the field arrays. Remember, struct arrays only have one buffer, a null bitmap, which can be left out if there are no nulls, so we pass a nil buffer as []*memory.Buffer{nil}.

  • The other option is to use a struct builder directly and build up all of the constituent arrays simultaneously. If you don't already have the arrays from something else, this is the easier and more efficient option:
    // archerType is the struct type from before, and lets 
    // assume the data is in a slice of archer structs 
    // named archerList
    bldr := array.NewStructBuilder(memory.DefaultAllocator,
                                   archerType)
    defer bldr.Release()
    f1b := bldr.FieldBuilder(0).(*array.StringBuilder)
    f2b := bldr.FieldBuilder(1).(*array.StringBuilder)
    f3b := bldr.FieldBuilder(2).(*array.Int16Builder)
    for _, ar := range archerList {
         bldr.Append(true)
         f1b.Append(ar.archer)
         f2b.Append(ar.location)
         f3b.Append(ar.year)
    }
    archers := bldr.NewStructArray()
    defer archers.Release()
    fmt.Println(archers)

Just like in the C++ example before, the field builders are owned by the struct builder itself, so we just assert the types of the appropriate builder so we can call Append on them.

In the last highlighted line, we call Append on the struct builder itself so the builder keeps track that it is a non-null struct we are adding. We could pass false there to tell the builder to add a null struct, or we can call the AppendNull function to do the same.

An exercise for you (yes, it's the same one)

Try using the Arrow library for Go to write a function that takes a row-oriented slice of structs and converts them into an Arrow record batch, and vice versa. Use the following type definition:

type datarow struct {
    id             int64
    cost           float64
    costComponents []float64
}

You should probably be pretty good at this by now if you did this exercise in the Python and C++ libraries already!

 

Summary

The goal of this chapter was to explain what Apache Arrow is, get you acquainted with the format, and have you use it in some simple use cases. This knowledge forms the baseline of everything else for us to talk about in the rest of the book!

Just as a reminder, you can check the GitHub repository (https://github.com/PacktPublishing/In-Memory-Analytics-with-Apache-Arrow-) for the solutions to the exercises presented here and for the full code samples to make sure you understand the concepts!

The previous examples and exercises are all fairly trivial and are meant to help reinforce the concepts introduced about the format and the specification while helping you get familiar with using Arrow in code.

In Chapter 2, Working with Key Arrow Specifications, we will introduce how to read your data into the Arrow format, whether it's on your local disk, Hadoop Distributed File System (HDFS), S3, or elsewhere, and integrate Arrow into some of the various processes and utilities you might already use with your data, such as the pandas integration. We will also discover how to pass your data around between services and processes while keeping it in the Arrow format for performance.

Ready? Onward and upward!

 

References

Here's a list of the URL references we made in this chapter since there were quite a lot!

  1. Apache Arrow documentation: https://arrow.apache.org/docs/
  2. Apache License 2.0: https://apache.org/licenses/LICENSE-2.0
  3. C++ Apache Arrow documentation: https://arrow.apache.org/docs/cpp/
  4. C# documentation for Arrow: https://github.com/apache/arrow/blob/master/csharp/README.md
  5. Golang documentation for Arrow: https://pkg.go.dev/github.com/apache/arrow/go/v7/arrow
  6. Java documentation for Arrow: https://arrow.apache.org/docs/java/
  7. JavaScript documentation for Arrow: https://arrow.apache.org/docs/js/
  8. Julia documentation for Arrow: https://arrow.juliadata.org/stable/
  9. Rust documentation for Arrow: https://docs.rs/crate/arrow/
  10. Glib documentation for Arrow: https://arrow.apache.org/docs/c_glib/
  11. MATLAB documentation for Arrow: https://github.com/apache/arrow/blob/master/matlab/README.md
  12. Python documentation for Arrow: https://arrow.apache.org/docs/python/
  13. R documentation for Arrow: https://arrow.apache.org/docs/r/
  14. Ruby documentation for Arrow: https://github.com/apache/arrow/blob/master/ruby/README.md
  15. Implementation matrix for Arrow features across languages: https://arrow.apache.org/docs/status.html
  16. Documentation for using Conda:https://docs.conda.io/projects/conda/en/latest/index.html
  17. Home page for Conda-Forge: https://conda-forge.org
  18. Data type documentation for PyArrow: https://arrow.apache.org/docs/python/api/datatypes.html#api-types
  19. Installation guide for MSYS2: https://www.msys2.org/#installation
  20. Home page for Brew.sh: https://brew.sh/

About the Author

  • Matthew Topol

    Matthew Topol is an Apache Arrow contributor and a principal software architect at FactSet Research Systems, Inc. Since joining FactSet in 2009, Matt has worked in both infrastructure and application development, led development teams, and architected large-scale distributed systems for processing analytics on financial data. In his spare time, Matt likes to bash his head against a keyboard, develop and run delightfully demented games of fantasy for his victims—er—friends, and share his knowledge with anyone interested enough to listen.

    Browse publications by this author
In-Memory Analytics with Apache Arrow
Unlock this book and the full library FREE for 7 days
Start now