You're reading from In-Memory Analytics with Apache Arrow
Product type: Book · Published in: Jun 2022 · Publisher: Packt · Edition: 1st · ISBN-13: 9781801071031
Author: Matthew Topol

Matthew Topol is an Apache Arrow contributor and a principal software architect at FactSet Research Systems, Inc. Since joining FactSet in 2009, Matt has worked in both infrastructure and application development, led development teams, and architected large-scale distributed systems for processing analytics on financial data. In his spare time, Matt likes to bash his head against a keyboard, develop and run delightfully demented games of fantasy for his victims—er—friends, and share his knowledge with anyone interested enough to listen.

Chapter 4: Format and Memory Handling

I've continuously extolled the virtues of using Arrow as a data interchange technology for tabular data, but how does it stack up against the more common technologies people tend to use for transferring data? When does it make sense to use one over the other for your application programming interface (API)? The answer lies in understanding how different technologies utilize memory. Clever memory management can be the key to performant processes. To decide which format to use for your data, you need to understand which use cases your options were designed for. With that in mind, you can take advantage of the runtime properties of the most common data transport formats, such as Protocol Buffers (Protobuf), JavaScript Object Notation (JSON), and FlatBuffers, when appropriate. By understanding how to utilize memory in your program, you can process very large amounts of data with minimal memory overhead.

We're going to cover the following...

Technical requirements

As before, you'll need an internet-connected computer with the following software so that you can follow along with the code examples here:

  • Python 3.7 or higher, with the pyarrow and pandas modules installed
  • Go 1.16 or higher
  • A C++ compiler capable of compiling C++11 or higher, with the Arrow libraries installed
  • Your preferred IDE, such as Emacs, Sublime, or VS Code
  • A web browser

Storage versus runtime in-memory versus message-passing formats

When we're talking about formats for representing data, there are a few different, complementary, yet competing things we're typically trying to optimize. We can generally (over-)simplify this by talking about three main components, as follows:

  • Size—The final size of the data representation
  • Serialize/deserialize speed—The performance of converting data between the format and something that can be used in memory for computations
  • Ease of use—A catch-all category regarding readability, compatibility, features, and so on

How we choose to optimize between these components is usually going to be heavily dependent upon the use case for that format. When it comes to working with data, there are three high-level use case descriptions I tend to group most situations into: long-term storage, in-memory runtime processing, and message passing. Yes—these groupings are quite...
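
To make the trade-off concrete, here is a minimal sketch (assuming pyarrow and Python's standard json and time modules, with made-up toy data rather than anything from the book) that serializes the same small table as JSON text and as an Arrow IPC stream, then compares the resulting sizes and round-trip times. The numbers are illustrative only; the point is just to show how you could measure the size and serialize-speed components yourself.

import json
import time

import pyarrow as pa
import pyarrow.ipc as ipc

# Build a toy table with 100,000 rows
table = pa.table({
    "id": pa.array(range(100_000), type=pa.int64()),
    "value": pa.array([i * 0.5 for i in range(100_000)]),
})

# JSON: convert to plain Python objects and serialize as text
start = time.perf_counter()
json_bytes = json.dumps(table.to_pydict()).encode("utf-8")
json_secs = time.perf_counter() - start

# Arrow IPC stream: write the record batches directly, no transcoding
start = time.perf_counter()
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
ipc_buf = sink.getvalue()
ipc_secs = time.perf_counter() - start

print(f"JSON:      {len(json_bytes):>12,} bytes in {json_secs:.4f}s")
print(f"Arrow IPC: {ipc_buf.size:>12,} bytes in {ipc_secs:.4f}s")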

Passing your Arrows around

Since Arrow is designed to be easily passed between processes, regardless of whether those processes are on the same machine, the interfaces for passing around record batches are referred to as Arrow's interprocess communication (IPC) libraries. If the processes happen to be on the same machine, then it's possible to share your data without performing any copies at all!

What is this sorcery?!

First things first. There are two types of binary formats defined for sharing record batches between processes—a streaming format and a random access format, as outlined in more detail here (with a short code sketch of both after the list):

  • The streaming format exists for sending a sequence of record batches of an arbitrary length. It must be processed from start to end; you can't get random access to a particular record batch in the stream without processing all of the ones before it.
  • The random access—or file—format is for sharing a known number of record batches. Because it supports random...
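
For illustration, here's a minimal sketch, assuming pyarrow, of writing and then reading a couple of record batches in both formats. The column names are placeholders of my own, not the book's examples; note how the stream reader has to be consumed in order, while the file reader can jump straight to any batch it wants.

import pyarrow as pa
import pyarrow.ipc as ipc

batch = pa.record_batch({
    "trip_id": pa.array([1, 2, 3], type=pa.int64()),
    "total_amount": pa.array([12.5, 7.25, 30.0]),
})

# Streaming format: a sequence of batches, processed front to back
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
    writer.write_batch(batch)
stream_buf = sink.getvalue()

reader = ipc.open_stream(stream_buf)
for b in reader:                        # no random access; iterate in order
    print(b.num_rows)

# File (random access) format: a known number of batches plus a footer index
sink = pa.BufferOutputStream()
with ipc.new_file(sink, batch.schema) as writer:
    writer.write_batch(batch)
    writer.write_batch(batch)
file_buf = sink.getvalue()

reader = ipc.open_file(file_buf)
print(reader.num_record_batches)        # 2
second = reader.get_batch(1)            # jump straight to any batch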

Learning about memory cartography

One draw of distributed systems such as Apache Spark is the ability to process very large datasets quickly. Sometimes, the dataset is so large that it can't even fit entirely in memory on a single machine! Having a distributed system that can break up the data into chunks and process them in parallel is then necessary since no individual machine would be able to load the whole dataset in memory at one time to operate on it. But what if you could process a huge, multiple-GB file while using almost no RAM at all? That's where memory mapping comes in.

Let's look to our NYC Taxi dataset once again for help with demonstrating this concept. The file named yellow_tripdata_2015-01.csv is approximately 1.8 GB in size, which makes it perfect to use as an example. By now, you should easily be able to read that CSV file in as an Arrow table and look at the schema. Now, let's say we wanted to calculate the mean of the values in the total_amount...
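
As a rough sketch of what that can look like with pyarrow: the assumption here is that you first convert the CSV to an Arrow IPC file on disk (a one-time cost, and the .arrow filename is my own placeholder), after which you can memory-map that file and compute the mean of total_amount without pulling the whole table into your process's RAM.

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.csv as csv
import pyarrow.ipc as ipc

# One-time conversion: CSV -> Arrow IPC file format on disk
table = csv.read_csv("yellow_tripdata_2015-01.csv")
with pa.OSFile("yellow_tripdata_2015-01.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Later (or in another process): memory-map the IPC file. The column
# buffers are referenced directly from the mapping rather than copied
# into process memory.
with pa.memory_map("yellow_tripdata_2015-01.arrow", "r") as source:
    mapped = ipc.open_file(source).read_all()
    print(pc.mean(mapped["total_amount"]))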

Summary

If you are building data pipelines and large systems, regardless of whether you are a data scientist or a software architect, you're going to have to make a lot of decisions about which formats to use for the various pieces of the system. You always want to choose the best format for the use case, not just pick the latest trend and apply it everywhere. Many people hear about Arrow and either think they need to use it everywhere for everything, or wonder why we needed yet another data format. The key takeaway I want you to understand is the difference in the problems each of these formats is trying to solve.

If you need longer-term persistent storage, either on disk or in the cloud, you typically want a storage format such as Parquet, ORC, or CSV. The primary access cost for these use cases is I/O time, so you want to optimize to reduce that based on your access patterns. If you're passing small messages around, such as metadata or control...
