In-Memory Analytics with Apache Arrow

Product type: Book
Published in: Jun 2022
Publisher: Packt
ISBN-13: 9781801071031
Pages: 392
Edition: 1st
Author: Matthew Topol

Table of Contents (16 chapters)

Preface
Section 1: Overview of What Arrow Is, its Capabilities, Benefits, and Goals
  Chapter 1: Getting Started with Apache Arrow
  Chapter 2: Working with Key Arrow Specifications
  Chapter 3: Data Science with Apache Arrow
Section 2: Interoperability with Arrow: pandas, Parquet, Flight, and Datasets
  Chapter 4: Format and Memory Handling
  Chapter 5: Crossing the Language Barrier with the Arrow C Data API
  Chapter 6: Leveraging the Arrow Compute APIs
  Chapter 7: Using the Arrow Datasets API
  Chapter 8: Exploring Apache Arrow Flight RPC
Section 3: Real-World Examples, Use Cases, and Future Development
  Chapter 9: Powered by Apache Arrow
  Chapter 10: How to Leave Your Mark on Arrow
  Chapter 11: Future Development and Plans
Other Books You May Enjoy

Chapter 5: Crossing the Language Barrier with the Arrow C Data API

Not to sound like a broken record, but I've said several times already that Apache Arrow is a collection of libraries rather than one single library. This is an important distinction from both a technical standpoint and a logistical one. From a technical standpoint, it means that third-party projects that depend on Arrow don't need to use the entirety of the project; instead, they can link against, embed, or otherwise include only the portions they need. This allows for smaller binaries and a smaller surface area of dependencies. From a logistical standpoint, it allows the Arrow project to pivot easily and move in potentially experimental directions without making large, project-wide changes.

As the goal of the Arrow project is to create a collection of tools and libraries that can be shared across the data analytics and data science ecosystems with a shared in-memory representation, there are...

Technical requirements

This chapter is intended to be highly technical, with various code examples and exercises diving into the usage of the different Arrow libraries. As such, like before, you need access to a computer with the following software to follow along:

  • Python 3+ – the pyarrow module installed and importable
  • Go 1.16+
  • A C++ compiler supporting C++11 or better
  • Your preferred IDE – Sublime, Visual Studio Code, Emacs, and so on

Using the Arrow C data interface

Back in Chapter 2, Working with Key Arrow Specifications, I mentioned the Arrow C data interface in regard to communicating data between Python and Spark processes. At that point, we didn't go into much detail about the interface or what it looks like; now, we will.

Because the Arrow project is fast-moving and evolving, it can sometimes be difficult for other projects to incorporate the Arrow libraries into their work. There's also the case where a lot of existing code needs to be adapted to work with Arrow piecemeal, forcing you to create or even re-implement adapters for interchanging data. To avoid redundant efforts across these situations, the Arrow project defines a very small, stable set of C definitions that can be copied into a project, allowing data to be passed easily across the boundaries of different languages and libraries. For languages and runtimes that aren't C or C++, it should still...

Example use cases

One significant proposed benefit of having the C Data API was to allow applications to implement the API without requiring a dependency on the Arrow libraries. Let's suppose there is an existing computational engine written in C++ that wants to add the ability to return data in the Arrow format without adding a new dependency or having to link with the Arrow libraries. There are many possible reasons why you might want to avoid adding a new dependency to a project. This could range from the development environment to the complexity of deployment mechanisms, but we're not going to focus on that side of it.

Using the C Data API to export Arrow-formatted data

Do you have your development environment all set up for C++? If not, go and do that and come back. You know the drill; I'll wait.

We'll start with a small function to generate a vector of random 32-bit integers, which will act as our sample data. You know how to do that? Well, good....

Streaming across the C Data API

The Arrow project currently considers this particularly useful interface experimental, so technically, the ABI is not guaranteed to be stable, though it is unlikely to change unless user feedback prompts improvements. The C streaming API is a higher-level abstraction built on the initial ArrowSchema and ArrowArray structures to make it easier to stream data within a process across API boundaries. The design of the stream is to expose a chunk-pulling API that pulls blocks of data from the source one at a time, all with the same schema. The structure is defined as follows:

struct ArrowArrayStream {
  // callbacks for stream functionality
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema*);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray*);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  // Release callback and private data
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};

Other use cases

In addition to providing an interface for zero-copy sharing of Arrow data between components, the C Data API can also be used in cases where it may not be feasible to depend on the Arrow libraries directly.

Despite a large number of languages and runtimes sporting implementations of Arrow, there are still languages or environments that do not have Arrow implementations. This is particularly true in organizations with a lot of legacy software and/or specialized environments. A great example of this would be the fact that the dominant programming language in the astrophysical modeling of stars and galaxies is still Fortran! Unsurprisingly, there is not an existing Arrow implementation for Fortran. In these situations, it is often not feasible to rewrite entire code bases so that you can leverage Arrow in a supported language. But with the C Data API, data can be shared from a supported runtime to a pre-existing unsupported code base. Alternatively, you can do the...

Summary

For this foray into the Arrow libraries, we've explored the efficient sharing of data between libraries using the Arrow C data interface. Remember that the motivation for this interface was for zero-copy data sharing between components of the same running process. It's not intended for the C Data API itself to mimic the features available in higher-level languages such as C++ or Python – just to share data. In addition, if you're sharing between different processes or need persistent storage, you should be using the Arrow IPC format that we covered in Chapter 4, Format and Memory Handling.

At this point, we've covered lots of ways to read, write, and transfer Arrow data. But once you have the data in memory, you're going to want to perform operations on it and take advantage of the benefits of in-memory analytics. Rather than having to re-implement the mathematical and relational algorithms yourself, in Chapter 6, Leveraging the Arrow Compute...
