
You're reading from In-Memory Analytics with Apache Arrow

Product type: Book
Published in: Jun 2022
Publisher: Packt
ISBN-13: 9781801071031
Edition: 1st Edition
Author: Matthew Topol

Matthew Topol is an Apache Arrow contributor and a principal software architect at FactSet Research Systems, Inc. Since joining FactSet in 2009, Matt has worked in both infrastructure and application development, led development teams, and architected large-scale distributed systems for processing analytics on financial data. In his spare time, Matt likes to bash his head against a keyboard, develop and run delightfully demented games of fantasy for his victims—er—friends, and share his knowledge with anyone interested enough to listen.

Chapter 7: Using the Arrow Datasets API

In the current ecosystem of data lakes and lakehouses, many datasets are no longer a single file but huge collections of files spread across partitioned directory structures. To facilitate this workflow, the Arrow libraries provide an API for easily interacting with this kind of structured and unstructured data. It is called the Datasets API, and it is designed to do much of the heavy lifting of querying such datasets for you.

The Datasets API provides a series of utilities for easily interacting with large, distributed, and possibly partitioned datasets that are spread across multiple files. It also integrates very easily with the Compute APIs we covered previously, in Chapter 6, Leveraging the Arrow Compute APIs.

In this chapter, we will learn how to use the Arrow Datasets API for efficient querying of multifile, tabular datasets regardless of their location or format. We will also understand how to use the dataset classes and...

Technical requirements

As before, this chapter has a lot of code examples and exercises to drive home an understanding of using these libraries. You'll need an internet-connected computer with the following to try out the examples and follow along:

  • Python 3+: with the pyarrow module, including its dataset submodule, installed (a quick import check follows this list).
  • A C++ compiler supporting C++11 or higher: With the Arrow libraries installed and able to be included and linked against.
  • Your preferred coding IDE, such as Emacs, Vim, Sublime, or VS Code.
  • As before, you can find the full sample code in the accompanying GitHub repository at https://github.com/PacktPublishing/In-Memory-Analytics-with-Apache-Arrow-.
  • We're also going to utilize the NYC taxi dataset located in the public AWS S3 bucket at s3://ursa-labs-taxi-data/.
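
If you want to confirm your Python environment is ready before following along, a minimal check like the one below (not from the book; the printed version will depend on your installation) verifies that pyarrow and its dataset submodule are importable:

# A quick sanity check that pyarrow and its dataset submodule import cleanly
import pyarrow as pa
import pyarrow.dataset as ds   # raises ImportError if the dataset module isn't available

print(pa.__version__)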

Querying multifile datasets

Note

While this section details the Datasets API in the Arrow libraries, it's important to note that this API is still considered experimental as of the time of writing. As a result, the APIs described are not yet guaranteed to be stable between version upgrades of Arrow and may change in some ways. Always check the documentation for the version of Arrow you're using. That said, the API is unlikely to change drastically unless requested by users, so it's being included due to its extreme utility.

To facilitate very quick querying of data, modern datasets are often partitioned into multiple files across multiple directories. Many engines and utilities, such as Apache Hive, Dremio Sonar, Presto, and many AWS services, read and write data in this layout or otherwise take advantage of it. The Arrow Datasets library provides functionality for working with these sorts of tabular datasets, such as the following (a brief Python sketch of dataset discovery follows this list):

  • Providing a...
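
Although the excerpt cuts the list short, a minimal sketch of the core idea (pointing the library at a directory of files and letting it discover and unify them) might look like this in the Python bindings; the directory name parquet_dataset and the Hive-style partitioning are assumptions for illustration only:

import pyarrow.dataset as ds

# Discover every Parquet file under the (hypothetical) directory, inferring a unified
# schema and treating key=value directory names as Hive-style partition columns
dataset = ds.dataset("parquet_dataset", format="parquet", partitioning="hive")
print(dataset.files)    # the individual files that make up the dataset
print(dataset.schema)   # the unified schema inferred across all of them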

Filtering data programmatically

In the previous example, we created a scanner and then read the entire dataset. This time, we're going to muck around with the builder first to give it a filter to use before it starts reading the data. We'll also use the Project function to control what columns get read. Since we're using Parquet files, we can reduce the IO and memory usage by only reading the columns we want rather than reading all of them; we just need to tell the scanner that that's what we want.

In the previous chapter, we learned about the Arrow Compute API as a library for performing various operations and computations on Arrow-formatted data. It also includes objects and functionality for defining complex expressions that reference fields and call functions. These expression objects can then be used in conjunction with the scanners to define simple or complex filters for our data. Before we dig into the scanner, let's take a quick detour to cover the...
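
As a rough illustration of where this is headed, the Python bindings express the same idea very compactly. The following sketch (with assumed column names a, b, and c, not the book's actual schema) pushes a filter expression and a column projection down into the scan:

import pyarrow.dataset as ds

dataset = ds.dataset("parquet_dataset", format="parquet")

# Only the projected columns are read from the Parquet files, and the filter
# expression is pushed down so non-matching row groups can be skipped entirely
table = dataset.to_table(
    columns=["a", "b"],            # projection: read just these columns
    filter=ds.field("c") == 2,     # expression referencing a field
)
print(table.num_rows)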

Using the Datasets API in Python

Before you ask: yes, the Datasets API is available in Python too! Let's do a quick rundown of the same features we just covered, but using the pyarrow Python module instead of C++. Since the majority of data scientists use Python for their work, it makes sense to show how to use these APIs in Python for easy integration with existing workflows and utilities. And since Python's syntax is more concise than C++'s, the code is much shorter, so we can run through everything quickly in the following sections.

Creating our sample dataset

We can start by creating a sample dataset similar to the three-column one we used for the C++ examples, but this time in Python:

>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pathlib
>>> import numpy as np
>>> import os
>>> base = pathlib.Path(os.getcwd())
>>> (base / "parquet_dataset...
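
The excerpt is cut off above. Purely as a hedged sketch of where that snippet is likely heading (the column names and values here are assumptions, not the book's exact data), creating a small two-file, three-column Parquet dataset might look like this:

>>> # illustrative only: create the directory and write two Parquet files into it
>>> (base / "parquet_dataset").mkdir(exist_ok=True)
>>> table = pa.table({"a": range(10),
...                   "b": np.random.randn(10),
...                   "c": [1, 2] * 5})
>>> pq.write_table(table.slice(0, 5), base / "parquet_dataset" / "data1.parquet")
>>> pq.write_table(table.slice(5), base / "parquet_dataset" / "data2.parquet")

Splitting the rows across two files is what lets the Datasets API treat the whole directory as a single logical dataset in the examples that follow.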

Streaming results

You'll recall from the beginning of this chapter, in the Querying multifile datasets section, that I mentioned this was the solution for when you had multiple files and the dataset was potentially too large to fit in memory all at one time. So far, the examples we've seen used the ToTable function to completely materialize the results in memory as a single Arrow table. If your results are too large to fit into memory all at one time, this obviously won't work. Even if your results could fit into memory, it's not the most efficient way to perform the query anyway. In addition to the ToTable (C++) or to_table (Python) function we've been calling, the scanner also exposes functions that return iterators for streaming record batches from the query.
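
In pyarrow terms, a rough sketch of that streaming approach (reusing the hypothetical parquet_dataset directory and column names from earlier) looks like the following:

import pyarrow.dataset as ds

dataset = ds.dataset("parquet_dataset", format="parquet")

# to_batches() yields RecordBatch objects one at a time instead of materializing
# the entire result set as a single Table in memory
for batch in dataset.scanner(columns=["a", "b"]).to_batches():
    print(batch.num_rows)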

To demonstrate the streaming, let's use a public AWS S3 bucket hosted by Ursa Labs, which contains about 10 years of NYC taxi trip record data in Parquet format. The URI for the dataset is s3...
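
The URI is the one listed back in the Technical requirements section (s3://ursa-labs-taxi-data/). A hedged sketch of pointing the Python bindings at that bucket might look like the following; the anonymous access, the us-east-2 region, and the passenger_count column name are assumptions about the public dataset rather than details taken from the book:

import pyarrow.dataset as ds
import pyarrow.fs as fs

# Anonymous access to the public bucket; the region is an assumption
s3 = fs.S3FileSystem(anonymous=True, region="us-east-2")
dataset = ds.dataset("ursa-labs-taxi-data/", format="parquet", filesystem=s3)

# Stream the results batch by batch rather than pulling roughly ten years of
# trip data into memory at once
for batch in dataset.scanner(columns=["passenger_count"]).to_batches():
    print(batch.num_rows)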

Summary

By composing these various pieces (the C Data API, the Compute API, and the Datasets API) and gluing some infrastructure on top, anyone should be able to build a rudimentary query and analysis engine that is fairly performant right away. The functionality provided abstracts away much of the tedious work of interacting with different file formats and data locations, exposing a single interface that lets you get straight to building the specific logic you need. Once again, it's the fact that all of these pieces are built on top of Arrow, an underlying format that is particularly efficient for these operations, that allows them to interoperate so easily.

So, where do we go from here?

Well, you might remember in Chapter 3, Data Science with Apache Arrow, when discussing Open Database Connectivity (ODBC), I alluded to the idea of something that might be able to replace ODBC and JDBC as universal protocols...
