
You're reading from In-Memory Analytics with Apache Arrow

Product type: Book
Published in: Jun 2022
Publisher: Packt
ISBN-13: 9781801071031
Edition: 1st Edition
Author: Matthew Topol

Matthew Topol is an Apache Arrow contributor and a principal software architect at FactSet Research Systems, Inc. Since joining FactSet in 2009, Matt has worked in both infrastructure and application development, led development teams, and architected large-scale distributed systems for processing analytics on financial data. In his spare time, Matt likes to bash his head against a keyboard, develop and run delightfully demented games of fantasy for his victims—er—friends, and share his knowledge with anyone interested enough to listen.

Chapter 7: Using the Arrow Datasets API

In the current ecosystem of data lakes and lakehouses, many datasets are no longer a single file but huge collections of files spread across partitioned directory structures. To facilitate this workflow, the Arrow libraries provide an API for easily interacting with this kind of structured and unstructured data. It is called the Datasets API, and it is designed to do much of the heavy lifting of querying such datasets for you.

The Datasets API provides a series of utilities for easily interacting with large, distributed, and possibly partitioned datasets that are spread across multiple files. It also integrates very easily with the Compute APIs we covered previously, in Chapter 6, Leveraging the Arrow Compute APIs.

In this chapter, we will learn how to use the Arrow Datasets API for efficient querying of multifile, tabular datasets regardless of their location or format. We will also understand how to use the dataset classes and...

Technical requirements

As before, this chapter has a lot of code examples and exercises to drive home an understanding of using these libraries. You'll need an internet-connected computer with the following to try out the examples and follow along:

  • Python 3+: with the pyarrow module, including its dataset submodule, installed (a quick import check follows this list).
  • A C++ compiler supporting C++11 or higher: With the Arrow libraries installed and able to be included and linked against.
  • Your preferred coding IDE, such as Emacs, Vim, Sublime, or VS Code.
  • As before, you can find the full sample code in the accompanying GitHub repository at https://github.com/PacktPublishing/In-Memory-Analytics-with-Apache-Arrow-.
  • We're also going to utilize the NYC taxi dataset located in the public AWS S3 bucket at s3://ursa-labs-taxi-data/.
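
If you want to confirm your Python environment is ready before following along, a minimal check like the one below (not from the book; the printed version will depend on your installation) verifies that pyarrow and its dataset submodule are importable:

# A quick sanity check that pyarrow and its dataset submodule import cleanly
import pyarrow as pa
import pyarrow.dataset as ds   # raises ImportError if the dataset module isn't available

print(pa.__version__)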

Querying multifile datasets

Note

While this section details the Datasets API in the Arrow libraries, it's important to note that this API is still considered experimental as of the time of writing. As a result, the APIs described are not yet guaranteed to be stable between version upgrades of Arrow and may change in some ways. Always check the documentation for the version of Arrow you're using. That said, the API is unlikely to change drastically unless requested by users, so it's being included due to its extreme utility.

To facilitate very quick querying of data, modern datasets are often partitioned into multiple files across multiple directories. Many engines and utilities, such as Apache Hive, Dremio Sonar, Presto, and many AWS services, read and write data in this layout or otherwise take advantage of it. The Arrow Datasets library provides functionality for working with these sorts of tabular datasets, such as the following (a brief Python sketch of dataset discovery follows this list):

  • Providing a...
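
Although the excerpt cuts the list short, a minimal sketch of the core idea (pointing the library at a directory of files and letting it discover and unify them) might look like this in the Python bindings; the directory name parquet_dataset and the Hive-style partitioning are assumptions for illustration only:

import pyarrow.dataset as ds

# Discover every Parquet file under the (hypothetical) directory, inferring a unified
# schema and treating key=value directory names as Hive-style partition columns
dataset = ds.dataset("parquet_dataset", format="parquet", partitioning="hive")
print(dataset.files)    # the individual files that make up the dataset
print(dataset.schema)   # the unified schema inferred across all of them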

Filtering data programmatically

In the previous example, we created a scanner and then read the entire dataset. This time, we're going to muck around with the builder first to give it a filter to use before it starts reading the data. We'll also use the Project function to control what columns get read. Since we're using Parquet files, we can reduce the IO and memory usage by only reading the columns we want rather than reading all of them; we just need to tell the scanner that that's what we want.

In the previous chapter, we learned about the Arrow Compute API as a library for performing various operations and computations on Arrow-formatted data. It also includes objects and functionality for defining complex expressions that reference fields and call functions. These expression objects can then be used in conjunction with the scanners to define simple or complex filters for our data. Before we dig into the scanner, let's take a quick detour to cover the...
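
As a rough illustration of where this is headed, the Python bindings express the same idea very compactly. The following sketch (with assumed column names a, b, and c, not the book's actual schema) pushes a filter expression and a column projection down into the scan:

import pyarrow.dataset as ds

dataset = ds.dataset("parquet_dataset", format="parquet")

# Only the projected columns are read from the Parquet files, and the filter
# expression is pushed down so non-matching row groups can be skipped entirely
table = dataset.to_table(
    columns=["a", "b"],            # projection: read just these columns
    filter=ds.field("c") == 2,     # expression referencing a field
)
print(table.num_rows)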

Using the Datasets API in Python

Before you ask: yes, the Datasets API is available in Python too! Let's do a quick rundown of the same features we just covered, but using the pyarrow Python module instead of C++. Since the majority of data scientists use Python for their work, it makes sense to show how to use these APIs in Python for easy integration with existing workflows and utilities. And since Python's syntax is more concise than C++'s, the code is much shorter, so we can run through everything quickly in the following sections.

Creating our sample dataset

We can start by creating a sample dataset similar to the three-column one we used for the C++ examples, but this time in Python:

>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pathlib
>>> import numpy as np
>>> import os
>>> base = pathlib.Path(os.getcwd())
>>> (base / "parquet_dataset...
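
The excerpt is cut off above. Purely as a hedged sketch of where that snippet is likely heading (the column names and values here are assumptions, not the book's exact data), creating a small two-file, three-column Parquet dataset might look like this:

>>> # illustrative only: create the directory and write two Parquet files into it
>>> (base / "parquet_dataset").mkdir(exist_ok=True)
>>> table = pa.table({"a": range(10),
...                   "b": np.random.randn(10),
...                   "c": [1, 2] * 5})
>>> pq.write_table(table.slice(0, 5), base / "parquet_dataset" / "data1.parquet")
>>> pq.write_table(table.slice(5), base / "parquet_dataset" / "data2.parquet")

Splitting the rows across two files is what lets the Datasets API treat the whole directory as a single logical dataset in the examples that follow.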

Streaming results

You'll recall from the beginning of this chapter, in the Querying multifile datasets section, that I mentioned this was the solution for when you had multiple files and the dataset was potentially too large to fit in memory all at one time. So far, the examples we've seen used the ToTable function to completely materialize the results in memory as a single Arrow table. If your results are too large to fit into memory all at one time, this obviously won't work. Even if your results could fit into memory, it's not the most efficient way to perform the query anyway. In addition to the ToTable (C++) or to_table (Python) function we've been calling, the scanner also exposes functions that return iterators for streaming record batches from the query.
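
In pyarrow terms, a rough sketch of that streaming approach (reusing the hypothetical parquet_dataset directory and column names from earlier) looks like the following:

import pyarrow.dataset as ds

dataset = ds.dataset("parquet_dataset", format="parquet")

# to_batches() yields RecordBatch objects one at a time instead of materializing
# the entire result set as a single Table in memory
for batch in dataset.scanner(columns=["a", "b"]).to_batches():
    print(batch.num_rows)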

To demonstrate the streaming, let's use a public AWS S3 bucket hosted by Ursa Labs, which contains about 10 years of NYC taxi trip record data in Parquet format. The URI for the dataset is s3...
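
The URI is the one listed back in the Technical requirements section (s3://ursa-labs-taxi-data/). A hedged sketch of pointing the Python bindings at that bucket might look like the following; the anonymous access, the us-east-2 region, and the passenger_count column name are assumptions about the public dataset rather than details taken from the book:

import pyarrow.dataset as ds
import pyarrow.fs as fs

# Anonymous access to the public bucket; the region is an assumption
s3 = fs.S3FileSystem(anonymous=True, region="us-east-2")
dataset = ds.dataset("ursa-labs-taxi-data/", format="parquet", filesystem=s3)

# Stream the results batch by batch rather than pulling roughly ten years of
# trip data into memory at once
for batch in dataset.scanner(columns=["passenger_count"]).to_batches():
    print(batch.num_rows)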

Summary

By composing these various pieces (the C Data API, the Compute API, and the Datasets API) and gluing some infrastructure on top, anyone should be able to build a rudimentary query and analysis engine that is fairly performant right away. The functionality provided abstracts away much of the tedious work of interacting with different file formats and data locations, exposing a single interface that lets you get straight to building the specific logic you need. Once again, it's the fact that all of these pieces are built on top of Arrow, an underlying format that is particularly efficient for these operations, that allows them to interoperate so easily.

So, where do we go from here?

Well, you might remember in Chapter 3, Data Science with Apache Arrow, when discussing Open Database Connectivity (ODBC), I alluded to the idea of something that might be able to replace ODBC and JDBC as universal protocols...
