You're reading from In-Memory Analytics with Apache Arrow

Product type: Book
Published in: Jun 2022
Publisher: Packt
ISBN-13: 9781801071031
Edition: 1st Edition
Author (1)
Matthew Topol

Matthew Topol is an Apache Arrow contributor and a principal software architect at FactSet Research Systems, Inc. Since joining FactSet in 2009, Matt has worked in both infrastructure and application development, led development teams, and architected large-scale distributed systems for processing analytics on financial data. In his spare time, Matt likes to bash his head against a keyboard, develop and run delightfully demented games of fantasy for his victims—er—friends, and share his knowledge with anyone interested enough to listen.

Querying multifile datasets

Note

While this section details the Datasets API in the Arrow libraries, this API is still considered experimental at the time of writing. As a result, the APIs described are not guaranteed to be stable between Arrow versions and may change. Always check the documentation for the version of Arrow you're using. That said, the API is unlikely to change drastically unless users request it, and it is included here because it is so useful.

To make queries fast, modern datasets are often partitioned into multiple files across multiple directories. Many engines and utilities, such as Apache Hive, Dremio Sonar, Presto, and many AWS services, take advantage of this layout or read and write data in it. The Arrow Datasets library provides functionality for working with these sorts of tabular datasets, such as the following:

  • Providing a...
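
To make the idea of a multifile, multi-directory dataset concrete, here is a minimal sketch in plain Python. It uses only the standard library and is not the Arrow Datasets API. It assumes a Hive-style `key=value` directory layout, and the helper names (`write_partitioned`, `read_partitioned`) and the sample rows are hypothetical, introduced only for illustration. It shows why partitioning speeds up queries: a filter on the partition column can skip whole directories without opening their files.

```python
import csv
import tempfile
from pathlib import Path


def write_partitioned(base_dir, rows, partition_key):
    """Write rows into Hive-style directories, e.g. base_dir/year=2021/part-0.csv."""
    groups = {}
    for row in rows:
        groups.setdefault(row[partition_key], []).append(row)
    for value, group in groups.items():
        part_dir = Path(base_dir) / f"{partition_key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # The partition column is encoded in the directory name, not in the file.
        columns = [c for c in group[0] if c != partition_key]
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)


def read_partitioned(base_dir, partition_key, wanted_value):
    """Read only the matching partition; return (rows, number_of_files_opened)."""
    rows, files_opened = [], 0
    for part_dir in sorted(Path(base_dir).iterdir()):
        key, _, value = part_dir.name.partition("=")
        if key != partition_key or value != str(wanted_value):
            continue  # Prune: skip this directory without touching its files.
        for path in sorted(part_dir.glob("*.csv")):
            files_opened += 1
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    # Restore the partition column from the directory name.
                    row[partition_key] = value
                    rows.append(row)
    return rows, files_opened


# Hypothetical sample data, invented for this sketch.
sample_rows = [
    {"year": 2020, "ticker": "AAA", "price": 10.0},
    {"year": 2021, "ticker": "AAA", "price": 12.5},
    {"year": 2021, "ticker": "BBB", "price": 7.25},
    {"year": 2022, "ticker": "BBB", "price": 8.0},
]

base_dir = tempfile.mkdtemp()
write_partitioned(base_dir, sample_rows, "year")
rows_2021, files_opened = read_partitioned(base_dir, "year", 2021)
print(files_opened, [r["ticker"] for r in rows_2021])
```

Only one of the three partition directories is read for the `year` filter. Values read back from CSV are strings in this sketch; a real dataset library would also track a schema and column types.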