In-Memory Analytics with Apache Arrow

Product type: Book

Published: Jun 2022

Publisher: Packt

ISBN-13: 9781801071031

Pages: 392

Edition: 1st Edition

Concepts: Big Data

Author: Matthew Topol

Table of Contents (16 chapters)

Preface

Section 1: Overview of What Arrow Is, its Capabilities, Benefits, and Goals

Chapter 1: Getting Started with Apache Arrow

Chapter 2: Working with Key Arrow Specifications

Chapter 3: Data Science with Apache Arrow

Section 2: Interoperability with Arrow: pandas, Parquet, Flight, and Datasets

Chapter 4: Format and Memory Handling

Chapter 5: Crossing the Language Barrier with the Arrow C Data API

Chapter 6: Leveraging the Arrow Compute APIs

Chapter 7: Using the Arrow Datasets API

Chapter 8: Exploring Apache Arrow Flight RPC

Section 3: Real-World Examples, Use Cases, and Future Development

Chapter 9: Powered by Apache Arrow

Chapter 10: How to Leave Your Mark on Arrow

Chapter 11: Future Development and Plans

Other Books You May Enjoy

Preface

To quote a famous blue hedgehog, "Gotta Go Fast!" When it comes to data, speed is important. Whether you're collecting or analyzing data, or developing utilities for others to do so, performance and efficiency are going to be huge factors in your technology choices, not just in the efficiency of the software itself, but also in development time. You need the right tools and the right technology, or you're dead in the water.

The Apache Arrow ecosystem is developer-centric, and this book is no different. Start by understanding what Arrow is and how it works, then learn how to use it in your projects. You'll find code examples, explanations, and diagrams here, all with the express purpose of helping you learn. You'll integrate your data sources with Python libraries such as pandas or NumPy and use Arrow Flight to create efficient data services.

With real-world datasets, you'll learn how to leverage Apache Arrow with Apache Spark and other technologies. Apache Arrow's format is language-independent and organized so that analytical operations can be performed extremely quickly on modern CPU and GPU hardware. Join the growing industry adoption of this open source data format and save yourself valuable development time when creating high-performance, memory-efficient analytical workflows.

This book has been a labor of love to share knowledge. I hope you learn a lot from it! I sure did when writing it.

Who this book is for

This book is for developers, data analysts, and data scientists looking to explore the capabilities of Apache Arrow from the ground up. It will also be useful for engineers who are building utilities for data analytics or query engines, or otherwise working with tabular data, regardless of the language they are programming in.

What this book covers

Chapter 1, Getting Started with Apache Arrow, introduces you to the basic concepts underpinning Apache Arrow. It introduces and explains the Arrow format and the data types it supports, along with how they are represented in memory. Afterward, you'll set up your development environment and run some simple code examples showing the basic operation of Arrow libraries.

Chapter 2, Working with Key Arrow Specifications, continues your introduction to Apache Arrow by explaining how to read both local and remote data files using different formats. You'll learn how to integrate Arrow with the Python pandas library and how to utilize the zero-copy aspects of Arrow to share memory for performance.

Chapter 3, Data Science with Apache Arrow, wraps up our initial overview by providing specific examples to enhance data science workflows. This will include practical examples of using Arrow with Apache Spark and Jupyter, along with using Arrow-formatted data to create a chart. This will be followed by a brief discussion on Open Database Connectivity (ODBC) and an end-to-end demonstration of ingesting Arrow-formatted data into an Elasticsearch index and then querying it.

Chapter 4, Format and Memory Handling, discusses the relationships between Apache Arrow and Parquet, Feather, Protocol Buffers, JSON, and CSV data, along with when and why to use these different formats. Following this, the Arrow IPC format is introduced and described, along with an explanation of using memory mapping to further improve performance.

Chapter 5, Crossing the Language Barrier with the Arrow C Data API, introduces the titular C Data API for efficiently passing Apache Arrow data between different language runtimes. This chapter will cover the struct definitions utilized for this interface along with describing use cases that make it beneficial.

Chapter 6, Leveraging the Arrow Compute APIs, describes how to utilize the Arrow Compute APIs in both C++ and Python. You'll learn when and why you should use the Compute APIs to perform analytics rather than implement something yourself.

Chapter 7, Using the Arrow Datasets API, demonstrates querying, filtering, and otherwise interacting with multi-file datasets that can potentially span multiple sources. Partitioned datasets are also covered, along with utilizing the Arrow Compute API to perform streaming filtering and other operations on the data.

Chapter 8, Exploring Apache Arrow Flight RPC, examines the Flight RPC protocol and its benefits. You will be walked through building a simple Flight server and client in multiple languages to produce and consume tabular data.

Chapter 9, Powered by Apache Arrow, provides a few examples of current real-world usage of Arrow, such as Dremio and Spice.ai.

Chapter 10, How to Leave Your Mark on Arrow, provides a brief introduction to contributing to open source in general, but specifically, how to contribute to the Arrow project itself. You will be walked through finding starter issues and setting up your first pull request to make a contribution, and what to expect when doing so. To that end, this chapter also contains various instructions on locally building the Arrow C++, Python, and Go libraries to test your contribution.

Chapter 11, Future Development and Plans, wraps up the book by examining the features that are still in heavy development at the time of writing. Flight SQL, DataFusion, and Substrait are all briefly explained here, along with what to look forward to and, potentially, contribute to. Finally, there are some parting words and a challenge from me to you.

To get the most out of this book

It is assumed that you have a basic understanding of writing code in at least one of C++, Python, or Go to benefit from and use the code snippets. You should know how to compile and run code in the desired language. Some familiarity with the basic concepts of data analysis will help you to get the most out of this book. Beyond this, concepts such as tabular data and installing software on your machine are assumed to be understood.

The sample data is in the book's GitHub repository. You'll need to use Git Large File Storage (LFS) or a browser to download the large data files. There are also a couple of large sample data files in a publicly accessible AWS S3 bucket. The book will provide a link to download the files when necessary. Code examples are provided in C++, Python, and Go.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Take your time, enjoy, and experiment in all kinds of ways, and please, have fun with the exercises.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/In-Memory-Analytics-with-Apache-Arrow-. Any updates to the code will be made in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801071031_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "When we call ListFlights, it returns a stream that we can then use to retrieve each one of our FlightInfo objects."

A block of code is set as follows:

...
    // add these imports
    "fmt"
    "github.com/apache/arrow/go/v8/arrow/arrio"
...

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

...
flights = list(client.list_flights(b'2009'))
data = client.do_get(flights[0].endpoints[0].ticket)
print(data.read_all())

Any command-line input or output is written as follows:

$ pip install pyodbc

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "After clicking the button, you'll have a window pop open; click the Save button in the bottom-right corner."

Tips or Important Notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read In-Memory Analytics with Apache Arrow, we’d love to hear your thoughts! Please visit the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.