
You're reading from In-Memory Analytics with Apache Arrow

Product type: Book
Published in: Jun 2022
Publisher: Packt
ISBN-13: 9781801071031
Edition: 1st Edition
Author (1)
Matthew Topol

Matthew Topol is an Apache Arrow contributor and a principal software architect at FactSet Research Systems, Inc. Since joining FactSet in 2009, Matt has worked in both infrastructure and application development, led development teams, and architected large-scale distributed systems for processing analytics on financial data. In his spare time, Matt likes to bash his head against a keyboard, develop and run delightfully demented games of fantasy for his victims—er—friends, and share his knowledge with anyone interested enough to listen.

Chapter 9: Powered by Apache Arrow

Apache Arrow is becoming the industry standard as more and more projects adopt or support it for their internal and external communication formats. In this chapter, we're going to take a look at a few projects that use Arrow in different ways. Arrow's flexibility lets it serve a variety of use cases in different environments, and many developers are taking advantage of that. Of course, Arrow is used in many analytical engine projects, but it is also used in other contexts ranging from machine learning (ML) to data visualization in the browser.

With new projects and uses popping up all the time, it only makes sense to give a brief overview of a selection of those projects. In this chapter, you're going to see several ways Arrow is being used in the wild. These include the following:

  • A distributed SQL query engine named Dremio Sonar, which we just...

Swimming in data with Dremio Sonar

The roots of Arrow can be found in the ValueVector objects from the Apache Drill project, a SQL query engine for Hadoop, NoSQL, and cloud storage. Dremio Sonar was originally built out of Apache Drill, and Dremio's founders co-created Arrow. Dremio Sonar uses Arrow as the internal memory representation for its query and calculation engine, which helps power its performance. Since its inception, Dremio's engineers have made many contributions to the Arrow project, resulting in significant innovations. First, let's look at the architecture used and where Arrow fits in.
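Before turning to the architecture, it may help to see roughly what a columnar in-memory representation looks like. The following is a minimal sketch that uses only the Python standard library; it is not Arrow's API and not Dremio Sonar's code. The names rows, to_column, is_valid, and column_sum, as well as the sample data, are hypothetical. It packs a nullable column into one contiguous value buffer plus a validity bitmap with least-significant-bit numbering, which is the general shape the Arrow format uses for fixed-width primitive arrays.

```python
from array import array

# Hypothetical row-oriented records, as an engine might receive them.
rows = [
    {"ticker": "AAA", "price": 10.5},
    {"ticker": "BBB", "price": None},  # missing value
    {"ticker": "CCC", "price": 7.25},
]


def to_column(values):
    """Pack optional floats into a contiguous buffer and a validity bitmap."""
    # Null slots still occupy space in the value buffer; 0.0 is a placeholder.
    data = array("d", (0.0 if v is None else v for v in values))
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            # Set bit i, counting from the least-significant bit of each byte.
            bitmap[i // 8] |= 1 << (i % 8)
    return data, bitmap


def is_valid(bitmap, i):
    """Return True if slot i holds a real value rather than a null."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


def column_sum(data, bitmap):
    """Sum the non-null slots by scanning the value buffer once."""
    return sum(x for i, x in enumerate(data) if is_valid(bitmap, i))


prices, validity = to_column([r["price"] for r in rows])
total = column_sum(prices, validity)
print(total)
```

A scan such as column_sum reads a single contiguous buffer instead of following one object per row, which is the kind of access pattern a query and calculation engine can exploit.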

Clarifying Dremio Sonar's architecture

As a distributed query engine, Dremio Sonar can be deployed in many different environments and scenarios. However, at its core, it has a pretty simple architecture, as shown in Figure 9.1. Being distributed, it can scale horizontally by increasing the number of Coordinators and Executors that handle the planning and...

Spicing up your ML workflows

Among the fields of engineering that work with very large datasets, ML and AI workflows process some of the largest. However, if your full-time job isn't ML, and you don't have the support of a dedicated ML team, it can be very difficult to create an application that can learn and adapt. This is where a group of engineers decided to step in and make it easier for developers to create intelligent, adaptive applications. Spice AI (https://spiceai.io) is, at the time of writing, a venture-capital-funded start-up working on a platform that makes it easier for developers to create AI-driven applications that can adapt and learn. They've open-sourced a product on GitHub called Spice.ai (https://github.com/spiceai/spiceai). It is currently in alpha development and utilizes Apache Arrow, Arrow Flight, and Dremio Sonar for its data processing and transport (https...

Arrow in the browser using JavaScript

One of the most common ways to deploy an application to consumers today is as a web application. You can provide an application intended for mobile phones, tablets, or laptop/desktop browsers all in one location. When it comes to building modern interactive applications on the web, you can be sure that JavaScript, TypeScript, or both are going to be involved somewhere. Now that we've covered some examples of services and systems utilizing Arrow, we'll cover a couple of projects that put Arrow front and center, right in the browser.

Gaining a little perspective

In Chapter 3, Data Science with Apache Arrow, we briefly touched on a library named Perspective in the context of a widget for Jupyter notebooks. Perspective was originally developed at J.P. Morgan and was then open-sourced under the Apache License 2.0 through the Fintech Open Source Foundation (FINOS). Perspective is written in C++ and compiled...

Summary

Whatever the shape or form of your data, if you're going to process or manipulate it in any way, it pays to see whether Arrow can enhance your workflows. In this chapter, we've seen relational databases, analytical engines, and visualization libraries all powered by Apache Arrow. In each case, Arrow was being leveraged for a smaller memory footprint and generally better resource utilization than previous approaches.

Every industry has a need for processing large amounts of data extremely quickly, from brand-new scientific research to manufacturing metrics. If you work with data processing, you can probably leverage Arrow somewhere in your pipeline. If you don't believe me, have a gander at the projects listed on the official Apache Arrow website as powered by Arrow: https://arrow.apache.org/powered_by/. You'll find every project mentioned in this chapter on that list, along with many...

