Simplifying Data Engineering and Analytics with Delta

Chapter 10: Delta for Data Products and Services

"At the heart of every product person, there's a desire to make someone's life easier or simpler. If we listen to the customer and give them what they need, they'll reciprocate with love and loyalty to your brand."

– Francis Brown, Product Development Manager at Alaska Airlines

In the previous chapters, we saw how Delta helps not only with data engineering tasks but also with machine learning (ML) tasks, because data is at the core of all ML initiatives. Data as a Service (DaaS) refers to data products being made available to users on demand. The popularity of microservices and APIs has made it easier to grant access to data on the basis of need and privileges. These users can be within the organization or external vendors and partners. The advantage is that the consumption pattern is greatly simplified and standardized, while the internal complexities of pipelines and data stores are abstracted...

Technical requirements

To follow along with this chapter, make sure you have the code and instructions detailed in the book's GitHub repository:

https://github.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/tree/main/Chapter10

Let's get started!

DaaS

Every company wants to be data-driven, but merely collecting data does not make an enterprise data-driven; having actionable, customer-centric insights does. If there is a data provider that can do the heavy lifting to make consumption easy, then every business can use its services to become more competitive and truly data-driven. Being a DaaS provider is hard because taming raw data is a herculean task, and it takes a lot of planning and execution to pull it off. The typical activities involved in managing data include the following:

  • Data collection to ensure quality and timely data
  • Data aggregation to summarize data in well-known dimensions for actionable insights and to avoid analysis paralysis
  • Data correlation for proper data and risk modeling to use the predictive value inherent in the datasets
  • Qualitative analysis to ensure that there is statistical significance and insights generated from the data that can be relied upon
  • Advanced BI and AI analytics to...

The need for data democratization

Data democratization refers to the process of making data available to all relevant stakeholders, either to consume as is or to add further value to. This is critical for all businesses, as it drives agile, data-driven decision making and helps them remain competitive using actual metrics and data-centric strategies, while also opening up monetization and innovation opportunities. Let's take a look at a few concrete examples:

  • Healthcare and manufacturing: A new category of medical imaging device is introduced into the market, and many vendors and hospitals buy these devices. The images they produce, their quality, and their predictive power in helping doctors detect the onset of tumors and cancers under certain positions and circumstances all generate data points that need to be analyzed to determine which positions and settings lead to the best diagnoses. The more data, the better the analysis and the quicker the feedback loop to the manufacturer to...

Delta for unstructured data

The vast majority of data in the world is unstructured: analysts estimate that it accounts for around 80 percent of all the data organizations generate or otherwise acquire while doing business. Video, audio, and image files, as well as log files, sensor readings, and social media posts, all qualify as unstructured data, and it is growing at a faster pace than structured data. Object storage technologies have made it cheaper, more scalable, and more reliable to store all data types, which has been largely responsible for the increased support of a wide variety of use cases and, in turn, a surge in deep learning models. Typical use cases include the following:

  • Image classification
  • Voice recognition
  • Anomaly detection
  • Recommendation engine
  • Sentiment analysis
  • Video analysis

Spark supports an image data source as well as a binary file data source. The image source has a few limitations around decoding image files during the creation of the DataFrame...
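
To make this concrete, the following is a minimal PySpark sketch (not from the book) of loading raw image files with Spark's binaryFile data source and persisting them to a Delta table. It assumes a Spark session with Delta Lake support configured (for example, on Databricks); the input path and Delta location are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unstructured-to-delta").getOrCreate()

# Each file becomes a row with path, modificationTime, length, and content
# (the raw bytes), sidestepping the decoding limitations of the image source.
raw_images = (
    spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.jpg")       # only pick up JPEG files
    .option("recursiveFileLookup", "true")   # walk nested folders
    .load("/data/raw/images/")               # hypothetical input path
)

# Persist the binary payload and file metadata as a Delta table for
# downstream deep learning pipelines to consume.
raw_images.write.format("delta").mode("overwrite").save("/delta/images_bronze")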

Data mashups using Delta

Data mashup refers to combining different datasets to provide a unified data view for analytics. A simple example of this is a BI dashboard that combines a consumer's interactions with a brand. Browsing and purchase history are transactional, structured data; log data is semi-structured; and tweets, support cases around product inquiries or complaints, and social media posts and comments (including text and images) are unstructured data that can provide insights into the voice of the customer and user sentiment.

The important elements can be extracted, aggregated, and used for prediction, then brought together with actual transactional data to anticipate the consumer's next move. The ability to use SQL to query complex aspects of unstructured data and join it with structured data on a primary key, such as the customer ID, is very powerful and enables self-service capabilities. Marketing dollars spent on personalized advertisements and product recommendations can then be purposefully...
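
As a hedged illustration of such a mashup, the following Spark SQL sketch (reusing the spark session from the earlier sketch) joins aggregated structured transactions with sentiment scores that have already been extracted from unstructured sources, using the customer ID as the join key. The table locations, column names, and the upstream sentiment-scoring step are assumptions for illustration, not details from the book:

# Aggregate each side before joining so per-customer totals are not inflated
# by join fan-out, then persist the unified view as a new Delta table.
customer_360 = spark.sql("""
    SELECT t.customer_id, t.total_spend, s.avg_sentiment
    FROM (
        SELECT customer_id, SUM(purchase_amount) AS total_spend
        FROM delta.`/delta/transactions`
        GROUP BY customer_id
    ) t
    LEFT JOIN (
        SELECT customer_id, AVG(sentiment_score) AS avg_sentiment
        FROM delta.`/delta/social_sentiment`
        GROUP BY customer_id
    ) s
    ON t.customer_id = s.customer_id
""")

customer_360.write.format("delta").mode("overwrite").save("/delta/customer_360")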

Facilitating data sharing with Delta

JDBC/ODBC connections or HTTP connections via REST APIs are good for sharing modest amounts of data but may become a bottleneck for larger datasets. Consider the scenario of sharing curated data with external vendors or partners. There are some firms whose business model is centered around data sharing, such as S&P, Bloomberg, FactSet, Nasdaq, and SafeGraph. They aim to be the source of truth for financial datasets, which every other financial institution will be interested in consuming for downstream analysis and to augment its own datasets. Wouldn't it be nice not to have to copy the data multiple times?

It is best to use cloud storage access directly to avoid unnecessary platform-related bottlenecks. That is what Delta Sharing attempts to do: provide an open standard to securely and seamlessly share large volumes of Parquet/Delta data with a wide variety of consumers, along with an easy way to govern and audit access. Consumers can be from pandas...
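
On the consumer side, a minimal sketch using the open source delta-sharing Python connector might look like the following; the profile file is issued by the data provider, and the share, schema, and table names below are hypothetical:

import delta_sharing

# The profile file holds the sharing server endpoint and the bearer token
# issued by the data provider.
profile_file = "config.share"
table_url = profile_file + "#vendor_share.finance.daily_prices"

# Load the shared Delta table directly into a pandas DataFrame; the consumer
# reads from the provider's cloud storage without maintaining its own copy.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())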

Summary

DaaS is a natural extension of software as a service (SaaS), where a data product is made available to a qualified user on demand in a self-service style. Organizations value quality data insights and hence are willing to trade them for tangible benefits. Data piracy and leaks are challenges that need to be considered thoroughly, as one significant breach could potentially finish an organization. This is not restricted to just structured data; it applies to any data type at any scale, which makes it an exceedingly hard problem.

In this chapter, we looked at Delta's capabilities for handling more complex unstructured data and how to integrate it with the rest of the structured data, which not only adds value but also democratizes data by allowing ubiquitous access via SQL or other languages. We also looked at the importance of data harmonization in the context of governance and a single unified view of normalized, quality data for the organization...
