Simplifying Data Engineering and Analytics with Delta

Published in: Jul 2022 | Publisher: Packt | ISBN-13: 9781801814867 | Edition: 1st

Author: Anindita Mahapatra

Anindita Mahapatra is a Solutions Architect at Databricks in the data and AI space, helping clients across all industry verticals reap value from their data infrastructure investments. She teaches a data engineering and analytics course at Harvard University as part of its extension school program. She has extensive big data and Hadoop consulting experience from Thinkbig/Teradata, prior to which she managed the development of algorithmic app discovery and promotion for both the Nokia and Microsoft app stores. She holds a master's degree in liberal arts and management from Harvard Extension School, a master's in computer science from Boston University, and a bachelor's in computer science from BITS Pilani, India.

Preface

Delta helps you generate reliable insights at scale and simplifies the architecture around data pipelines, allowing you to focus primarily on refining the use cases at hand. This is especially important because the same architecture is reused when new use cases are onboarded.

In this book, you'll learn the principles of distributed computing, data modeling techniques, big data design patterns, and templates that help solve end-to-end data flow problems for common scenarios and are reusable across use cases and industry verticals. You'll also learn how to recover from errors and the best practices for handling structured, semi-structured, and unstructured data using Delta. Next, you'll get to grips with features such as ACID transactions on big data, disciplined schema evolution, time travel to rewind a dataset to an earlier time or version, and unified batch and streaming capabilities that will help you build agile and robust data products. By the end of this book, you'll be able to use Delta as the foundational block for creating analytics-ready data that fuels all AI/BI use cases.
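As a quick taste of two of those features, here is a minimal PySpark sketch of schema evolution and time travel. It is illustrative only: it assumes a Delta-enabled SparkSession named spark (a setup sketch appears under To get the most out of this book), and the table path is hypothetical.

from pyspark.sql.functions import lit

path = "/tmp/delta/events"  # hypothetical table location

# Version 0: write an initial table.
spark.range(5).write.format("delta").mode("overwrite").save(path)

# Version 1: append rows with a new column; mergeSchema opts in to
# disciplined schema evolution instead of failing on the mismatch.
(spark.range(5, 10)
 .withColumn("source", lit("batch2"))
 .write.format("delta")
 .option("mergeSchema", "true")
 .mode("append")
 .save(path))

# Time travel: rewind the dataset to its original version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()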

Who this book is for

Individuals in the data domain, such as data engineers, data scientists, ML practitioners, and BI analysts working with big data, will be able to put their knowledge to work with this practical guide to executing pipelines and supporting diverse use cases using the Delta protocol. Basic knowledge of SQL, Python programming, and Spark is required to get the most out of this book.

What this book covers

Chapter 1, Introduction to Data Engineering, starts from the adage that data is the new oil. Just as oil must be refined and burned to produce heat and light, data must be harnessed to produce valuable insights, and the quality of those insights depends on the quality of the data. Understanding how to manage data is therefore an important function in every industry vertical. This chapter introduces the fundamental principles of data engineering, the growing industry trend toward data-driven organizations, and how to leverage data operations as a competitive advantage instead of viewing it as a cost center.

Chapter 2, Data Modeling and ETL, covers how leveraging the scalability and elasticity of the cloud lets you turn on compute on demand and shift spending from CAPEX towards OPEX. This chapter introduces common big data design patterns and best practices for modeling big data.

Chapter 3, Delta – The Foundational Block for Big Data, introduces Delta as a file format, points out the features Delta brings to the table over vanilla Parquet, and explains why it is a natural choice for any pipeline. Delta is an overloaded term: it is a protocol first, a table next, and finally a lake!
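To make the relationship to Parquet concrete, here is a short sketch that upgrades an existing Parquet directory to Delta in place. It is illustrative only: the path is hypothetical and spark is assumed to be a Delta-enabled SparkSession.

from delta.tables import DeltaTable

# Convert an existing Parquet directory to a Delta table in place;
# the data files stay, and a Delta transaction log is added alongside.
DeltaTable.convertToDelta(spark, "parquet.`/tmp/parquet/events`")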

Chapter 4, Unifying Batch and Streaming with Delta, covers the trend towards real-time ingestion, analysis, and consumption of data, and treats batching as just one kind of streaming workload. Reader/writer isolation is necessary so that multiple producers and consumers can work independently on the same data assets, with the guarantee that bad or partial data is never presented to the user.
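The following sketch illustrates that unified model: the same Delta path can be read as a static DataFrame or as a stream. Paths are hypothetical and spark is assumed to be Delta-enabled.

# Batch read of a Delta table.
batch_df = spark.read.format("delta").load("/tmp/delta/events")

# Streaming read of the very same table; new commits arrive incrementally.
stream_df = spark.readStream.format("delta").load("/tmp/delta/events")

# Stream the changes into a second Delta table.
query = (stream_df.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/delta/_ckpt")  # hypothetical
         .outputMode("append")
         .start("/tmp/delta/events_copy"))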

Chapter 5, Data Consolidation in Delta Lake, covers how bringing data together from various silos is only the first step towards building a data lake. The real value lies in the increased reliability, quality, and governance that must be enforced to get the most out of the data and infrastructure investment while adding value to any BI or AI use case built on top of it.

Chapter 6, Solving Common Data Pattern Scenarios with Delta, covers common CRUD operations on big data and looks at use cases where they can be applied as a repeatable blueprint.
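One such repeatable blueprint is the upsert, shown here as a hedged sketch using the DeltaTable API. The table path and column names are hypothetical, and the target table is assumed to already exist with columns id and status.

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/tmp/delta/customers")  # hypothetical
updates = spark.createDataFrame(
    [(1, "updated"), (99, "new")], ["id", "status"])

# MERGE: update matching rows and insert the rest, in one ACID transaction.
(target.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())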

Chapter 7, Delta for Data Warehouse Use Cases, covers the journey from databases to data warehouses to data lakes, and finally, to lakehouses. The unification of data platforms has never been more important. Is it possible to house all kinds of use cases with a single architecture paradigm? This chapter focuses on the data handling needs and capability requirements that drive the next round of innovation.

Chapter 8, Handling Atypical Data Scenarios with Delta, covers several conditions, such as data imbalance, skew, and bias, that need to be addressed to ensure data is not only cleansed and transformed per the business requirements but is also conducive to the underlying compute and to the use case at hand. Even after the pipeline logic has been ironed out, statistical attributes of the data must be monitored to ensure that the characteristics the pipeline was originally designed for still hold and that the distributed compute is used to best effect.

Chapter 9, Delta for Reproducible Machine Learning Pipelines, emphasizes that if ML is hard, then reproducible and productionized ML is even harder. A large part of ML is data preparation, and the quality of insights will only be as good as the quality of the data used to build the models. In this chapter, we look at the role of Delta in ensuring reproducible ML.

Chapter 10, Delta for Data Products and Services, covers consumption patterns of data democratization that ensure the curated data gets into the hands of consumers in a timely and secure manner so that insights can be leveraged meaningfully. Data can be served both as a product and as a service, especially in the context of a mesh architecture involving multiple lines of business specializing in different domains.

Chapter 11, Operationalizing Data and ML Pipelines, looks at the aspects that make a mature pipeline production worthy. A lot of the data around us remains unstructured yet carries a wealth of information; integrating it with more structured transactional data is where firms can not only gain competitive intelligence but also begin to build a holistic view of their customers for predictive analytics.

Chapter 12, Optimizing Cost and Performance with Delta, looks at how running a pipeline faster has cost implications that translate directly into infrastructure savings. This applies both to the ETL pipeline that ingests and curates the data and to the consumption pipeline through which stakeholders tap into that curated data. In this chapter, we look at strategies such as file skipping, Z-ordering, small file coalescing, and Bloom filtering to improve query runtime.
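For instance, small file coalescing and Z-ordering can be expressed in a single statement, sketched here. The path is hypothetical, and note that OPTIMIZE/ZORDER first shipped on Databricks before arriving in open source Delta, so availability depends on your version.

# Coalesce small files and co-locate rows on a commonly filtered column,
# improving data skipping for subsequent queries.
spark.sql("OPTIMIZE delta.`/tmp/delta/events` ZORDER BY (id)")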

Chapter 13, Managing Your Data Journey, emphasizes the need for policies around data access and data use that must be honored per regulatory and compliance guidelines. In some industries, it may be necessary to provide evidence of all data access and transformations. Hence, there is a need to set controls in place, detect whether something has been changed, and provide a transparent audit trail.
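Delta's transaction log makes part of that audit trail queryable out of the box, as this sketch shows (the path is hypothetical):

# Each commit records the operation performed, the user, and a timestamp.
spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/events`").show(truncate=False)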

To get the most out of this book

Basic knowledge of SQL, Python programming, and Spark is required to get the most out of this book. Delta is open source and can run both on-prem and in the cloud. Given the rise of cloud data platforms, many of the descriptions and examples are in the context of cloud storage.

Use the following GitHub link for the Delta Lake documentation and quickstart guide to help you set up your environment and become familiar with the necessary APIs: https://github.com/delta-io/delta.
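As a starting point, the sketches in this preface assume a local environment roughly like the following. This is one possible setup using the delta-spark PyPI package; pin versions per the compatibility matrix in the Delta documentation.

# pip install pyspark delta-spark
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (SparkSession.builder
           .appName("delta-quickstart")
           .config("spark.sql.extensions",
                   "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

# Adds the matching Delta Lake jars to the session.
spark = configure_spark_with_delta_pip(builder).getOrCreate()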

Databricks is the original creator of Delta, which was open sourced to the Linux Foundation and is supported by a large user community. Examples in this book cover some Databricks-specific features to provide a complete view of features and capabilities. Newer features continue to be ported from Databricks to open source Delta. Please refer to the proposed roadmap for the feature migration details: https://github.com/delta-io/delta/issues/920.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta.

If there's an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/UI11F.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "There is no need to run the REPAIR TABLE command when you're working with the Delta format".

A block of code is set as follows:

SELECT COUNT(*) FROM some_parquet_table

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "On the other hand, a data swamp is a large body of data that is ungoverned and unreliable."

Tips or Important Notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you've read Simplifying Data Engineering and Analytics with Delta, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.
