Simplifying Data Engineering and Analytics with Delta

Published in: Jul 2022 | Publisher: Packt | ISBN-13: 9781801814867 | Edition: 1st

Author: Anindita Mahapatra

Anindita Mahapatra is a Solutions Architect at Databricks in the data and AI space, helping clients across all industry verticals reap value from their data infrastructure investments. She teaches a data engineering and analytics course at Harvard University as part of its extension school program. She has extensive big data and Hadoop consulting experience from Thinkbig/Teradata, prior to which she managed the development of algorithmic app discovery and promotion for both the Nokia and Microsoft app stores. She holds a master's degree in liberal arts and management from Harvard Extension School, a master's in computer science from Boston University, and a bachelor's in computer science from BITS Pilani, India.

Preface

Delta helps you generate reliable insights at scale and simplifies the architecture around data pipelines, allowing you to focus primarily on refining the use cases at hand. This is especially important because the same architecture is reused when new use cases are onboarded.

In this book, you'll learn the principles of distributed computing, data modeling techniques, big data design patterns, and templates that help solve end-to-end data flow problems for common scenarios and are reusable across use cases and industry verticals. You'll also learn how to recover from errors and the best practices for handling structured, semi-structured, and unstructured data using Delta. Next, you'll get to grips with features such as ACID transactions on big data, disciplined schema evolution, time travel to rewind a dataset to an earlier time or version, and unified batch and streaming capabilities that will help you build agile and robust data products. By the end of this book, you'll be able to use Delta as the foundational block for creating analytics-ready data that fuels all AI/BI use cases.
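As a quick taste of two of those features, here is a minimal PySpark sketch of schema evolution and time travel. It is illustrative only: it assumes a Delta-enabled SparkSession named spark (a setup sketch appears under To get the most out of this book), and the table path is hypothetical.

from pyspark.sql.functions import lit

path = "/tmp/delta/events"  # hypothetical table location

# Version 0: write an initial table.
spark.range(5).write.format("delta").mode("overwrite").save(path)

# Version 1: append rows with a new column; mergeSchema opts in to
# disciplined schema evolution instead of failing on the mismatch.
(spark.range(5, 10)
 .withColumn("source", lit("batch2"))
 .write.format("delta")
 .option("mergeSchema", "true")
 .mode("append")
 .save(path))

# Time travel: rewind the dataset to its original version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()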

Who this book is for

Individuals in the data domain, such as data engineers, data scientists, ML practitioners, and BI analysts working with big data, will be able to put their knowledge to work with this practical guide to executing pipelines and supporting diverse use cases using the Delta protocol. Basic knowledge of SQL, Python programming, and Spark is required to get the most out of this book.

What this book covers

Chapter 1, Introduction to Data Engineering, starts from the adage that data is the new oil. Just as oil must be refined and burned to produce heat and light, data must be harnessed to produce valuable insights, and the quality of those insights depends on the quality of the data. Understanding how to manage data is therefore an important function in every industry vertical. This chapter introduces the fundamental principles of data engineering, the growing industry trend toward data-driven organizations, and how to leverage data operations as a competitive advantage instead of viewing it as a cost center.

Chapter 2, Data Modeling and ETL, covers how leveraging the scalability and elasticity of the cloud lets you turn on compute on demand and shift spending from CAPEX towards OPEX. This chapter introduces common big data design patterns and best practices for modeling big data.

Chapter 3, Delta – The Foundational Block for Big Data, introduces Delta as a file format, points out the features Delta brings to the table over vanilla Parquet, and explains why it is a natural choice for any pipeline. Delta is an overloaded term: it is a protocol first, a table next, and finally a lake!
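To make the relationship to Parquet concrete, here is a short sketch that upgrades an existing Parquet directory to Delta in place. It is illustrative only: the path is hypothetical and spark is assumed to be a Delta-enabled SparkSession.

from delta.tables import DeltaTable

# Convert an existing Parquet directory to a Delta table in place;
# the data files stay, and a Delta transaction log is added alongside.
DeltaTable.convertToDelta(spark, "parquet.`/tmp/parquet/events`")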

Chapter 4, Unifying Batch and Streaming with Delta, covers the trend towards real-time ingestion, analysis, and consumption of data, and treats batching as just one kind of streaming workload. Reader/writer isolation is necessary so that multiple producers and consumers can work independently on the same data assets, with the guarantee that bad or partial data is never presented to the user.
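The following sketch illustrates that unified model: the same Delta path can be read as a static DataFrame or as a stream. Paths are hypothetical and spark is assumed to be Delta-enabled.

# Batch read of a Delta table.
batch_df = spark.read.format("delta").load("/tmp/delta/events")

# Streaming read of the very same table; new commits arrive incrementally.
stream_df = spark.readStream.format("delta").load("/tmp/delta/events")

# Stream the changes into a second Delta table.
query = (stream_df.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/delta/_ckpt")  # hypothetical
         .outputMode("append")
         .start("/tmp/delta/events_copy"))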

Chapter 5, Data Consolidation in Delta Lake, covers how bringing data together from various silos is only the first step towards building a data lake. The real value lies in the increased reliability, quality, and governance that must be enforced to get the most out of the data and infrastructure investment while adding value to any BI or AI use case built on top of it.

Chapter 6, Solving Common Data Pattern Scenarios with Delta, covers common CRUD operations on big data and looks at use cases where they can be applied as a repeatable blueprint.
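One such repeatable blueprint is the upsert, shown here as a hedged sketch using the DeltaTable API. The table path and column names are hypothetical, and the target table is assumed to already exist with columns id and status.

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/tmp/delta/customers")  # hypothetical
updates = spark.createDataFrame(
    [(1, "updated"), (99, "new")], ["id", "status"])

# MERGE: update matching rows and insert the rest, in one ACID transaction.
(target.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())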

Chapter 7, Delta for Data Warehouse Use Cases, covers the journey from databases to data warehouses to data lakes, and finally, to lakehouses. The unification of data platforms has never been more important. Is it possible to house all kinds of use cases with a single architecture paradigm? This chapter focuses on the data handling needs and capability requirements that drive the next round of innovation.

Chapter 8, Handling Atypical Data Scenarios with Delta, covers several conditions, such as data imbalance, skew, and bias, that need to be addressed to ensure data is not only cleansed and transformed per the business requirements but is also conducive to the underlying compute and to the use case at hand. Even after the pipeline logic has been ironed out, statistical attributes of the data must be monitored to ensure that the characteristics the pipeline was originally designed for still hold and that the distributed compute is used to best effect.

Chapter 9, Delta for Reproducible Machine Learning Pipelines, emphasizes that if ML is hard, then reproducible and productionized ML is even harder. A large part of ML is data preparation, and the quality of insights will only be as good as the quality of the data used to build the models. In this chapter, we look at the role of Delta in ensuring reproducible ML.

Chapter 10, Delta for Data Products and Services, covers consumption patterns of data democratization that ensure the curated data gets into the hands of consumers in a timely and secure manner so that insights can be leveraged meaningfully. Data can be served both as a product and as a service, especially in the context of a mesh architecture involving multiple lines of business specializing in different domains.

Chapter 11, Operationalizing Data and ML Pipelines, looks at the aspects that make a mature pipeline production worthy. A lot of the data around us remains unstructured yet carries a wealth of information; integrating it with more structured transactional data is where firms can not only gain competitive intelligence but also begin to build a holistic view of their customers for predictive analytics.

Chapter 12, Optimizing Cost and Performance with Delta, looks at how running a pipeline faster has cost implications that translate directly into infrastructure savings. This applies both to the ETL pipeline that ingests and curates the data and to the consumption pipeline through which stakeholders tap into that curated data. In this chapter, we look at strategies such as file skipping, Z-ordering, small file coalescing, and Bloom filtering to improve query runtime.
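For instance, small file coalescing and Z-ordering can be expressed in a single statement, sketched here. The path is hypothetical, and note that OPTIMIZE/ZORDER first shipped on Databricks before arriving in open source Delta, so availability depends on your version.

# Coalesce small files and co-locate rows on a commonly filtered column,
# improving data skipping for subsequent queries.
spark.sql("OPTIMIZE delta.`/tmp/delta/events` ZORDER BY (id)")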

Chapter 13, Managing Your Data Journey, emphasizes the need for policies around data access and data use that must be honored per regulatory and compliance guidelines. In some industries, it may be necessary to provide evidence of all data access and transformations. Hence, there is a need to set controls in place, detect whether something has been changed, and provide a transparent audit trail.
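Delta's transaction log makes part of that audit trail queryable out of the box, as this sketch shows (the path is hypothetical):

# Each commit records the operation performed, the user, and a timestamp.
spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/events`").show(truncate=False)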

To get the most out of this book

Basic knowledge of SQL, Python programming, and Spark is required to get the most out of this book. Delta is open source and can run both on-prem and in the cloud. Given the rise of cloud data platforms, many of the descriptions and examples are in the context of cloud storage.

Use the following GitHub link for the Delta Lake documentation and quickstart guide to help you set up your environment and become familiar with the necessary APIs: https://github.com/delta-io/delta.
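As a starting point, the sketches in this preface assume a local environment roughly like the following. This is one possible setup using the delta-spark PyPI package; pin versions per the compatibility matrix in the Delta documentation.

# pip install pyspark delta-spark
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (SparkSession.builder
           .appName("delta-quickstart")
           .config("spark.sql.extensions",
                   "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

# Adds the matching Delta Lake jars to the session.
spark = configure_spark_with_delta_pip(builder).getOrCreate()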

Databricks is the original creator of Delta, which was open sourced to the Linux Foundation and is supported by a large user community. Examples in this book cover some Databricks-specific features to provide a complete view of features and capabilities. Newer features continue to be ported from Databricks to open source Delta. Please refer to the proposed roadmap for the feature migration details: https://github.com/delta-io/delta/issues/920.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta.

If there's an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/UI11F.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "There is no need to run the REPAIR TABLE command when you're working with the Delta format".

A block of code is set as follows:

SELECT COUNT(*) FROM some_parquet_table

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "On the other hand, a data swamp is a large body of data that is ungoverned and unreliable."

Tips or Important Notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you've read Simplifying Data Engineering and Analytics with Delta, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.
