You're reading from  Distributed Data Systems with Azure Databricks

Product type: Book
Published in: May 2021
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781838647216
Edition: 1st Edition
Author (1)
Alan Bernardo Palacio

Alan Bernardo Palacio is a data scientist and engineer with extensive experience across different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst and Young and Globant, and now holds a data engineer position at Ebiquity Media, helping the company create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, took part in startups as a founder, and later earned a Master's degree from the Faculty of Mathematics at the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.

Summary

Implementing a data lake is a paradigm change within an organization. Delta Lake supports this change when we are dealing with streams of data from different sources, when the schema of the data might change over time, and when we need a system that is robust against data mishandling and easy to audit.

Delta Lake fills the gap between the functionality of a data warehouse and the benefits of a data lake, while also overcoming most of the challenges that data lakes face.

Schema validation keeps our ETL pipelines reliable when the structure of incoming data changes. It raises an exception when a write does not match the schema of the target table, which keeps mismatched data from contaminating that table. If the change was intentional, we can use schema evolution to update the table schema instead.
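The behavior described above can be illustrated with a small pure-Python model. This is only a sketch of the idea, not Delta Lake's implementation or API: `ToyTable`, `SchemaMismatchError`, and the `merge_schema` flag are hypothetical names chosen for this example (the flag is loosely modeled on the `mergeSchema` write option that Delta Lake offers for schema evolution).

```python
from typing import Dict, List

Schema = Dict[str, type]


class SchemaMismatchError(Exception):
    """Raised when an incoming batch does not match the table schema."""


class ToyTable:
    """Hypothetical in-memory table that validates a batch before writing it."""

    def __init__(self, schema: Schema) -> None:
        self.schema: Schema = dict(schema)
        self.rows: List[dict] = []

    def append(self, rows: List[dict], merge_schema: bool = False) -> None:
        # Validate the whole batch first, so that a bad batch is rejected
        # before any of its rows reach the table.
        new_columns: Schema = {}
        for row in rows:
            for column, value in row.items():
                expected = self.schema.get(column, new_columns.get(column))
                if expected is None:
                    if not merge_schema:
                        raise SchemaMismatchError(f"unexpected column: {column!r}")
                    # Schema evolution: remember the new column and its type.
                    new_columns[column] = type(value)
                elif not isinstance(value, expected):
                    raise SchemaMismatchError(
                        f"column {column!r} expects {expected.__name__}, "
                        f"got {type(value).__name__}"
                    )
        # Only now apply the changes: evolve the schema, then write the rows.
        self.schema.update(new_columns)
        self.rows.extend(rows)


table = ToyTable({"id": int, "name": str})
table.append([{"id": 1, "name": "first"}])

try:
    # The extra "country" column is rejected by default.
    table.append([{"id": 2, "name": "second", "country": "AR"}])
except SchemaMismatchError as error:
    print(f"write rejected: {error}")

# The same batch is accepted once the change is declared intentional.
table.append([{"id": 2, "name": "second", "country": "AR"}], merge_schema=True)
```

Validating before writing is the design choice that matters here: the rejected batch leaves the table exactly as it was, so the pipeline fails loudly instead of storing mismatched rows.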

Time travel lets us access historical versions of data, thanks to Delta Lake's ordered transaction log, which keeps track of every operation performed on Delta tables. This is useful when we need to define pipelines that need to query different...
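To make the link between an ordered transaction log and time travel concrete, here is a minimal pure-Python sketch. It is a toy model under stated assumptions, not Delta Lake's log format or API: `ToyVersionedTable`, `commit`, `history`, and the `version_as_of` parameter are hypothetical names (the parameter is loosely modeled on Delta Lake's `versionAsOf` read option), and only append and overwrite operations are modeled.

```python
from typing import List, Optional, Tuple


class ToyVersionedTable:
    """Hypothetical table whose state is derived from an ordered commit log."""

    def __init__(self) -> None:
        # Entry i of the log is the commit that produced version i.
        self._log: List[Tuple[str, List[dict]]] = []

    def commit(self, operation: str, rows: List[dict]) -> int:
        """Record an operation and return the version number it created."""
        if operation not in ("append", "overwrite"):
            raise ValueError(f"unsupported operation: {operation!r}")
        self._log.append((operation, list(rows)))
        return len(self._log) - 1

    def history(self) -> List[str]:
        """List every recorded operation in commit order, for auditing."""
        return [f"{version}: {op}" for version, (op, _) in enumerate(self._log)]

    def read(self, version_as_of: Optional[int] = None) -> List[dict]:
        """Rebuild the table as of a version by replaying the log up to it."""
        latest = len(self._log) - 1
        version = latest if version_as_of is None else version_as_of
        if version_as_of is not None and not 0 <= version <= latest:
            raise ValueError(f"version {version} does not exist")
        rows: List[dict] = []
        for operation, batch in self._log[: version + 1]:
            if operation == "overwrite":
                rows = []  # an overwrite discards everything written before it
            rows.extend(batch)
        return rows


table = ToyVersionedTable()
table.commit("append", [{"id": 1}])     # version 0
table.commit("append", [{"id": 2}])     # version 1
table.commit("overwrite", [{"id": 3}])  # version 2

print(table.read())                 # latest state
print(table.read(version_as_of=1))  # state before the overwrite
print(table.history())
```

Because the log is never modified in place, an earlier version can always be rebuilt by replaying the log up to that point, and the same log doubles as an audit trail.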
