You're reading from Distributed Data Systems with Azure Databricks

Product type: Book
Published in: May 2021
Reading Level: Beginner
Publisher: Packt
ISBN-13: 9781838647216
Edition: 1st Edition
Author (1)
Alan Bernardo Palacio

Alan Bernardo Palacio is a data scientist and an engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst and Young and Globant, and now holds a data engineer position at Ebiquity Media, helping the company create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, participated as a founder in startups, and later earned a Master's degree from the faculty of Mathematics at the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.

Using Bloom filters

Bloom filters are a way of efficiently filtering records in a database based on a condition. They are probabilistic structures used to test whether an element is a member of a set. A lookup can return a false positive (reporting an element as present when it is not) but never a false negative. These filters were developed as a mathematical construct based on hashing techniques, to be applied when the amount of data is too large to be scanned in full.

Delta Lake provides us with the ability to apply Bloom filters on our queries to further improve performance. We will see how they work at a basic level and how they can be applied in Delta Lake.
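As a rough illustration of what applying a Bloom filter in Delta Lake can look like, the sketch below builds a `CREATE BLOOMFILTER INDEX` statement as a string. The helper `bloom_filter_index_sql`, the table name `events`, and the column name `device_id` are hypothetical and chosen only for this example. The statement shape (with the `fpp` and `numItems` options) follows the Databricks SQL command as we understand it, so check it against the documentation for your runtime version before relying on it.

```python
def bloom_filter_index_sql(table: str, column: str, fpp: float, num_items: int) -> str:
    """Build a CREATE BLOOMFILTER INDEX statement for a single column.

    Hypothetical helper for illustration only. Identifiers are interpolated
    without escaping, so pass trusted table and column names.
    """
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp (false positive probability) must be between 0 and 1")
    if num_items <= 0:
        raise ValueError("num_items must be positive")
    return (
        f"CREATE BLOOMFILTER INDEX ON TABLE {table} "
        f"FOR COLUMNS({column} OPTIONS (fpp={fpp}, numItems={num_items}))"
    )


# Assumed example values: a table named `events` with a `device_id` column.
stmt = bloom_filter_index_sql("events", "device_id", fpp=0.1, num_items=50_000_000)

# In a Databricks notebook, where `spark` is predefined, the statement
# could then be submitted with: spark.sql(stmt)
```

A lower `fpp` reduces false positives at the cost of a larger index, so the value is a trade-off rather than a fixed recommendation.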

Understanding Bloom filters

As mentioned in the introduction to this section, Bloom filters are probabilistic data structures used to test whether an element belongs to a set. The structure is a fixed-length bit array that is populated using a hash function, which maps the information into ones and zeros. The length of the array depends on the number...
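The bit-array structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by Delta Lake: the class name `BloomFilter`, the use of salted SHA-256 to derive several hash positions, and the sizes in the usage lines are all assumptions made for this example.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Minimal Bloom filter: a fixed-length bit array set through k hashes."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        if num_bits <= 0 or num_hashes <= 0:
            raise ValueError("num_bits and num_hashes must be positive")
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch readable; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive k positions by salting SHA-256 with the hash index
        # (an assumption of this sketch, not a required scheme).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        # Set every position the item hashes to.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(num_bits=1024, num_hashes=3)
for name in ["alice", "bob", "carol"]:
    bf.add(name)

print(bf.might_contain("alice"))  # True: added items are never reported absent
print(bf.might_contain("dave"))   # usually False, but a false positive is possible
```

Because `add` only ever sets bits and never clears them, every item that was added still finds all of its positions set, which is why false negatives cannot occur. Two different items can share positions, which is where false positives come from.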
