
You're reading from  Redis Stack for Application Modernization

Product type: Book
Published in: Dec 2023
Publisher: Packt
ISBN-13: 9781837638185
Edition: 1st
Authors (2):
Luigi Fugaro

Luigi Fugaro's first encounter with computers was in the early 80s when he was a kid. He started with a Commodore Vic-20, passing through a Sinclair, a Commodore 64, and an Atari ST 1040, where he spent days and nights giving breath mints to Otis. In 1998, he started his career as a webmaster doing HTML, JavaScript, applets, and some graphics with Paint Shop Pro. He then switched to Delphi and Visual Basic before moving on to Java projects. He has developed all kinds of web applications, dealing with both backend and frontend frameworks. In 2012, he started working for Red Hat, where he is now an architect in the EMEA Middleware team. He has authored WildFly Cookbook and Mastering JBoss Enterprise Application Platform 7 for Packt Publishing.

Mirko Ortensi

Mirko Ortensi earned a degree in Electronic Engineering and a Master's degree in Software Engineering. Mirko's career has spanned several roles from Software Engineering to Customer Support, particularly centered around distributed database systems. As a Senior Technical Enablement Architect at Redis, Mirko shares technical knowledge about Redis's products and services.


Understanding Probabilistic Data Structures

The probabilistic data structures of Redis Stack are packed into a set of capabilities also known as Bloom filters. Such structures owe their name to Burton Howard Bloom, the computer scientist who introduced the concept of a probabilistic data structure in 1970 to address the problem of verifying whether an item belongs to a set. By using hashed representations of the data, a sufficiently good approximation can be achieved: false positives are allowed (a lookup may report that an item belongs to the set when it does not), but false negatives are not (when a lookup reports that an item is absent, it is definitely absent).

The Bloom filter has since become a widely used data structure in computer science. It is used in various applications, such as spell-checking, network routing, content filtering, and DNA sequence analysis.

Probabilistic data structures process large volumes of data in real time with minimal memory requirements. This chapter covers several types of probabilistic...

Technical requirements

To follow along with the examples in this chapter, you will need to install Redis Stack Server 7.2 or later in your development environment. Alternatively, you can create a Redis Cloud subscription with the free plan and use a managed Redis Stack database.

HyperLogLog

HyperLogLog was the first probabilistic data structure available in Redis. It builds on the LogLog algorithm published in 2003 by Marianne Durand and Philippe Flajolet, and was introduced in 2007 by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Jérémie Meunier. HyperLogLog is an efficient solution for counting the number of unique occurrences in a set, which, in practice, is the cardinality of the set. This is especially useful when perfect accuracy is not required: the count is probabilistic but has a very low error. HyperLogLog does not store the items themselves: everything that is added passes through a hash function, so it is not possible to inspect or remove elements once they have been counted.

HyperLogLog is the simplest of the probabilistic data structures, and using it is as simple as adding elements with the PFADD command:

PFADD mortensi.com:1hnsn9n6kb:012023 IGaJ9c5KqYHFUEogCQWc
PFADD mortensi.com:1hnsn9n6kb:012023 XUq9br38-SYFOVpfN-vq
PFADD mortensi.com:1sneb4qq3t:012023 bmGteHMoGpAKbtk4X-9Y
PFADD mortensi.com:1sneb4qq3t:022023 IGaJ9c5KqYHFUEogCQWc

Using the former commands, we can track the...
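To make the mechanics concrete, here is a minimal Python sketch of the HyperLogLog idea: a hash splits each item into a register index and a bit pattern, each register keeps the maximum rank (position of the first 1-bit) it has seen, and the cardinality is estimated from the harmonic mean of the registers. This is an illustration only, not RedisBloom's implementation; the class name, the choice of MD5, and the parameter p=10 are all assumptions made for the example.

```python
import hashlib
import math

class ToyHyperLogLog:
    """Toy HyperLogLog with m = 2**p registers (illustrative only)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int(hashlib.md5(item.encode()).hexdigest(), 16)
        idx = h & (self.m - 1)      # low p bits select a register
        rest = h >> self.p          # remaining bits feed the rank
        rank = 1                    # position of the first 1-bit
        while rest & 1 == 0 and rank < 64:
            rank += 1
            rest >>= 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias correction, m >= 128
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            # small-range correction: fall back to linear counting
            est = self.m * math.log(self.m / zeros)
        return round(est)

hll = ToyHyperLogLog()
for i in range(1000):
    hll.add(f"user:{i}")
    hll.add(f"user:{i}")   # duplicates do not change the registers
```

Note that adding the same item twice leaves the registers untouched, which is exactly why HyperLogLog counts unique occurrences: the estimate stays close to 1,000 despite 2,000 additions.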

Bloom filter

A Bloom filter is one of the probabilistic data structures supported by Redis Stack and is used to test whether an item is a member of a set. It is valuable as a data deduplication solution – that is, for filtering duplicated data out of a set. It is a memory-efficient and fast data structure that uses a bit array and a set of hash functions to determine whether an item is in the set. Testing for membership can return “possibly in the set” or “definitely not in the set,” which means that false positives are possible, but false negatives are not. Approximation is part of everyday life, and digital computing is no different: think of the lossy compression algorithms for images (JPEG) or audio files (MP3) – we can still enjoy the media without even noticing the loss of quality. A Bloom filter simplifies the management and speed of solutions that require the existence...
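The bit array and hash functions can be sketched in a few lines of Python. This is a conceptual illustration under assumed parameters (a 1,024-bit array, 4 probe positions derived from one MD5 digest via double hashing), not RedisBloom's implementation:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: a bit array plus k hash-derived positions per item."""

    def __init__(self, size=1024, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k positions from one digest via double hashing
        digest = hashlib.md5(item.encode()).hexdigest()
        h1 = int(digest[:16], 16)
        h2 = int(digest[16:], 16)
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "possibly in the set"; False means "definitely not"
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ["redis", "bloom", "filter"]:
    bf.add(word)
```

Because adding an item only ever sets bits, an added item can never be reported absent – that is the "no false negatives" guarantee; a never-added item can, rarely, hit only set bits and be reported present.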

Cuckoo filters

Cuckoo filters, published in 2014, are an evolution of Bloom filters that addresses their limitations, especially around collision handling. The filter inherits its name from the cuckoo bird, famous for laying its eggs in the nests of other bird species and leaving them to be raised by the host bird; in doing so, the cuckoo pushes the other eggs or chicks out of the nest. This eviction behavior inspired the implementation of Cuckoo filters. Unlike Bloom filters, Cuckoo filters use a fingerprint-based technique that handles collisions quickly and efficiently and reduces the rate of false positives while maintaining space requirements comparable to Bloom filters; they also support deleting items.

Instead of storing the hashed version of an item as Bloom filters do, Cuckoo filters use multiple locations, or buckets, to store the fingerprint representation of the item. When a new item is added to the filter and a collision occurs at a candidate bucket, the existing item...
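The bucket-and-eviction scheme can be sketched as follows. This is a simplified, illustrative partial-key cuckoo filter (all names and parameters are assumptions for the example; the XOR trick for the alternate bucket requires a power-of-two bucket count), not RedisBloom's code:

```python
import hashlib
import random

def _hash(value):
    """Stable integer hash of a string."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class CuckooFilter:
    """Toy cuckoo filter: items are stored as short fingerprints in one of
    two candidate buckets; num_buckets must be a power of two."""

    def __init__(self, num_buckets=128, bucket_size=4, max_kicks=500):
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item):
        return _hash("fp:" + item) % 65535 + 1   # 16-bit, non-zero

    def _indices(self, item, fp):
        i1 = _hash(item) % self.num_buckets
        i2 = (i1 ^ _hash(str(fp))) % self.num_buckets  # partial-key hashing
        return i1, i2

    def add(self, item):
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets full: evict a resident fingerprint (the "cuckoo" step)
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = (i ^ _hash(str(fp))) % self.num_buckets  # evicted fp's other bucket
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is considered full

    def contains(self, item):
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item):
        fp = self._fingerprint(item)
        for i in self._indices(item, fp):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```

Because the alternate bucket is derived from the fingerprint alone, a stored fingerprint can be relocated without knowing the original item – which is what makes both eviction and deletion possible.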

Count-Min sketch

The Count-Min sketch probabilistic data structure, like HyperLogLog, counts the items that have been added, with the difference that the Count-Min sketch counts the number of times specific items have been added – that is, their frequency.

When using a Count-Min sketch data structure, any frequency counts below a predetermined threshold (established by the error rate) should be disregarded. The Count-Min sketch serves as a valuable tool for counting element frequencies in a data stream, especially when dealing with higher counts. Nevertheless, very low counts are often perceived as noise and are typically discarded in this context. To start using the data structure, we have the option to initialize it either based on the probabilities to be maintained or on the desired dimensions. It is important to note that the dimensions of the Count-Min sketch play a significant role because to merge two Count-Min sketches, they must have identical dimensions.

We can...
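The core idea – a table of depth rows, one hash per row, and a minimum taken across rows at query time – fits in a short sketch. This is an illustrative implementation under assumed dimensions (width 1,024, depth 4), not RedisBloom's code:

```python
import hashlib

class CountMinSketch:
    """Toy Count-Min sketch: depth rows of width counters; each row hashes
    the item to one counter, and a query takes the minimum across rows."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def query(self, item):
        # The minimum across rows bounds the overestimate caused by collisions
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

cms = CountMinSketch()
for _ in range(5):
    cms.add("apple")
cms.add("banana", 2)
```

Collisions can only inflate a counter, never decrease it, so the sketch never underestimates a frequency – which is why low counts are the ones most distorted by noise.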

Top-K

The Top-K data structure is used to keep track of the items with the highest rank, such as the top players in a leaderboard. The ranking, or score, is often based on a count of how many times an item appears in the data source (such as a stream), making the data structure ideal for identifying the elements with the highest frequency. Common use cases include leaderboards, trending entities in a system, and detecting network anomalies and DDoS attacks, where the Top-K data structure can help answer questions such as “Which addresses or IPs have the highest surge in the flow of requests?”

Let’s dive into an example of using the Top-K data structure and insert a few items into it. First, we must initialize it using the following command:

TOPK.RESERVE key topk [width depth decay]

In addition to key, which specifies the Top-K name, topk indicates the number of top items we want to keep track of, and width indicates the number...
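As a conceptual stand-in for how a bounded set of counters can surface heavy hitters from a stream, here is the classic Misra-Gries summary in Python. Note this is not the algorithm RedisBloom's Top-K uses internally; it is a simpler illustration of the same goal, with a useful guarantee: any item occurring more than n/k times in a stream of n items is certain to survive in the result.

```python
def heavy_hitters(stream, k):
    """Misra-Gries summary: keeps at most k-1 counters. Any item whose
    frequency exceeds len(stream)/k is guaranteed to appear in the result."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical request stream: two "hot" IPs among many one-off clients
stream = (["10.0.0.1"] * 50 + ["10.0.0.2"] * 30
          + [f"10.0.0.{i}" for i in range(3, 23)])
top = heavy_hitters(stream, 4)
```

With k=4 and 100 stream entries, any IP seen more than 25 times must appear in the summary, so the two hot IPs are guaranteed to be reported.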

t-digest

t-digest is a data structure for estimating quantiles from a data stream or a large dataset using a compact sketch.

The t-digest data structure enables the resolution of various inquiries, such as “What proportion of values in the data stream is less than a specific value?” and “How many values in the data stream are below a given threshold?” To better understand t-digest, we need to define quantiles and percentiles.

A quantile is a value, or cut point, that divides a dataset into intervals with equal proportions or frequencies of observations. The median, for example, is a quantile, as it divides the dataset in half (that is, 50% of observations fall below it and 50% above).

A percentile represents a specific position within a dataset, where a certain percentage of the data falls below that position. For example, if a value is at the 75th percentile of a dataset, it means that 75% of the data falls below that value. Percentiles are...
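To ground these definitions, exact quantiles on a small dataset can be computed with Python's standard library (t-digest approximates the same answers on streams too large to sort). The sample values here are arbitrary:

```python
from statistics import median, quantiles

data = [12, 7, 3, 41, 25, 18, 33, 9, 27, 15]

# The median is the 0.5 quantile: half the observations fall below it
print(median(data))          # 16.5

# Quartiles are the cut points dividing the data into four equal-frequency
# intervals; q3 is the 75th percentile
q1, q2, q3 = quantiles(data, n=4)
```

The second quartile coincides with the median by definition, which is a handy sanity check when working with any quantile estimator.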

Summary

In this chapter, we introduced probabilistic data structures, also known as “sketches.” We explained their strengths in different areas, such as fraud detection, online gaming, network and device analysis, and social media trend analysis. We also learned how to choose the right data structure for each use case and saw that such solutions are often an acceptable compromise between accuracy, performance, and resource usage, especially when dealing with exceptionally large datasets. In addition, we discussed how accuracy can be fine-tuned and what guarantees hold when an item is tested against a specific sketch. To learn more about probabilistic data structures and how to achieve acceptable accuracy, refer to the Sizing section in the documentation at https://redis.io/docs/data-types/probabilistic/.

In Chapter 9, The Programmability of Redis Stack, we will dive into the strengths of Redis Stack...
