Chapter 1: An Introduction to Streaming Data
Streaming analytics is one of the new hot topics in data science. It proposes an alternative to the more standard batch processing framework: instead of processing datasets at fixed moments in time, we handle every individual data point directly upon reception.
This new paradigm has important consequences for data engineering, as it requires much more robust and, particularly, much faster data ingestion pipelines. It also imposes a big change in data analytics and machine learning.
Until recently, machine learning and data analytics methods and algorithms were mainly designed to work on entire datasets. Now that streaming has become a hot topic, it is becoming more and more common to see use cases in which entire datasets just do not exist anymore. When a continuous stream of data is being ingested into a data store, there is no natural moment to relaunch an analytics batch job.
Streaming analytics and streaming machine learning models are models that are designed to work specifically with streaming data sources. Part of the solution, for example, lies in updating: streaming analytics and machine learning models need to update continuously as new data is received. When updating, you may also want to gradually forget older data.
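As a minimal sketch of this continuous updating with forgetting (the `EwmMean` class and its `alpha` parameter are illustrative choices, not code from this book), an exponentially weighted mean revises its estimate with every incoming point while older data gradually fades out:

```python
class EwmMean:
    """Exponentially weighted running mean; alpha controls how fast old data fades."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.value = None

    def update(self, x):
        # The first point initializes the estimate; later points are blended in.
        if self.value is None:
            self.value = float(x)
        else:
            self.value = self.alpha * x + (1 - self.alpha) * self.value
        return self.value

stream_mean = EwmMean(alpha=0.3)
for x in [10, 11, 10, 30, 30, 30]:
    stream_mean.update(x)
print(round(stream_mean.value, 2))  # the estimate has drifted toward the recent 30s
```

A higher `alpha` makes the model forget faster; a lower `alpha` makes the estimate smoother but slower to react to change.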
These and other problems introduced by moving from batch analytics to streaming analytics require a different approach to analytics and machine learning. This book will lay the groundwork for getting you started with data analytics and machine learning on data that is received as a continuous stream.
In this first chapter, you'll get a more solid understanding of the differences between streaming and batch data. You'll see some example use cases that showcase the importance of working with streaming rather than converting back into batch. You'll also start working with a first Python example to get a feel for the type of work that you'll be doing throughout this book.
In later chapters, you'll see some more background notions on architecture and, then, you'll go into a number of data science and analytics use cases and how they can be adapted to the new streaming paradigm.
In this chapter, you will discover the following topics:
- A short history of data science
- Working with streaming data
- Real-time data formats and importing an example dataset in Python
Technical requirements
You can find all the code for this book on GitHub at the following link: https://github.com/PacktPublishing/Machine-Learning-for-Streaming-Data-with-Python. If you are not yet familiar with Git and GitHub, the easiest way to download the notebooks and code samples is the following:
- Go to the link of the repository.
- Go to the green Code button.
- Select Download ZIP.
Once you have downloaded the ZIP file, unzip it in your local environment, and you will be able to access the code through your preferred Python editor.
Setting up a Python environment
To follow along with this book, you can download the code in the repository and execute it using your preferred Python editor.
If you are not yet familiar with Python environments, I would advise you to check out Anaconda (https://www.anaconda.com/products/individual), which comes with Jupyter Notebook and JupyterLab, both great for executing notebooks. It also comes with Spyder and VSCode for editing scripts and programs.
If you have difficulty installing Python or the associated programs on your machine, you can check out Google Colab (https://colab.research.google.com/) or Kaggle Notebooks (https://www.kaggle.com/code), which both allow you to run Python code in online notebooks for free, without any setup to do.
Note
The code in the book will generally use Colab and Kaggle Notebooks with Python version 3.7.13, and you can set up your own environment to mimic this.
A short history of data science
Over the last few years, new technology domains have quickly taken over many parts of the world. Machine learning, artificial intelligence, and data science are new fields that have entered our daily lives, both personal and professional.
The topics that data scientists work on today are not new. The absolute foundation of the field is in mathematics and statistics, two fields that have existed for centuries. As an example, least squares regression was first published in 1805. Since then, mathematicians and statisticians have continued to develop other methods and models.
In the following timeline, you can see how the recent boom in technology was able to take place. In the 1600s and 1700s, very smart people were already laying the foundations for what we still do in statistics and mathematics today. However, it was not until the invention and popularization of computing power that the field began to boom.
The accessibility of personal computers and the internet is an important reason for data science's popularity today. Almost everyone has a computer powerful enough for fairly complex machine learning. This strongly helps computer literacy, and the accessibility of online documentation is another big booster for learning.
The availability of big data tools such as Hadoop and Spark is also an important part of the popularization of data science, as they allow practitioners to work with datasets that are larger than anyone could ever imagine before.
Lastly, cloud computing allows data scientists from all over the world to access very powerful hardware at low prices. Especially for big data tools, the hardware needed is still priced beyond what most students could afford for training purposes. Cloud computing gives many people access to those use cases.
In this book, you will learn how to work with streaming data. It is important to keep this short history of data science in mind, as streaming data is one of those technologies that has been held back by demanding hardware and setup requirements. Streaming data is currently gaining popularity quickly in many domains and has the potential to be a big hit in the coming years. Let's now have a deeper look at the definition of streaming data.
Working with streaming data
Streaming data is data that arrives as a continuous flow of individual data points rather than as a complete dataset. You may know the term streaming from online video services on which you can stream video. When doing this, the video streaming service will continue sending the next parts of the video to you while you are already watching the first part.
The concept is the same when working with streaming data. The data format is not necessarily video and can be any data type that is useful for your use case. One of the most intuitive examples is that of an industrial production line, in which you have continuous measurements from sensors. As long as your production line doesn't pause, you will continue to generate measurements. The following overview shows the data streaming process:
The important notion is that you have a continuous flow of data that you need to treat in real time. You cannot wait until the production line stops to do your analysis, as you would need to detect potential problems right away.
Streaming data versus batch data
Streaming data is generally not among the first use cases that new data scientists tend to start with. The type of problem that is usually introduced first is the batch use case. Batch data is the opposite of streaming data, as it works in phases: you first collect a bunch of data, and then you treat it.
If you see streaming data as streaming a video online, you could see batch data as downloading the entire video first and then watching it when the downloading is finished. For analytical purposes, this would mean that you get the analysis of a bunch of data when the data generating process is finished rather than whenever a problem occurs.
For some use cases, this is not a problem. Yet, you can understand that streaming can deliver great added value in those use cases where fast analytics can have an impact. It also has added value in use cases where data is ingested as a stream, which is becoming more and more common. In practice, many use cases that would gain added value from streaming are still solved with batch treatment, just because those methods are better known and more widespread.
The following overview shows the batch treatment process:
Advantages of streaming data
Let's now look at some advantages of using streaming analytics rather than other approaches in the following subsections.
Data generating processes are in real time
The first advantage of building streaming data analytics rather than batch systems is that many data generating processes are actually real-time. You will discover a number of use cases later, but in general, it is rare that data collection naturally happens in batches.
Although most of us are used to building batch systems around real-time data generating systems, it often makes more sense to build streaming analytics directly.
Of course, batch analytics and streaming analytics can co-exist. Yet, adding a batch treatment to a streaming analytics service is often much easier than adding streaming functionality into a system that is designed for batches. It simply makes the most sense to start with streaming.
Real-time insights have value
When designing data science solutions, streaming does not always come to mind first. However, once solutions or tools are built to work in real time, it is rare that the real-time functionality goes unappreciated.
Many of today's analytical solutions are built in real time, and the tools to do so are available. In many problems, real-time information will be used at some point. Maybe it will not be used from the start, but the day anomalies happen, you will find a great competitive advantage in having the analytics straight away, rather than waiting until the next hour or the next morning.
Examples of successful implementation of streaming analytics
Let's talk about some examples of companies that have implemented real-time analytics successfully. The first example is Shell, which has been able to implement real-time analytics on the security cameras at its gas stations. An automated, real-time machine learning pipeline is able to detect whether people are smoking.
Another example is the use of sensor data in connected sports equipment. By measuring heart rate and other KPIs in real time, such equipment is able to alert you when anything is wrong with your body.
Of course, the big players such as Facebook and Twitter also analyze a lot of data in real time, for example, when detecting fake news or bad content. There are many successful use cases of streaming analytics, yet at the same time, there are some common challenges that streaming data brings with it. Let's have a look at them now.
Challenges of streaming data
Streaming data analytics is currently less widespread than batch data analytics. Although this is slowly changing, it is good to understand where the challenges lie when working with streaming data.
Knowledge of streaming analytics
One simple reason for streaming analytics being less widespread is a question of knowledge and know-how. Setting up streaming analytics is often not taught in schools and is definitely not taught as the go-to method. There are also fewer resources available on the internet for getting started with it. As there are many more resources on machine learning and analytics for batch treatment, and batch methods do not directly apply to streaming data, people tend to start with batch applications for data science.
Understanding the architecture
A second difficulty when working on streaming data is architecture. Although some data science practitioners have knowledge of architecture, data engineering, and DevOps, this is not always the case. To set up a streaming analytics proof of concept or a minimum viable product (MVP), all those skills are needed. For batch treatment, it is often enough to work with scripts.
Architectural difficulties are inherent to streaming, as it is necessary to work with real-time processes that send individually collected records to an analytical treatment process that will update in real time. If there is no architecture that can handle this, it does not make much sense to start with streaming analytics.
Financial hurdles
Another challenge when working with streaming data is the financial aspect. Although working with streaming is not necessarily more expensive in the long run, it can be more expensive to set up the infrastructure needed to get started. Working on a local developer PC for an MVP is unlikely to succeed as the data needs to be treated in real time.
Risks of runtime problems
Real-time processes also have a larger risk of runtime problems. When building software, bugs and failures happen. If you are on a daily batch process, you may be able to repair the process, rerun the failed batch, and solve the problem.
If a streaming tool is down, there is a risk of losing data. As the data should be ingested in real time, the data that is generated during a time-out of your process may not be recoverable. If your process is very important, you will need to set up extensive monitoring day and night and have more quality checks before pushing your solutions to production. Of course, this is also important in batch processes, but even more so in streaming.
Smaller analytics (fewer methods easily available)
The last challenge of streaming analytics is that common methods are generally developed for batch data first. There are currently many solutions out there for analytics on real-time and streaming data, but still not as many as for batch data.
Also, since streaming analysis has to be done very quickly to respect real-time delivery, streaming use cases tend to end up with less sophisticated analytical methodologies, staying at the level of descriptive or basic analyses.
How to get started with streaming data
For companies getting started with streaming data, the first step is often to put in place simple applications that collect real-time data and make that data accessible in real time. Common use cases to start with are log data, website visit data, or sensor data.
A next step would often be to build reporting tools on top of the real-time data source. You can think of KPI dashboards that update in real time, or small and simple alerting tools based on high or low threshold values derived from business rules.
When such systems are in place, this paves the way for replacing those business rules, or adding on top of them. You can think of more advanced analytics tools, including real-time machine learning for anomaly detection and more.
The most complex step is to add automated feedback loops between your real-time machine learning and your process. After all, there is no reason to stop at analytics for business insights if there is potential to automate and improve decision-making as well.
Common use cases for streaming data
Let's see a few of the most common use cases for streaming data so that you can get a better feel for the use cases that can benefit from streaming techniques. This will cover three use cases that are relatively accessible to anyone, but of course, there are many more.
Sensor data and anomaly detection
A common use case for streaming data is the analysis of sensor data. Sensor data can occur in a multitude of use cases, such as industry production lines and IoT use cases. When companies decide to collect sensor data, it is often treated in real time.
For a production line, there is great value in detecting anomalies in real time. When too many anomalies occur, the production line can be shut down or the problem can be solved before a number of faulty products are delivered.
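As a hedged sketch of such real-time anomaly detection (the window size, threshold, and function name here are illustrative assumptions, not from this book), a sensor reading can be flagged the moment its z-score against a sliding window of recent readings becomes extreme:

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(stream, window_size=10, threshold=3.0):
    """Yield (value, is_anomaly) pairs using a z-score against a sliding window."""
    window = deque(maxlen=window_size)
    for x in stream:
        if len(window) >= 3 and stdev(window) > 0:
            z = (x - mean(window)) / stdev(window)
            yield x, abs(z) > threshold
        else:
            yield x, False  # not enough history yet to judge
        window.append(x)

# Simulated temperature readings with one faulty spike
readings = [10, 11, 10, 11, 10, 11, 10, 50, 11, 10]
for value, is_anomaly in detect_anomalies(readings):
    if is_anomaly:
        print(f'anomaly detected: {value}')
```

Because the detector only keeps a small window of state, it can run indefinitely on the stream without ever needing the full dataset.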
A good example of streaming analytics for monitoring humidity for artwork can be found here: https://azure.github.io/iot-workshop-asset-tracking/step-003-anomaly-detection/.
Finance and regression forecasting
Finance data is another great use case for streaming data. For example, in the world of stock trading, timing is important. The faster you can detect up or downtrends in the stock market, the faster a trader (or algorithm) can react by selling or buying stocks and making money.
A great example is described in the following paper by K. S. Umadevi et al. (2018): https://ieeexplore.ieee.org/document/8554561.
Clickstream for websites and classification
Websites or apps are a third common use case for real-time insights. If you can track and analyze your visitors in real time, you can propose a personalized experience for them on your website. By proposing products or services that match with a website visitor, you can increase your online sales.
The following paper by Ramanna Hanamanthrao and S. Thejaswini (2017) gives a great use case for this technology applied to clickstream data: https://ieeexplore.ieee.org/abstract/document/8256978.
Streaming versus big data
It is important to understand different definitions of streaming that you may encounter. One distinction to make is between streaming and big data. Some definitions will consider streaming mainly in a big data (Hadoop/Spark) context, whereas others do not.
Streaming solutions often have a large volume of data, and big data solutions can be the appropriate choice. However, other technologies, combined with a well-chosen hardware architecture, may also be able to do the analytics in real time and, therefore, build streaming solutions without big data technologies.
Streaming versus real-time inference
Real-time inference of models is often built and made accessible via an API. As we define streaming as the analysis of data in real time without batches, such predictions in real time can be considered streaming. You will see more about real-time architectures in a later chapter.
Real-time data formats and importing an example dataset in Python
To finalize this chapter, let's have a look at how to represent streaming data in practice. After all, when building analytics, we will often have to implement test cases and example datasets.
The simplest way to represent streaming data in Python would be to create an iterable object that contains the data and to build your analytics function to work with an iterable.
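As a minimal sketch of this idea (the `sensor_stream` and `process` names are illustrative, not from this book's repository), a Python generator can play the role of the iterable stream, with the analytics function consuming one data point at a time:

```python
def sensor_stream():
    """Yield data points one by one, as a real-time source would deliver them."""
    for temperature in [10, 11, 9, 12]:
        yield {'temperature': temperature}

def process(stream):
    # The analytics function only ever sees the current data point.
    for datapoint in stream:
        print('received:', datapoint)

process(sensor_stream())
```

When the development is done, the generator can be swapped for a genuine real-time source without changing the consuming function.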
The following code creates a DataFrame using pandas. There are two columns, temperature and pH:
Code block 1-1
import pandas as pd
data_batch = pd.DataFrame({
    'temperature': [10, 11, 10, 11, 12, 11, 10, 9, 10, 11, 12, 11, 9, 12, 11],
    'pH': [5, 5.5, 6, 5, 4.5, 5, 4.5, 5, 4.5, 5, 4, 4.5, 5, 4.5, 6]
})
print(data_batch)
When showing the DataFrame, it will look as follows. The pH is around 4.5/5 but is sometimes higher. The temperature is generally around 10 or 11.
This dataset is a batch dataset; after all, you have all the rows (observations) at the same time. Now, let's see how to convert this dataset to a streaming dataset by making it iterable.
You can do this by iterating through the data's rows. When doing this, you set up a code structure that allows you to add more building blocks to this code one by one. When your developments are done, you will be able to use your code on a real-time stream rather than on an iteration of a DataFrame.
The following code iterates through the rows of the DataFrame and converts the rows to JSON format. This is a very common format for communication between different systems. The JSON of the observation contains a value for temperature and a value for pH. Those are printed out as follows:
Code block 1-2
data_iterable = data_batch.iterrows()
for i, new_datapoint in data_iterable:
    print(new_datapoint.to_json())
After running this code, you should obtain a print output that looks like the following:
Let's now define a super simple example of streaming data analytics. The function that is defined in the following code block will print an alert whenever the temperature gets below 10:
Code block 1-3
def super_simple_alert(datapoint):
    if datapoint['temperature'] < 10:
        print('this is a real time alert. temp too low')
You can now add this alert into your simulated streaming process simply by calling the alerting test at every data point. You can use the following code to do this:
Code block 1-4
data_iterable = data_batch.iterrows()
for i, new_datapoint in data_iterable:
    print(new_datapoint.to_json())
    super_simple_alert(new_datapoint)
When executing this code, you will notice that alerts will be given as soon as the temperature goes below 10:
This alert works only on the temperature, but you could easily add the same type of alert on pH. The following code shows how this can be done. The alert function could be updated to include a second business rule as follows:
Code block 1-5
def super_simple_alert(datapoint):
    if datapoint['temperature'] < 10:
        print('this is a real time alert. temp too low')
    if datapoint['pH'] > 5.5:
        print('this is a real time alert. pH too high')
Executing the function would still be done in exactly the same way:
Code block 1-6
data_iterable = data_batch.iterrows()
for i, new_datapoint in data_iterable:
    print(new_datapoint.to_json())
    super_simple_alert(new_datapoint)
You will see several alerts being raised throughout the execution on the example streaming data, as follows:
With streaming data, you have to decide without seeing the complete data, based only on the data points that have been received so far. This means that a different approach is needed: algorithms that resemble batch processing algorithms have to be redeveloped to work on streams.
Throughout this book, you will discover methods that apply to streaming data. The difficulty, as you may understand, is that statistical methods are generally developed to compute things using all of the data at once.
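As an illustration of this difference (the function names here are mine, not from this book), the mean of a dataset can either be computed over all the data at once or be redeveloped as an incremental update that keeps only a small running state:

```python
def batch_mean(values):
    # Batch version: needs every observation in memory at once.
    return sum(values) / len(values)

def streaming_mean(stream):
    # Streaming version: keeps only a count and a running mean.
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # standard incremental mean update
    return mean

data = [10, 11, 10, 11, 12, 11, 10, 9]
print(batch_mean(data), streaming_mean(iter(data)))  # both approximately 10.5
```

The streaming version never stores the data, so it can keep running on an endless stream; most of the methods in this book follow the same pattern of updating a compact state per data point.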
Summary
In this introductory chapter on streaming data and streaming analytics, you first saw some definitions of what streaming data is and how it contrasts with batch data processing. With streaming data, you need to work with a continuous stream of data, and more traditional (batch) data science solutions need to be adapted to make things work with this newer and more demanding method of data treatment.
You have seen a number of example use cases, and you should now understand that there can be much added value for businesses and advanced technology use cases in having data science and analytics computed on the fly rather than waiting for a fixed moment. Real-time insights can be a game-changer, and autonomous machine learning solutions often need real-time decision capabilities.
You have seen an example in which a data stream was created and a simple real-time alerting system was developed. In the next chapter, you will get a much deeper introduction to a number of streaming solutions. In practice, data scientists and analysts will generally not be responsible for putting streaming data ingestion in place, but they will be constrained by the limits of those systems. It is, therefore, important to have a good understanding of streaming and real-time architecture: this will be the goal of the next chapter.
Further reading
- What is streaming data? (by AWS): https://aws.amazon.com/streaming-data/
- The 8 Best Examples of Real-Time Data Analytics, by Bernard Marr: https://www.linkedin.com/pulse/8-best-examples-real-time-data-analytics-bernard-marr/
- How to Build a Strong Business Case For Streaming Analytics, by Forbes: https://www.forbes.com/sites/forbestechcouncil/2021/10/26/how-to-build-a-strong-business-case-for-streaming-analytics/?sh=314e2b8a465d
- 7 enterprise use cases for real-time streaming analytics: https://searchbusinessanalytics.techtarget.com/feature/7-enterprise-use-cases-for-real-time-streaming-analytics
- From batch to online/stream, by RiverML: https://riverml.xyz/dev/examples/batch-to-online/
- Anaconda: https://www.anaconda.com/products/individual
- Google Colab: https://colab.research.google.com/
- Kaggle Notebooks: https://www.kaggle.com/code