Most of us have requested an insurance quote at some point. To get a quote for a car, we fill in our details, and the application uses our credit history and other information to return a quote in real time. The application analyzes the data as it is submitted and predicts the quote from it. For years, these applications were mostly rule-based, with a powerful rule engine running behind the scenes. More recently, they have started using machine learning to analyze the data further and make predictions at that point in time. All the predictions and analysis that happen at that instant are real-time analytics. Some of the most popular websites, such as Netflix, and well-known ad networks all use real-time analytics, and with new devices arriving as part of the Internet of Things (IoT) wave, collecting and analyzing data in real time has become the need of the...
You're reading from Big Data Analytics with Java
As the name suggests, real-time analytics delivers analysis and its results in real time. Big data has mostly been used in batch mode, where queries over the data run for a long time and the results are analyzed later. This is changing, mainly because certain use cases require immediate results. Real-time analytics requires a separate architecture that handles not only data collection and parsing, but also analysis of the data at the same time.
Let's try to understand the concept of real-time analytics using the following diagram:
As you can see, data today comes from many sources: mobile devices, websites, third-party applications, and even the Internet of Things (sensors). All this data needs a way to flow from the source devices to a central unit where it can be parsed, cleaned, and finally ingested. It is at this ingestion time that the data can also be analyzed...
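To make the parse, clean, and analyze-at-ingestion idea concrete, here is a minimal plain-Java sketch. It is not code from the book: the `IngestionSketch` class, the `Reading` type, and the `source,value` line format are all assumptions made for illustration. Each incoming line is parsed, malformed lines are dropped (the cleaning step), and a running average per source is updated the moment the event is ingested, rather than in a later batch job.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class IngestionSketch {

    /** A parsed event. The "source,value" shape is an assumption for this sketch. */
    public static final class Reading {
        final String source;
        final double value;

        Reading(String source, double value) {
            this.source = source;
            this.value = value;
        }
    }

    // Per-source running state: index 0 holds the event count, index 1 the sum.
    private final Map<String, double[]> stats = new HashMap<>();

    /** Parse and clean one raw line; malformed lines yield Optional.empty(). */
    public static Optional<Reading> parse(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String[] parts = raw.trim().split(",");
        if (parts.length != 2 || parts[0].trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            double value = Double.parseDouble(parts[1].trim());
            return Optional.of(new Reading(parts[0].trim(), value));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    /**
     * Ingest one raw line. If it is valid, update the running state and
     * return the current average for that source; otherwise return empty.
     */
    public Optional<Double> ingest(String raw) {
        return parse(raw).map(r -> {
            double[] s = stats.computeIfAbsent(r.source, k -> new double[2]);
            s[0] += 1;        // count
            s[1] += r.value;  // sum
            return s[1] / s[0];
        });
    }

    public static void main(String[] args) {
        IngestionSketch pipeline = new IngestionSketch();
        String[] incoming = {"sensor-1,20.0", "garbage", "sensor-1,22.0", "sensor-2,5.0"};
        for (String raw : incoming) {
            // The analysis result is available as soon as the event is ingested.
            System.out.println(raw + " -> " + pipeline.ingest(raw));
        }
    }
}
```

In a real deployment the in-memory map would likely be replaced by state managed by a streaming framework, but the shape of the work (parse, clean, update, emit) stays the same.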
In this chapter, we learned about real-time analytics and saw that big data can be used for real-time analytics as well as batch processing. We introduced Impala, a product that can run fast SQL queries on big data, which is usually stored in Parquet format in HDFS. While looking at Impala, we worked through a simple case study on flight analytics. We then covered Apache Kafka, a messaging product that can be used with big data technologies to build real-time data stacks. Kafka is a scalable messaging solution, and we showed how it can be integrated with the Spark Streaming module of Apache Spark. Spark Streaming lets you collect data in mini batches in real time, and it calls a sequence of these mini batches a stream. Spark Streaming is becoming very popular, as it is a scalable solution that fits the needs of many users. Finally, we covered a few case studies using Apache Kafka and Spark Streaming and showed how complex use cases...
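The mini-batch idea mentioned above can be illustrated without any framework. The following is a rough plain-Java sketch, not Spark Streaming's implementation and not code from the book: the `MicroBatchSketch` class, the `Event` type, and the `toMicroBatches` helper are hypothetical names. It assigns timestamped events to consecutive fixed-length windows, so the ordered sequence of windows plays the role of the stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MicroBatchSketch {

    /** A timestamped event; this shape is an assumption for the sketch. */
    public static final class Event {
        final long timestampMs;
        final String payload;

        public Event(long timestampMs, String payload) {
            this.timestampMs = timestampMs;
            this.payload = payload;
        }
    }

    /**
     * Group events into consecutive windows of batchIntervalMs, keyed by the
     * window start time. Assumes non-negative timestamps. Windows that receive
     * no events are simply absent from the result.
     */
    public static TreeMap<Long, List<String>> toMicroBatches(List<Event> events,
                                                             long batchIntervalMs) {
        TreeMap<Long, List<String>> batches = new TreeMap<>();
        for (Event e : events) {
            // Integer division floors the timestamp to the start of its window.
            long windowStart = (e.timestampMs / batchIntervalMs) * batchIntervalMs;
            batches.computeIfAbsent(windowStart, k -> new ArrayList<>()).add(e.payload);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event(100, "a"),
                new Event(900, "b"),
                new Event(1200, "c"),
                new Event(2500, "d"));

        // One-second batches: each map entry is one mini batch, in time order.
        for (Map.Entry<Long, List<String>> batch : toMicroBatches(events, 1000).entrySet()) {
            System.out.println("batch starting at " + batch.getKey() + " ms: " + batch.getValue());
        }
    }
}
```

A real streaming engine forms these batches continuously as data arrives instead of from a finished list, but the grouping rule by batch interval is the same idea.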