What do you get with a Packt Subscription?

Free for first 7 days. $15.99 p/m after that. Cancel any time!

Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!

50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.

Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.

Thousands of reference materials covering every tech concept you need to stay up to date.

Subscribe now

View plans & pricing

Real-time Analytics with Storm and Cassandra

Chapter 1. Let's Understand Storm

In this chapter, you will be acquainted with the problems requiring distributed computed solutions and get to know how complex it could get to create and manage such solutions. We will look at the options available to solve distributed computation.

The topics that will be covered in the chapter are as follows:

Getting acquainted with a few problems that require distributed computing solutions
The complexity of existing solutions
Technologies offering real-time distributed computing
A high-level view of the various components of Storm
A quick peek into the internals of Storm

By the end of the chapter, you will be able to understand the real-time scenarios and applications of Apache Storm. You should be aware of solutions available in the market and reasons as to why Storm is still the best open source choice.

Distributed computing problems

Let's dive deep and identify some of the problems that require distributed solutions. In the world we live in today, we are so attuned to the power of now and that's the paradigm that generated the need for distributed real-time computing. Sectors such as banking, healthcare, automotive manufacturing, and so on are hubs where real-time computing can either optimize or enhance the solutions.

Real-time business solution for credit or debit card fraud detection

Let's get acquainted with the problem depicted in the following figure; when we make any transaction using plastic money and swipe our debit or credit card for payment, the duration within which the bank has to validate or reject the transaction is less than five seconds. In less than five seconds, data or transaction details have to be encrypted, travel over secure network from servicing back bank to the issuing bank, then at the issuing back bank the entire fuzzy logic for acceptance or decline of the transaction has to be computed, and the result has to travel back over the secure network:

Real-time credit card fraud detection

The challenges such as network latency and delay can be optimized to some extent, but to achieve the preceding featuring transaction in less than 5 seconds, one has to design an application that is able to churn a considerable amount of data and generate results within 1 to 2 seconds.

Aircraft Communications Addressing and Reporting system

The Aircraft Communications Addressing and Reporting system (ACAR) demonstrates another typical use case that cannot be implemented without having a reliable real-time processing system in place. These Aircraft communication systems use satellite communication (SATCOM), and as per the following figure, they gather voice and packet data from all phases of flight in real time and are able to generate analytics and alerts on the data in real time.

Let's take the example from the figure in the preceding case. A flight encounters some real hazardous weather, say, electric Storms on a route, then that information is sent through satellite links and voice or data gateways to the air controller, which in real time detects and raises the alerts to deviate routes for all other flights passing through that area.

Healthcare

Here, let's introduce you to another problem on healthcare.

This is another very important domain where real-time analytics over high volume and velocity data has equipped the healthcare professionals with accurate and exact information in real time to take informed life-saving actions.

The preceding figure depicts the use case where doctors can take informed action to handle the medical situation of the patients. Data is collated from historic patient databases, drug databases, and patient records. Once the data is collected, it is processed, and live statistics and key parameters of the patient are plotted against the same collated data. This data can be used to further generate reports and alerts to aid the health care professionals.

Other applications

There are varieties of other applications where the power of real-time computing can either optimize or help people make informed decisions. It has become a great utility and aid in the following industries:

Manufacturing: A real-time defect detection mechanism can help optimize production costs. Generally, in the manufacturing segment QC is performed postproduction and there, due to one similar defect in goods, entire lot is rejected.
Transportation industry: Based on real-time traffic and weather data, transport companies can optimize their trade routes and save time and money.
Network optimization: Based on real-time network usage alerts, companies can design auto scale up and auto scale down systems for peak and off-peak hours.

Solutions for complex distributed use cases

Now that we understand the power that real-time solutions can get into various industry verticals, let's explore and find out what options we have to process vast amount of data being generated at a very fast pace.

The Hadoop solution

The Hadoop solution is one of the solutions to solve the problems that require dealing with humongous volumes of data. It works by executing jobs in a clustered setup.

MapReduce is a programming paradigm where we process large data sets by using a mapper function that processes a key and value pair and thus generates intermediate output again in the form of a key-value pair. Then a reduce function operates on the mapper output and merges the values associated with the same intermediate key and generates a result.

In the preceding figure, we demonstrate the simple word count MapReduce job where the simple word count job is being demonstrated using the MapReduce where:

There is a huge Big Data store, which can go up to zettabytes or petabytes.
Input datasets or files are split into blocks of configured size and each block is replicated onto multiple nodes in the Hadoop cluster depending upon the replication factor.
Each mapper job counts the number of words on the data blocks allocated to it.
Once the mapper is done, the words (which are actually the keys) and their counts are stored in a local file on the mapper node. The reducer then starts the reduce function and thus generates the result.
Reducers combine the mapper output and the final results are generated.

Big data, as we know, did provide a solution to processing and generating results out of humongous volumes of data, but that's predominantly a batch processing system and has almost no utility on a real-time use case.

A custom solution

Here we talk about a solution that was used in the social media world before we had a scalable framework such as Storm. A simplistic version of the problem could be that you need a real-time count of the tweets by each user; Twitter solved the problem by following the mechanism shown in the figure:

Here is the detailed information of how the preceding mechanism works:

A custom solution created a fire hose or queue onto which all the tweets are pushed.
A set of workers' nodes read data from the queue, parse the messages, and maintain counts of tweets by each user. The solution is scalable, as we can increase the number of workers to handle more load in the system. But the sharding algorithm for random distribution of the data among these workers nodes' should ensure equal distribution of data to all workers.
These workers assimilate this first level count into the next set of queues.
From these queues (the ones mentioned at level 1) second level of workers pick from these queues. Here, the data distribution among these workers is neither equal, nor random. The load balancing or the sharding logic has to ensure that tweets from the same user should always go to the same worker, to get the correct counts. For example, lets assume we have different users— "A, K, M, P, R, and L" and we have two workers "worker A" and "worker B". The tweets from user "A, K, and M" always goes to "worker A", and those of "P, R, and L users" goes to "worker B"; so the tweet counts for "A, K, and M" are always maintained by "worker A". Finally, these counts are dumped into the data store.

The queue-worker solution described in the preceding points works fine for our specific use case, but it has the following serious limitations:

It's very complex and specific to the use case
Redeployment and reconfiguration is a huge task
Scaling is very tedious
The system is not fault tolerant

Licensed proprietary solutions

After an open source Hadoop and custom Queue-worker solution, let's discuss the licensed options' proprietary solutions in the market to cater to the distributed real-time processing needs.

The Alabama Occupational Therapy Association (ALOTA) of big companies has invested in such products, because they clearly see where the future of computing is moving to. They can foresee demands of such solutions and support them in almost every vertical and domain. They have developed such solutions and products that let us do complex batch and real-time computing but that comes at a heavy license cost. A few solutions to name are from companies such as:

IBM: IBM has developed InfoSphere Streams for real-time ingestion, analysis, and correlation of data.
Oracle: Oracle has a product called Real Time Decisions (RTD) that provides analysis, machine learning, and predictions in real-time context
GigaSpaces: GigaSpaces has come up with a product called XAP that provides in-memory computation to deliver real-time results

Other real-time processing tools

There are few other technologies that have some similar traits and features such as Apache Storm and S4 from Yahoo, but it lacks guaranteed processing. Spark is essentially a batch processing system with some features on micro-batching, which could be utilized as real time.

What you will learn

Integrate Storm applications with RabbitMQ for realtime analysis and processing of messages Monitor highly distributed applications using Nagios Integrate the Cassandra data store with Storm Develop and maintain distributed Storm applications in conjunction with Cassandra and In Memory Database (memcache) Build a Trident topology that enables realtime computing with Storm Tune performance for Storm topologies based on the SLA and requirements of the application Use Esper with the Storm framework for rapid development of applications

What do you get with a Packt Subscription?