Practical Real-time Data Processing and Analytics

By Shilpi Saxena , Saurabh Gupta
    What do you get with a Packt Subscription?

  • Instant access to this title and 7,500+ eBooks & Videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Introducing Real-Time Analytics

About this book

With the rise of Big Data, there is an increasing need to process large amounts of data continuously, with a shorter turnaround time. Real-time data processing involves continuous input, processing and output of data, with the condition that the time required for processing is as short as possible.

This book covers the majority of the existing and evolving open source technology stack for real-time processing and analytics. You will get to know about all the real-time solution aspects, from the source to the presentation to persistence. Through this practical book, you’ll be equipped with a clear understanding of how to solve challenges on your own.

We’ll cover topics such as how to set up components, basic executions, integrations, advanced use cases, alerts, and monitoring. You’ll be exposed to the popular tools used in real-time processing today such as Apache Spark, Apache Flink, and Storm. Finally, you will put your knowledge to practical use by implementing all of the techniques in the form of a practical, real-world use case.

By the end of this book, you will have a solid understanding of all the aspects of real-time data processing and analytics, and will know how to deploy the solutions in production environments in the best possible manner.

Publication date:
September 2017


Chapter 1. Introducing Real-Time Analytics

This chapter sets the context for the reader by providing an overview of the big data technology landscape in general and real–time analytics in particular. This provides an outline for the book conceptually, with an attempt to ignite the spark for inquisitiveness that will encourage readers to undertake the rest of the journey through the book.

The following topics will be covered:

  • What is big data?
  • Big data infrastructure
  • Real–time analytics – the myth and the reality
  • Near real–time solution – an architecture that works
  • Analytics – a plethora of possibilities
  • IOT – thoughts and possibilities
  • Cloud – considerations for NRT and IOT

What is big data?

Well to begin with, in simple terms, big data helps us deal with three V's – volume, velocity, and variety. Recently, two more V's were added to it, making it a five–dimensional paradigm; they are veracity and value

  • Volume: This dimension refers to the amount of data; look around you, huge amounts of data are being generated every second – it may be the email you send, Twitter, Facebook, or other social media, or it can just be all the videos, pictures, SMS messages, call records, and data from varied devices and sensors. We have scaled up the data–measuring metrics to terabytes, zettabytes and Yottabytes – they are all humongous figures. Look at Facebook alone; it's like ~10 billion messages on a day, consolidated across all users. We have ~5 billion likes a day and around ~400 million photographs are uploaded each day. Data statistics in terms of volume are startling; all of the data generated from the beginning of time to 2008 is kind of equivalent to what we generate in a day today, and I am sure soon it will be an hour. This volume aspect alone is making the traditional database dwarf to store and process this amount of data in reasonable and useful time frames, though a big data stack can be employed to store process and compute on amazingly large data sets in a cost–effective, distributed, and reliably efficient manner.
  • Velocity: This refers to the data generation speed, or the rate at which data is being generated. In today's world, where we mentioned that the volume of data has undergone a tremendous surge, this aspect is not lagging behind. We have loads of data because we are able to generate it so fast. Look at social media; things are circulated in seconds and they become viral, and the insight from social media is analysed in milliseconds by stock traders, and that can trigger lots of activity in terms of buying or selling. At a target point of sale counter it takes a few seconds for a credit card swipe, and within that fraudulent transaction processing, payment, bookkeeping, and acknowledgement is all done. Big data gives us the power to analyse the data at tremendous speed.
  • Variety: This dimension tackles the fact that the data can be unstructured. In the traditional database world, and even before that, we were used to having a very structured form of data that fitted neatly into tables. Today, more than 80% of data is unstructured – quotable examples are photos, video clips, social media updates, data from variety of sensors, voice recordings, and chat conversations. Big data lets you store and process this unstructured data in a very structured manner; in fact, it effaces the variety.
  • Veracity: It's all about validity and correctness of data. How accurate and usable is the data? Not everything out of millions and zillions of data records is corrected, accurate, and referable. That's what actual veracity is: how trustworthy the data is and what the quality of the data is. Examples of data with veracity include Facebook and Twitter posts with nonstandard acronyms or typos. Big data has brought the ability to run analytics on this kind of data to the table. One of the strong reasons for the volume of data is veracity.
  • Value: This is what the name suggests: the value that the data actually holds. It is unarguably the most important V or dimension of big data. The only motivation for going towards big data for processing super large data sets is to derive some valuable insight from it. In the end, it's all about cost and benefits.

Big data is a much talked about technology across businesses and the technical world today. There are myriad domains and industries that are convinced of its usefulness, but the implementation focus is primarily application-oriented, rather than infrastructure-oriented. The next section predominantly walks you through the same.


Big data infrastructure

Before delving further into big data infrastructure, let's have a look at the big data high–level landscape.

The following figure captures high–level segments that demarcate the big data space:

It clearly depicts the various segments and verticals within the big data technology canvas (bottom up).

The key is the bottom layer that holds the data in scalable and distributed mode:

  • Technologies: Hadoop, MapReduce, Mahout, Hbase, Cassandra, and so on
  • Then, the next level is the infrastructure framework layer that enables the developers to choose from myriad infrastructural offerings depending upon the use case and its solution design
  • Analytical Infrastructure: EMC, Netezza, Vertica, Cloudera, Hortonworks
  • Operational Infrastructure: Couchbase, Teradata, Informatica and many more
  • Infrastructure as a service (IAAS): AWS, Google cloud and many more
  • Structured Databases: Oracle, SQLServer, MySQL, Sybase and many more
  • The next level specializes in catering to very specific needs in terms of
  • Data As A Service (DaaS): Kaggale, Azure, Factual and many more
  • Business Intelligence (BI): Qlikview, Cognos, SAP BO and many more
  • Analytics and Visualizations: Pentaho, Tableau, Tibco and many more

Today, we see traditional robust RDBMS struggling to survive in a cost–effective manner as a tool for data storage and processing. The scaling of traditional RDBMS, at the compute power expected to process huge amount of data with low latency came at a very high price. This led to the emergence of new technologies, which were low cost, low latency, highly scalable at low cost/open source. To our rescue comes The Yellow Elephant—Hadoop that took the data storage and computation arena by surprise. It's designed and developed as a distributed framework for data storage and computation on commodity hardware in a highly reliable and scalable manner. The key computational methodology Hadoop works on involves distributing the data in chunks over all the nodes in a cluster, and then processing the data concurrently on all the nodes.

Now that you are acquainted with the basics of big data and the key segments of the big data technology landscape, let's take a deeper look at the big data concept with the Hadoop framework as an example. Then, we will move on to take a look at the architecture and methods of implementing a Hadoop cluster; this will be a close analogy to high–level infrastructure and the typical storage requirements for a big data cluster. One of the key and critical aspect that we will delve into is information security in the context of big data.

A couple of key aspects that drive and dictate the move to big data infraspace are highlighted in the following figure:

  • Cluster design: This is the most significant and deciding aspect for infrastructural planning. The cluster design strategy of the infrastructure is basically the backbone of the solution; the key deciding elements for the same are the application use cases and requirements, workload, resource computation (depending upon memory intensive and compute intensive computations), and security considerations.

Apart from compute, memory, and network utilization, another very important aspect to be considered is storage which will be either cloud–based or on the premises. In terms of the cloud, the option could be public, private, or hybrid, depending upon the consideration and requirements of use case and the organization

  • Hardware architecture: A lot on the storage cost aspect is driven by the volume of the data to be stored, archival policy, and the longevity of the data. The decisive factors are as follows:
    • The computational needs of the implementations (whether the commodity components would suffice, or if the need is for high–performance GPUs).
    • What are the memory needs? Are they low, moderate, or high? This depends upon the in–memory computation needs of the application implementations.
  • Network architecture: This may not sound important, but it is a significant driver in big data computational space. The reason is that the key aspect for big data is distributed computation, and thus, network utilization is much higher than what would have been in the case of a single–server, monolithic implementation. In distributed computation, loads of data and intermediate compute results travel over the network; thus, the network bandwidth becomes the throttling agent for the overall solution and depending on key aspect for selection of infrastructure strategy. Bad design approaches sometimes lead to network chokes, where data spends less time in processing but more in shuttling across the network or waiting to be transferred over the wire for the next step in execution.
  • Security architecture: Security is a very important aspect of any application space, in big data, it becomes all the more significant due to the volume and diversity of the data, and due to network traversal of the data owing to the compute methodologies. The aspect of the cloud computing and storage options adds further needs to the complexity of being a critical and strategic aspect of big data infraspace.

Real–time analytics – the myth and the reality

One of the biggest truths about the real–time analytics is that nothing is actually real–time; it's a myth. In reality, it's close to real–time. Depending upon the performance and ability of a solution and the reduction of operational latencies, the analytics could be close to real–time, but, while day-by-day we are bridging the gap between real–time and near–real–time, it's practically impossible to eliminate the gap due to computational, operational, and network latencies.

Before we go further, let's have a quick overview of what the high–level expectations from these so called real–time analytics solutions are. The following figure captures the high–level intercept of the same, where, in terms of data we are looking for a system that could process millions of transactions with a variety of structured and unstructured data sets. My processing engine should be ultra–fast and capable of handling very complex joined-up and diverse business logic, and at the end, it is also expected to generate astonishingly accurate reports, revert to my ad–hoc queries in a split–second, and render my visualizations and dashboards with no latency:

As if the previous aspects of the expectations from the real–time solutions were not sufficient, to have them rolling out to production, one of the basic expectations in today's data generating and zero downtime era, is that the system should be self–managed/managed with minimalistic efforts and it should be inherently built in a fault tolerant and auto–recovery manner for handling most if not all scenarios. It should also be able to provide my known basic SQL kind of interface in similar/close format.

However outrageously ridiculous the previous expectations sound, they are perfectly normal and minimalistic expectation from any big data solution of today. Nevertheless, coming back to our topic of real–time analytics, now that we have touched briefly upon the system level expectations in terms of data, processing and output, the systems are being devised and designed to process zillions of transactions and apply complex data science and machine learning algorithms on the fly, to compute the results as close to real time as possible. The new term being used is close to real–time/near real–time or human real–time. Let's dedicate a moment to having a look at the following figure that captures the concept of computation time and the context and significance of the final insight:

As evident in the previous figure, in the context of time:

  • Ad–hoc queries over zeta bytes of data take up computation time in the order of hour(s) and are thus typically described as batch. The noteworthy aspect being depicted in the previous figure with respect to the size of the circle is that it is an analogy to capture the size of the data being processed in diagrammatic form.
  • Ad impressions/Hashtag trends/deterministic workflows/tweets: These use cases are predominantly termed as online and the compute time is generally in the order of 500ms/1 second. Though the compute time is considerably reduced as compared to previous use cases, the data volume being processed is also significantly reduced. It would be very rapidly arriving data stream of a few GBs in magnitude.
  • Financial tracking/mission critical applications: Here, the data volume is low, the data arrival rate is extremely high, the processing is extremely high, and low latency compute results are yielded in time windows of a few milliseconds.

Apart from the computation time, there are other significant differences between batch and real–time processing and solution designing:

Batch processing

Real–time processing

Data is at rest

Data is in motion

Batch size is bounded

Data is essentially coming in as a stream and is un–bounded

Access to entire data

Access to data in current transaction/sliding window

Data processed in batches

Processing is done at event, window, or at the most at micro batch level

Efficient, easier administration

Real–time insights, but systems are fragile as compared to batch


Towards the end of this section, all I would like to emphasis is that a near real–time (NRT) solution is as close to true real–time as it is practically possible attain. So, as said, RT is actually a myth (or hypothetical) while NRT is a reality. We deal with and see NRT applications on a daily basis in terms of connected vehicles, prediction and recommendation engines, health care, and wearable appliances.

There are some critical aspects that actually introduce latency to total turnaround time, or TAT as we call it. It's actually the time lapse between occurrences of an event to the time actionable insight is generated out of it.

  • The data/events generally travel from diverse geographical locations over the wire (internet/telecom channels) to the processing hub. There is some time lapsed in this activity.
  • Processing:
    • Data landing: Due to security aspects, data generally lands on an edge node and is then ingested into the cluster
    • Data cleansing: The data veracity aspect needs to be catered for, to eliminate bad/incorrect data before processing
    • Data massaging and enriching: Binding and enriching transnational data with dimensional data
    • Actual processing
    • Storing the results
      • All previous aspects of processing incur:
        • CPU cycles
        • Disk I/O
        • Network I/O
        • Active marshaling and un–marshalling of data serialization aspects.

So, now that we understand the reality of real–time analytics, let's look a little deeper into the architectural segments of such solutions.


Near real–time solution – an architecture that works

In this section, we will learn about what all architectural patterns are possible to build a scalable, sustainable, and robust real–time solution.

A high–level NRT solution recipe looks very straight and simple, with a data collection funnel, a distributed processing engine, and a few other ingredients like in–memory cache, stable storage, and dashboard plugins.

At a high level, the basic analytics process can be segmented into three shards, which are depicted well in previous figure:

  • Real–time data collection of the streaming data
  • Distributed high–performance computation on flowing data
  • Exploring and visualizing the generated insights in the form of query–able consumable layer/dashboards

If we delve a level deeper, there are two contending proven streaming computation technologies on the market, which are Storm and Spark. In the coming section we will take a deeper look at a high–level NRT solution that's derived from these stacks.

NRT – The Storm solution

This solution captures the high–level streaming data in real–time and routes it through some Queue/broker: Kafka or RabbitMQ. Then, the distributed processing part is handled through Storm topology, and once the insights are computed, they can be written to a fast write data store like Cassandra or some other queue like Kafka for further real–time downstream processing:

As per the figure, we collect real–time streaming data from diverse data sources, through push/pull collection agents like Flume, Logstash, FluentD, or Kafka adapters. Then, the data is written to Kafka partitions, Storm topologies pull/read the streaming data from Kafka and processes this flowing data in its topology, and writes the insights/results to Cassandra or some other real–time dashboards.

NRT – The Spark solution

At a very high–level, the data flow pipeline with Spark is very similar to the Storm architecture depicted in the previous figure, but one the most critical aspects of this flow is that Spark leverages HDFS as a distributed storage layer. Here, have a look before we get into further dissection of the overall flow and its nitty–gritty:

As with a typical real–time analytic pipeline, we ingest the data using one of the streaming data grabbing agents like Flume or Logstash. We introduce Kafka to ensure decoupling into the system between the sources agents. Then, we have the Spark streaming component that provides a distributed computing platform for processing the data, before we dump the results to some stable storage unit, dashboard, or Kafka queue.

One essential difference between previous two architectural paradigms is that, while Storm is essentially a real–time transactional processing engine that is, by default, good at processing the incoming data event by event, Spark works on the concept of micro–batching. It's essentially a pseudo real–time compute engine, where close to real–time compute expectations can be met by reducing the micro batch size. Storm is essentially designed for lightning fast processing, thus all transformations are in memory because any disk operation generates latency; this feature is a boon and bane for Storm (because memory is volatile if things break, everything has to be reworked and intermediate results are lost). On the other hand, Spark is essentially backed up by HDFS and is robust and more fault tolerant, as intermediaries are backed up in HDFS.

Over the last couple of years, big data applications have seen a brilliant shift as per the following sequence:

  1. Batch only applications (early Hadoop implementations)
  2. Streaming only (early Storm implementations)
  3. Could be both (custom made combinations of the previous two)
  4. Should be both (Lambda architecture)

Now, the question is: why did the above evolution take place? Well, the answer is that, when folks were first acquainted with the power of Hadoop, they really liked building the applications which could process virtually any amount of data and could scale up to any level in a seamless, fault tolerant, non–disruptive way. Then we moved to an era where people realized the power of now and ambitious processing became the need, with the advent of scalable, distributed processing engines like Storm. Storm was scalable and came with lighting–fast processing power and guaranteed processing. But then, something changed; we realized the limitations and strengths of both Hadoop batch systems and Storm real–time systems: the former were catering to my appetite for volume and the latter was excellent at velocity. My real–time applications were perfect, but they were performing over a small window of the entire data set and did not have any mechanism for correction of data/results at some later time. Conversely, while my Hadoop implementations were accurate and robust, they took a long time to arrive at any conclusive insight. We reached a point where we replicated complete/part solutions to arrive at a solution involving the combination of both batch and real–time implementations. One of the very recent NRT architectural patterns is Lambda architecture, which is a most sought after solution that combines the best of both batch and real–time implementations, without having any need to replicate and maintain two solutions. It gives me volume and velocity, which is an edge over earlier architecture, and it can cater to a wider set of use cases.


Lambda architecture – analytics possibilities

Now that we have introduced this wonderful architectural pattern, let's take a closer look at it before delving into the possible analytic use cases that can be implemented with this new pattern.

We all know that, at base level, Hadoop gives me vast storage, and has HDFS and a very robust processing engine in the form of MapReduce, which can handle a humongous amount of data and can perform myriad computations. However, it has a long turnaround time (TAT) and it's a batch system that helps us cope with the volume aspect of big data. If we need speed and velocity for processing and are looking for a low–latency solution, we have to resort to a real–time processing engine that could quickly process the latest or the recent data and derive quick insights that are actionable in the current time frame. But along with velocity and quick TAT, we also need newer data to be progressively integrated into the batch system for deep batch analytics that are required to execute on entire data sets. So, essentially we land in a situation where I need both batch and real–time systems, the optimal architectural combination of this pattern is called Lambda architecture (λ).

The following figure captures the high–level design of this pattern:

The solution is both technology and language agnostic; at a high–level it has the following three layers:

  • The batch layer
  • The speed layer
  • The serving layer

The input data is fed to both the batch and speed layers, where the batch layer works at creating the precomputed views of the entire immutable master data. This layer is predominately an immutable data store with write once and many bulk reads.

The speed layer handles the recent data and maintains only incremental views over the recent set of the data. This layer has both random reads and writes in terms of data accessibility.

The crux of the puzzle lies in the intelligence of the serving layer, where the data from both the batch and speed layers is merged and the queries are catered for, so we get the best of both the worlds seamlessly. The close to real–time requests are handled from the data from the incremental views (they have low retention policy) from the speed layer while the queries referencing the older data are catered to by the master data views generated in the batch layer. This layer caters only to random reads and no random writes, though it does handle batch computations in the form of queries and joins and bulk writes.

However, Lambda architecture is not a one-stop solution for all hybrid use cases; there are some key aspects that need to be taken care of:

  • Always think distributed
  • Account and plan for failures
  • Rule of thumb: data is immutable
  • Finally, plan for failures

Now that we have acquainted ourselves well with the prevalent architectural patterns in real–time analytics, let us talk about the use cases that are possible in this segment:

The preceding figure highlights the high–level domains and various key use cases that may be executed.


IOT – thoughts and possibilities

The Internet of Things: the term that was coined in 1999 by Kevin Ashton, has become one of the most promising door openers of the decade. Although we had an IoT precursor in the form of M2M and instrumentation control for industrial automation, the way IoT and the era of connected smart devices has arrived, is something that has never happened before. The following figure will give you a birds–eye view of the vastness and variety of the reach of IoT applications:

We are all surrounded by devices that are smart and connected; they have the ability to sense, process, transmit, and even act, based on their processing. The age of machines that was a science fiction a few years ago has become reality. I have connected vehicles that can sense and get un–locked/locked if I walk to them or away from them with keys. I have proximity sensing beacons in my supermarkets which sense my proximity to shelf and flash the offers to my cell phone. I have smart ACs that regulate the temperature based on the number of people in the room. My smart offices save electricity by switching the lights and ACs off in empty conference rooms. The list seems to be endless and growing every second.

At the heart of it IoT, is nothing but an ecosystem of connected devices which have the ability to communicate over the internet. Here, devices/things could be anything, like a sensor device, a person with a wearable, a place, a plant, an animal, a machine – well, virtually any physical item you could think of on this planet can be connected today. There are predominantly seven layers to any IoT platform; these are depicted and described in the following figure:

Following is quick description of all the 7 IoT application layers:

  • Layer 1: Devices, sensors, controllers and so on
  • Layer 2: Communication channels, network protocols and network elements, the communication, and routing hardware — telecom, Wi–Fi, and satellite
  • Layer 3: Infrastructure — it could be in-house or on the cloud (public, private, or hybrid)
  • Layer 4: Here comes the big data ingestion layer, the landing platform where the data from things/devices is collected for the next steps
  • Layer 5: The processing engine that does the cleansing, parsing, massaging, and analysis of the data using complex processing, machine learning, artificial intelligence, and so on, to generate insights in form of reports, alerts, and notifications
  • Layer 6: Custom apps, the pluggable secondary interfaces like visualization dashboards, downstream applications, and so on form part of this layer
  • Layer 7: This is the layer that has the people and processes that actually act on the insights and recommendations generated from the following systems

At an architectural level, a basic reference architecture of an IOT application is depicted in the following image:

In the previous figure, if we start with a bottom up approach, the lowest layers are devices that are sensors or sensors powered by computational units like RaspberryPi or Ardunio. The communication and data transference is generally, at this point, governed by lightweight options like Messaging Queue Telemetry Transport (MQTT) and Constrained Application protocol (CoAP) which are fast replacing the legacy options like HTTP. This layer is actually in conjunction to the aggregation or bus layer, which is essentially a Mosquitto broker and forms the event transference layer from source, that is, from the device to the processing hub. Once we reach the processing hub, we have all the data at the compute engine ready to be swamped into action and we can analyse and process the data to generate useful actionable insights. These insights are further integrated to web service API consumable layers for downstream applications. Apart from these horizontal layers, there are cross–cutting layers which handle the device provisioning and device management, identity and access management layer.

Now that we understand the high–level architecture and layers for standard IoT application, the next step is to understand the key aspects where an IoT solution is constrained and what the implications are on overall solution design:

  • Security: This is a key concern area for the entire data-driven solution segment, but the concept of big data and devices connected to the internet makes the entire system more susceptible to hacking and vulnerable in terms of security, thus making it a strategic concern area to be addressed while designing the solution at all layers for data at rest and in motion.
  • Power consumption/battery life: We are devising solutions for devices and not human beings; thus, the solutions we design for them should be of very low power consumption overall without taxing or draining battery life.
  • Connectivity and communication: The devices, unlike humans, are always connected and can be very chatty. Here again, we need a lightweight protocol for overall communication aspects for low latency data transfer.
  • Recovery from failures: These solutions are designed to run for billions of data process and in a self–sustaining 24/7 mode. The solution should be built with the capability to diagnose the failures, apply back pressure and then self–recover from the situation with minimal data loss. Today, IoT solutions are being designed to handle sudden spikes of data, by detecting a latency/bottle neck and having the ability to auto–scale–up and down elastically.
  • Scalability: The solutions need to be designed in a mode that its linearly scalable without the need to re–architect the base framework or design, the reason being that this domain is exploding with an unprecedented and un–predictable number of devices being connected with a whole plethora of future use cases which are just waiting to happen.

Next are the implications of the previous constraints of the IoT application framework, which surface in the form of communication channels, communication protocols, and processing adapters.

In terms of communication channel providers, the IoT ecosystem is evolving from telecom channels and LTEs to options like:

  • Direct Ethernet/WiFi/3G
  • LoRA
  • Bluetooth Low Energy (BLE)
  • RFID/Near Field communication (NFC)
  • Medium range radio mesh networks like Zigbee

For communication protocols, the de–facto standard that is on the board as of now is MQTT, and the reasons for its wide usage are evident:

  • It is extremely light weight
  • It has very low footprint in terms of network utilization, thus making the communication very fast and less taxing
  • It comes with a guaranteed delivery mechanism, ensuing that the data will eventually be delivered, even over fragile networks
  • It has low power consumption
  • It optimizes the flow of data packets over the wire to achieve low latency and lower footprints
  • It is a bi–directional protocol, and thus is suited both for transferring data from the device as well as transferring the data to the device
  • Its better suited for a situation in which we have to transmit a high volume of short messages over the wire

Edge analytics

Post evolution and IOT revolution, edge analytics are another significant game changer. If you look at IOT applications, the data from the sensors and devices needs to be collated and travels all the way to the distributed processing unit, which is either on the premises or on the cloud. This lift and shift of data leads to significant network utilization; it makes the overall solution latent to transmission delays.

These considerations led to the development of a new kind of solution and in turn a new arena of IOT computations — the term is edge analytics and, as the name suggests, it's all about pushing the processing to the edge, so that the data is processed at its source.

The following figure shows the bifurcation of IOT into:

  • Edge analytics
  • Core analytics

As depicted in the previous figure, the IOT computation is now divided into segments, as follows:

  • Sensor–level edge analytics: Wherein data is processed and some insights are derived at the device level itself
  • Edge analytics: These are the analytics wherein the data is processed and insights are derived at the gateway level
  • Core analytics: This flavour of analytics requires all data to arrive at a common compute engine (distributed storage and distributed computation) and then the high–complexity processing is done to derive actionable insights for people and processes

Some of the typical use cases for sensor/edge analytics are:

  • Industrial IOT (IIOT): The sensors are embedded in various pieces of equipment, machinery, and sometimes even shop floors. The sensors generate data and the devices have the innate capability to process the data and generate alerts/recommendations to improve the performance or yield.
  • IoT in health care: Smart devices can participate in edge processing and can contribute to detection of early warning signs and raise alerts for appropriate medical situations
  • In the world of wearable devices, edge processing can make tracking and safety very easy

Today, If I look around, my world is surrounded by connected devices—like my smart AC, my smart refrigerator, and smart TV; they all send out the data to a central hub or my mobile phone and are easily controllable from there. Now, the things are actually getting smart; they are evolving from being connected, to being smart enough to compute, process, and predict. For instance, my coffee maker is smart enough to be connected to my car traffic, my office timing, so that it predicts my daily routine and my arrival time and has hot fresh coffee ready the moment I need it.


Cloud – considerations for NRT and IOT

Cloud is nothing but a term used to identify a capability where computational capability is available over internet. We have all been acquainted with physical machines, servers, and data centres. The advent of the cloud has taken us to a world of virtualization where we are moving out to virtual nodes, virtualized clusters, and even virtual data centers. Now, I can have virtualization of hardware to play with and have my clusters built using VMs spawned over a few actual machines. So, it's like having software at play over physical hardware. The next step was the cloud, where we have all virtual compute capability hosted and available over the net.

The services that are part of the cloud bouquet are:

  • Infrastructure as a Service (IaaS): It's basically a cloud variant of fundamental physical computers. It actually replaces the actual machines, the servers and hardware storage and the networking by a virtualization layer operating over the net. The IaaS lets you build this entire virtual infrastructure, where it's actually software that's imitating actual hardware.
  • Platform as a Service (PaaS): Once we have sorted out the hardware virtualization part, the next obvious step is to think about the next layer that operates over the raw computer hardware. This is the one that gels the programs with components like databases, servers, file storage, and so on. Here, for instance, if a database is exposed as PaaS then the programmer can use that as a service without worrying about the lower details like storage capacity, data protection, encryption, replication, and so on. Renowned examples of PaaS are Google App Engine and Heroku.
  • Software as a Service (SaaS): This one is the topmost layer in the stack of cloud computation; it's actually the layer that provides the solution as a service. These services are charged on a per user or per month basis and this model ensures that the end users have flexibility to enroll for and use the service without any license fee or locking period. Some of the most widely known, and typical, examples are, and Google Apps.

Now that we have been introduced to and acquainted with cloud, the next obvious point to understand is what this buzz is all about and why is it that the advent of the cloud is closing curtains on the era of traditional data centers. Let's understand some of the key benefits of cloud computing that have actually made this platform a hot selling cake for NRT and IOT applications

  • It's on demand: The users can provision the computing components/resources as per the need and load. There is no need to make huge investments for the next X years on infrastructure in the name of future scaling and headroom. One can provision a cluster that's adequate to meet current requirements and that can then be scaled when needed by requesting more on–demand instances. So, the guarantee I get here as a user, is that I will get an instance when I demand for the same.
  • It lets us build truly elastic applications, which means that depending upon the load and the need, my deployments can scale up and scale down. This is a huge advantage and the way the cloud does it is very cost effective too. If I have an application which sees an occasional surge in traffic on the first of every month, then, on the cloud I don't have to provision the hardware required to meet the demand of surge on the first of the month for the entire 30 days. Instead, I can provision what I need on an average day and build a mechanism to scale up my cluster to meet the surge on the first and then automatically scale back to average size on the second of the month.
  • It's pay as you go: well, this is the most interesting feature of the cloud that beats the traditional hardware provisioning system, where to set up a data centers one has to plan up front for a huge investment. Cloud data centre, don't invite any such cost and one can pay only for the instances that are running, and that too, is generally handled on an hourly basis.


In this chapter we have discussed various aspects of the big data technology landscape and big data as an infrastructure and computation candidate. We walked the reader through various considerations and caveats to be taken into account while designing and deciding upon the big data infrastructural space. We had our uses introduced to reality of real–time analytics, the NRT architecture, and also touched upon the vast variety of use cases which can possibly be addressed by harnessing the power of IOT and NRT. Towards the end of the chapter, we briefly touched upon the concept of edge computing and cloud infrastructure for IOT.

In the next chapter, we will have the readers moving a little deeper into the real–time analytical application, architecture, and concepts. We will touch upon the basic building blocks of an NRT application, the technology stack required and the challenges encountered while developing it.

About the Authors

  • Shilpi Saxena

    Shilpi Saxena is an IT professional and also a technology evangelist. She is an engineer who has had exposure to various domains (machine to machine space, healthcare, telecom, hiring, and manufacturing). She has experience in all the aspects of conception and execution of enterprise solutions. She has been architecting, managing, and delivering solutions in the big data space for the last 3 years. She also handles a high-performance and geographically-distributed team of elite engineers.

    Shilpi has more than 12 years (3 years in the big data space) of experience in the development and execution of various facets of enterprise solutions both in the products and services dimensions of the software industry. An engineer by degree and profession, she has worn varied hats, such as developer, technical leader, product owner, tech manager, and so on, and she has seen all the flavors that the industry has to offer. She has architected and worked through some of the pioneers' production implementations in Big Data on Storm and Impala with auto-scaling in AWS.

    Shilpi also authored Real-time Analytics with Storm and Cassandra with Packt Publishing.

    Browse publications by this author
  • Saurabh Gupta

    Saurabh Gupta is an software engineer who has worked aspects of software requirements, designing, execution, and delivery. Saurabh has more than 3 years of experience working in Big Data domain. Saurabh is handling and designing real time as well as batch processing projects running in production including technologies like Impala, Storm, NiFi, Kafka and deployment on AWS using Docker. Saurabh also worked in product development and delivery.

    Saurabh has total 10 years (3+ years in Big Data) rich experience in IT industry. Saurabh has exposure in various IOT use-cases including Telecom, HealthCare, Smart city, Smart cars and so on.

    Browse publications by this author
Practical Real-time Data Processing and Analytics
Unlock this book and the full library FREE for 7 days
Start now