Working with Data at Scale

Data is being produced at an accelerating pace as technology advances. The widespread adoption of the Internet of Things (IoT) is a good example: tens of billions of purpose-built IoT devices are already deployed, and their number is growing rapidly. Many of these devices use their sensors to continually produce observations as data. Each individual observation may be small, but taken together the data becomes enormous. IoT is just one example of how much data is being created, and how fast.

This kind of data is sometimes referred to as big data: data that is too big to store and process on a single machine. Big data has three important properties:

  • Variety: Data in different formats and structures
  • Velocity: New data arriving at a fast rate
  • Volume: Huge overall data size

In the prior chapters, we learned how to deal...

Working with data at scale

Working with data at scale significantly changes how data is analyzed and processed. To get an intuition for the problems that scale introduces, let's look at the simple problem of computing the median of a set of numbers. The median is the middle value of the sorted data: it splits the data into two halves. Use the following numbers as an example:

8 1 2 7 9 0 5 

We will first sort the numbers in ascending order:

0 1 2 5 7 8 9

The median value is 5, because it splits the data into two halves: half of the values are below five and the other half are above five.

Now, let's imagine that the count of these numbers was on the order of billions. Let's explore a solution to this problem in the Scala REPL. Traditionally, we would need to perform the following steps to compute the median value (a sketch follows the steps below):

  1. Load the data into memory on a single computer's...
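The following is a minimal, single-machine sketch of this traditional approach in Scala, which you can paste into the REPL. The variable names (numbers, sorted) are illustrative placeholders, not from the original text:

// Load the data into memory, here as a small in-memory collection
// standing in for the full dataset.
val numbers = Vector(8, 1, 2, 7, 9, 0, 5)

// Sort the numbers in ascending order.
val sorted = numbers.sorted   // Vector(0, 1, 2, 5, 7, 8, 9)

// Pick the middle element; average the two middle elements when
// the count is even.
val n = sorted.length
val median: Double =
  if (n % 2 == 1) sorted(n / 2)
  else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0

println(median)   // 5.0

This approach breaks down once the data no longer fits in a single machine's memory: sorting billions of values requires either external (disk-based) sorting or a distributed approach.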

Cost considerations

As the size of data grows, there are many factors to consider to manage costs effectively. Some of the costs associated with data are direct, while others are indirect. A clear and well-defined data strategy plays a central role in managing these costs and maximizing the value of data.

Cost can be considered from multiple points of view:

  • Data storage
  • Data governance

Data storage

Not all data is created equal. Some types of data have more value than others. The value of data might also be sensitive to its age, and might start to decrease as the data ages. At the same time, some data is accessed more frequently than other data. All of these factors, and many more, determine how the...

Reliability considerations

Processing large datasets requires looking at reliability from a slightly different point of view. It is quite common for such large datasets to contain a small percentage of errors, and an acceptable error-tolerance level can only be defined by business rules. Large datasets are also generally processed by a network of computers, where failures are more common than on a single computer. In this section, we will look at the following aspects of error handling:

  • Input data errors
  • Processing failures

Input data errors

As a general guideline, it is crucial to measure and monitor the number of errors in the input data over time. If the quality of the input data is bad, then any analysis performed...
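As an illustration, here is a minimal sketch in Scala of measuring errors while parsing input, using scala.util.Try to separate good records from bad ones instead of failing on the first bad record. The record format and values are assumptions for the example, not from the book:

import scala.util.Try

// Hypothetical raw input: each record is expected to be an integer.
val rawRecords = Seq("42", "17", "not-a-number", "8", "")

// Attempt to parse every record, collecting failures rather than
// aborting the whole run on the first bad record.
val parsed = rawRecords.map(r => Try(r.trim.toInt))
val (good, bad) = parsed.partition(_.isSuccess)

val errorRate = bad.size.toDouble / rawRecords.size
println(f"parsed=${good.size} errors=${bad.size} errorRate=$errorRate%.2f")

Whether the resulting error rate is acceptable is ultimately a business decision: the measured rate should be tracked over time and compared against the tolerance defined by business rules.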

Summary

In this chapter, we looked at working with data at scale. Working with large datasets requires a paradigm shift in how the data is processed. Traditional methods that work well with smaller datasets generally don't work with large ones, because they are designed to run on a single computer, and they need to be re-engineered to work effectively at scale. For scalability, we need to turn to distributed computing; however, this introduces significant additional complexity, because a network is involved and failures are more common. Using good, time-tested frameworks, such as Apache Spark, is the key to addressing these concerns.
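To connect this back to the median example, here is a minimal sketch of how the same computation might look in Apache Spark, which can approximate quantiles across a cluster instead of sorting everything on one machine. The dataset here is a tiny stand-in; approxQuantile is part of Spark's DataFrame API (available since Spark 2.0):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DistributedMedian")
  .getOrCreate()
import spark.implicits._

// In production this would be billions of rows read from
// distributed storage; a small sample is used for illustration.
val values = Seq(8, 1, 2, 7, 9, 0, 5).toDF("value")

// 0.5 is the median quantile; the last argument is the acceptable
// relative error (0.0 requests an exact result at a higher cost).
val Array(median) = values.stat.approxQuantile("value", Array(0.5), 0.0)

println(s"approximate median = $median")

Trading a small amount of accuracy (a non-zero relative error) for a large reduction in computation and data shuffling is a common, deliberate design choice when working at scale.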
