Working with Data at Scale

Data is being produced at an accelerating pace as technology advances. The widespread adoption of the Internet of Things (IoT) is a good example: tens of billions of purpose-built IoT devices are already deployed, and their number is growing rapidly. Many of these devices use their sensors to continually produce observations as data. Each individual observation may be small, but taken together the data becomes enormous. IoT is just one example of how much data is being created, and how fast.

This kind of data is sometimes referred to as big data: data that is too big to store and process on a single machine. Big data has three important properties:

  • Variety: Data in different formats and structures
  • Velocity: New data arriving at a fast rate
  • Volume: Huge overall data size

In the prior chapters, we learned how to deal...

Working with data at scale

Working with data at scale significantly changes how data is analyzed and processed. To get an intuition for the problems that scale introduces, let's look at the simple problem of computing the median of a set of numbers. The median is the middle value of the sorted data: it splits the data into two halves. Use the following numbers as an example:

8 1 2 7 9 0 5 

We will first sort the numbers in ascending order:

0 1 2 5 7 8 9

The median value is 5, because it splits the data into two halves: half of the values are below five and the other half are above five.

Now, let's imagine that the count of these numbers was on the order of billions. Let's explore a solution to this problem in the Scala REPL. Traditionally, we would need to perform the following steps to compute the median value (a sketch follows the steps below):

  1. Load the data into memory on a single computer's...
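The following is a minimal, single-machine sketch of this traditional approach in Scala, which you can paste into the REPL. The variable names (numbers, sorted) are illustrative placeholders, not from the original text:

// Load the data into memory, here as a small in-memory collection
// standing in for the full dataset.
val numbers = Vector(8, 1, 2, 7, 9, 0, 5)

// Sort the numbers in ascending order.
val sorted = numbers.sorted   // Vector(0, 1, 2, 5, 7, 8, 9)

// Pick the middle element; average the two middle elements when
// the count is even.
val n = sorted.length
val median: Double =
  if (n % 2 == 1) sorted(n / 2)
  else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0

println(median)   // 5.0

This approach breaks down once the data no longer fits in a single machine's memory: sorting billions of values requires either external (disk-based) sorting or a distributed approach.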

Cost considerations

As the size of data grows, there are many factors to consider to manage costs effectively. Some of the costs associated with data are direct, while others are indirect. A clear and well-defined data strategy plays a central role in managing these costs and maximizing the value of data.

Cost can be considered from multiple points of view:

  • Data storage
  • Data governance

Data storage

Not all data is created equal. Some types of data have more value than others. The value of data might also be sensitive to its age, and might start to decrease as the data ages. At the same time, some data is accessed more frequently than other data. All of these factors, and many more, determine how the...

Reliability considerations

Processing large datasets requires looking at reliability from a slightly different point of view. It is quite common for such large datasets to contain a small percentage of errors, and an acceptable error-tolerance level can only be defined by business rules. Large datasets are also generally processed by a network of computers, where failures are more common than on a single computer. In this section, we will look at the following aspects of error handling:

  • Input data errors
  • Processing failures

Input data errors

As a general guideline, it is crucial to measure and monitor the number of errors in the input data over time. If the quality of the input data is bad, then any analysis performed...
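As an illustration, here is a minimal sketch in Scala of measuring errors while parsing input, using scala.util.Try to separate good records from bad ones instead of failing on the first bad record. The record format and values are assumptions for the example, not from the book:

import scala.util.Try

// Hypothetical raw input: each record is expected to be an integer.
val rawRecords = Seq("42", "17", "not-a-number", "8", "")

// Attempt to parse every record, collecting failures rather than
// aborting the whole run on the first bad record.
val parsed = rawRecords.map(r => Try(r.trim.toInt))
val (good, bad) = parsed.partition(_.isSuccess)

val errorRate = bad.size.toDouble / rawRecords.size
println(f"parsed=${good.size} errors=${bad.size} errorRate=$errorRate%.2f")

Whether the resulting error rate is acceptable is ultimately a business decision: the measured rate should be tracked over time and compared against the tolerance defined by business rules.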

Summary

In this chapter, we looked at working with data at scale. Working with large datasets requires a paradigm shift in how the data is processed. Traditional methods that work well with smaller datasets generally don't work with large ones, because they are designed to run on a single computer, and they need to be re-engineered to work effectively at scale. For scalability, we need to turn to distributed computing; however, this introduces significant additional complexity, because a network is involved and failures are more common. Using good, time-tested frameworks, such as Apache Spark, is the key to addressing these concerns.
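To connect this back to the median example, here is a minimal sketch of how the same computation might look in Apache Spark, which can approximate quantiles across a cluster instead of sorting everything on one machine. The dataset here is a tiny stand-in; approxQuantile is part of Spark's DataFrame API (available since Spark 2.0):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DistributedMedian")
  .getOrCreate()
import spark.implicits._

// In production this would be billions of rows read from
// distributed storage; a small sample is used for illustration.
val values = Seq(8, 1, 2, 7, 9, 0, 5).toDF("value")

// 0.5 is the median quantile; the last argument is the acceptable
// relative error (0.0 requests an exact result at a higher cost).
val Array(median) = values.stat.approxQuantile("value", Array(0.5), 0.0)

println(s"approximate median = $median")

Trading a small amount of accuracy (a non-zero relative error) for a large reduction in computation and data shuffling is a common, deliberate design choice when working at scale.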
