A brief history of data

If you worked in the mainstream IT industry between the 1970s and the early 2000s, it is likely that your organization's data was held either in text-based delimited files, spreadsheets, or nicely structured relational databases. In the case of the latter, data is modeled and persisted in pre-defined, and possibly related, tables representing the various entities found within your organization's data model, such as employees or departments. These tables contain rows of data spread across multiple columns representing the various attributes that make up each entity; in the case of an employee, for example, typical attributes include first name, last name, and date of birth.
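As a concrete, purely hypothetical illustration of such a pre-defined table, the following minimal sketch uses Python's built-in sqlite3 module to define, populate, and query an employee table; the table and column names are assumptions made for the example, not taken from any particular system:

    import sqlite3

    # In-memory database, used purely to illustrate a pre-defined, structured table.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE employee (
            employee_id   INTEGER PRIMARY KEY,
            first_name    TEXT NOT NULL,
            last_name     TEXT NOT NULL,
            date_of_birth TEXT,           -- stored here as an ISO-8601 date string
            department_id INTEGER         -- would relate to a separate department table
        )
    """)
    conn.execute(
        "INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
        (1, "Ada", "Lovelace", "1815-12-10", 10),
    )
    for row in conn.execute("SELECT first_name, last_name, date_of_birth FROM employee"):
        print(row)   # ('Ada', 'Lovelace', '1815-12-10')
    conn.close()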

Vertical scaling

As both your organization's data estate and the number of users requiring access to that data grew, high-performance remote servers would have been utilized, with access provisioned over the corporate network. These remote servers would typically either act as remote filesystems for file sharing or host relational database management systems (RDBMSes) in order to store and manage relational databases. As data requirements grew, these remote servers would have needed to scale vertically, meaning that additional CPU, memory, and/or hard disk space would have been installed. Typically, these relational databases would have stored anything from hundreds up to potentially tens of millions of records.

Master/slave architecture

As a means of providing resilience and load balancing read requests, a master/slave architecture may have been employed, whereby data is automatically copied from the master database server to one or more physically distinct slave database servers using near real-time replication. In this technique, the master server is responsible for all write requests, while read requests can be offloaded and load balanced across the slaves, each of which holds a full copy of the master data. That way, if the master server ever failed for some reason, business-critical read requests could still be processed by the slaves while the master was being brought back online. This technique does have a couple of major disadvantages, however (the basic read/write routing is sketched after the following list):

  • Scalability: The master server, being solely responsible for processing write requests, limits the scalability of the system, as it can quickly become a bottleneck.
  • Consistency and data loss: Since replication is near real-time, it is not guaranteed that the slaves will have the latest data at the point in time that the master server goes offline, meaning that recent transactions may be lost. Depending on the business application, either serving stale data or losing data may be unacceptable.
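The following is a highly simplified, purely illustrative Python sketch of the routing just described: all writes go to the single master, while reads are load balanced (here, round-robin) across the slave replicas. The server addresses and the routing heuristic are assumptions made for the example, not a real client library.

    import itertools

    # Hypothetical server addresses for the master and its slave replicas.
    MASTER = "db-master:5432"
    SLAVES = ["db-slave-1:5432", "db-slave-2:5432"]
    _read_cycle = itertools.cycle(SLAVES)   # simple round-robin over the replicas

    def route(statement: str) -> str:
        """Return the server that should handle the given SQL statement."""
        is_write = statement.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE"))
        return MASTER if is_write else next(_read_cycle)

    print(route("INSERT INTO employee VALUES (2, 'Grace', 'Hopper', '1906-12-09', 10)"))  # db-master:5432
    print(route("SELECT * FROM employee"))   # db-slave-1:5432
    print(route("SELECT * FROM employee"))   # db-slave-2:5432

The sketch also makes the first disadvantage visible: however many slaves are added to the list, every write still has to pass through the single master.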

Sharding

To increase throughput and overall performance, and as single machines reached the limit of what they could scale to vertically in a cost-effective manner, sharding may have been employed. Sharding is one method of horizontal scaling, whereby additional servers are provisioned and the data is physically split across separate database instances residing on each of the machines in the cluster, as illustrated in Figure 1.1.

This approach would have allowed organizations to scale linearly to cater for increased data sizes while reusing existing database technologies and commodity hardware, thereby optimizing costs and performance for small- to medium-sized databases.

Crucially, however, these separate databases are standalone instances and have no knowledge of one another. Therefore, some sort of broker would be required that, based on a partitioning strategy, keeps track of where data is written for each write request and thereafter retrieves data from that same location for read requests. Sharding consequently introduced further challenges of its own, such as processing queries, transformations, and joins that span multiple standalone database instances across multiple servers (without denormalizing the data) while maintaining referential integrity, as well as repartitioning the data as the cluster grows:

Figure 1.1: A simple sharding partitioning strategy
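As a rough illustration of the kind of partitioning strategy such a broker might apply (the details below are assumptions for the example, not necessarily the scheme shown in Figure 1.1), a stable hash of each record's key can decide which standalone database instance the record is written to, with the same function applied again when the record is read back:

    import hashlib

    # Hypothetical addresses for the standalone database instances in the cluster.
    SHARDS = ["shard-0.db.internal", "shard-1.db.internal", "shard-2.db.internal"]

    def shard_for(key: str) -> str:
        """Map a record key to a shard using a hash that is stable across runs."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    # The same key always routes to the same shard, for both writes and reads.
    print(shard_for("employee:42"))
    print(shard_for("employee:42"))   # identical to the line above
    print(shard_for("employee:43"))   # may land on a different shard

Because the mapping depends on the number of shards, adding or removing a server changes where existing keys map to, which is exactly the repartitioning challenge mentioned above.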

Data processing and analysis

Finally, in order to transform, process, and analyze the data sitting in these delimited text-based files, spreadsheets, or relational databases, an analyst, data engineer, or software engineer would typically have written some code.

This code could take the form of, for example, formulas or Visual Basic for Applications (VBA) for spreadsheets, or Structured Query Language (SQL) for relational databases, and would be used for the following purposes (a small SQL-based sketch follows the list):

  • Loading data, including batch loading and data migration
  • Transforming data, including data cleansing, joins, merges, enrichment, and validation
  • Standard statistical aggregations, including computing averages, counts, totals, and pivot tables
  • Reporting, including graphs, charts, tables, and dashboards
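The sketch below illustrates the transformation and aggregation categories above with a single SQL query (a join followed by a grouped count and average), driven from Python's built-in sqlite3 module purely for demonstration. The tables, columns, and figures are invented for the example.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE department (department_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE employee   (employee_id INTEGER PRIMARY KEY, last_name TEXT,
                                 salary REAL, department_id INTEGER);
        INSERT INTO department VALUES (10, 'Engineering'), (20, 'Finance');
        INSERT INTO employee   VALUES (1, 'Lovelace', 90000, 10),
                                      (2, 'Hopper',   95000, 10),
                                      (3, 'Babbage',  70000, 20);
    """)

    # A join (enrichment) followed by standard statistical aggregations per department.
    query = """
        SELECT d.name, COUNT(*) AS headcount, AVG(e.salary) AS avg_salary
        FROM employee e
        JOIN department d ON d.department_id = e.department_id
        GROUP BY d.name
    """
    for row in conn.execute(query):
        print(row)   # e.g. ('Engineering', 2, 92500.0) and ('Finance', 1, 70000.0)
    conn.close()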

To perform more complex statistical calculations, such as generating predictive models, more advanced analysts could turn to programming languages including Python, R, SAS, or even Java.

Crucially, however, this data transformation, processing, and analysis would either have been executed directly on the server on which the data was persisted (for example, SQL statements executed directly on the relational database server, in competition with other business-as-usual read and write requests), or the data would have been moved over the network, via a programmatic query (for example, an ODBC or JDBC connection) or via flat files (for example, CSV or XML files), to another remote analytical processing server. The code could then be executed against that data, assuming, of course, that the remote processing server had sufficient CPU, memory, and/or disk space in its single machine to execute the job in question. In other words, the data would have been moved to the code in some way or another.
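The following sketch makes that last point concrete. An in-memory SQLite database stands in for the remote RDBMS (in practice the connection would go over the network via an ODBC or JDBC driver); the full result set is pulled into the analysis process and only then computed on locally, bounded by that single machine's resources. All names and values are invented for the illustration.

    import sqlite3

    # Stand-in for the remote database server holding the persisted data; in practice
    # the connection would go over the network via an ODBC or JDBC driver.
    remote = sqlite3.connect(":memory:")
    remote.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    remote.executemany("INSERT INTO sales VALUES (?, ?)",
                       [("north", 120.0), ("south", 80.0), ("north", 40.0)])

    # The analytical code first pulls every row across the 'network' into local memory...
    rows = remote.execute("SELECT region, amount FROM sales").fetchall()
    remote.close()

    # ...and the processing then happens here, on the single analytical server,
    # rather than where the data lives.
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    print(totals)   # {'north': 160.0, 'south': 80.0}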

Data becomes big

Fast forward to today—spreadsheets are still commonplace, and relational databases containing nicely structured data, whether partitioned across shards or not, are still very much relevant and extremely useful. In fact, depending on the use case, the data volumes, structure, and the computational complexity of the required processing, it could still be faster and more efficient to store and manage data via an RDBMS and process that data directly on the remote database server using SQL. And, of course, spreadsheets are still great for very small datasets and for simple statistical aggregations. What has changed, however, since the 1970s is the availability of more powerful and more cost-effective technology coupled with the introduction of the internet!

The internet has transformed the very essence of what we mean by data. Whereas before, data was thought of as text and numbers confined to spreadsheets or relational databases, it is now an organic and evolving asset in its own right, created and consumed on a mass scale by anyone who owns a smartphone, TV, or bank account. Data is being created every second around the world in virtually any format you can think of, from social media posts, images, videos, audio, and music to blog posts, online forums, articles, computer log files, and financial transactions. All of this structured, semi-structured, and unstructured data, created in both batch and real time, can no longer be stored and managed by nicely organized, text-based delimited files, spreadsheets, or relational databases, nor can it all be physically moved to a remote processing server every time some analytical code is to be executed. A new breed of technology is required.
