Start Working with Spark – REPL and RDDs

"All this modern technology just makes people try to do everything at once."

- Bill Watterson

In this chapter, you will learn how Spark works; then, you will be introduced to RDDs, the basic abstractions behind Apache Spark, and you'll learn that they are simply distributed collections exposing Scala-like APIs. You will then see how to download Spark and how to make it run locally via the Spark shell.

In a nutshell, the following topics will be covered in this chapter:

  • Dig deeper into Apache Spark
  • Apache Spark installation
  • Introduction to RDDs
  • Using the Spark shell
  • Actions and Transformations
  • Caching
  • Loading and saving data

Dig deeper into Apache Spark

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, or SQL workloads requiring fast, interactive access to datasets. Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed application development.

Additional libraries built on top of the core enable workloads for streaming, SQL, graph processing, and machine learning. SparkML, for instance, is designed for data science, and its abstractions make data science easier.

In order to plan and carry out distributed computations, Spark uses the concept of a job, which is executed across the worker nodes using stages and tasks. Spark consists of a driver, which orchestrates...
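As a rough illustration (a minimal sketch, assuming a running Spark shell where sc is predefined): transformations merely describe the computation, and a single action submits one job, which the scheduler splits into stages of parallel tasks.

val numbers = sc.parallelize(1 to 1000, numSlices = 4) // 4 partitions => 4 tasks per stage
val evens = numbers.filter(_ % 2 == 0)                 // transformation: no job yet
val count = evens.count()                              // action: submits one job to the scheduler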

Apache Spark installation

Apache Spark is a cross-platform framework that can be deployed on Linux, Windows, or macOS, as long as Java is installed on the machine. In this section, we will look at how to install Apache Spark.

Apache Spark can be downloaded from http://spark.apache.org/downloads.html

First, let's look at the prerequisites that must be available on the machine:

  • Java 8+ (mandatory as all Spark software runs as JVM processes)
  • Python 3.4+ (optional and used only when you want to use PySpark)
  • R 3.1+ (optional and used only when you want to use SparkR)
  • Scala 2.11+ (optional and used only to write programs for Spark)

Spark can be deployed in three primary modes, which we will look at in turn:

  • Spark standalone
  • Spark on YARN
  • Spark on Mesos
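Whichever mode is chosen, an application selects it through the master URL it connects to. The following is a minimal sketch (the application name is hypothetical, and local[*] stands in for a real cluster URL):

import org.apache.spark.sql.SparkSession

// The master URL selects the deployment mode, for example "local[*]" for
// local threads, "spark://host:7077" for standalone, "yarn" for YARN,
// or "mesos://host:5050" for Mesos.
val spark = SparkSession.builder()
  .appName("DeploymentModeSketch")  // hypothetical application name
  .master("local[*]")               // stand-in; replace with your cluster's URL
  .getOrCreate()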

Spark standalone...

Introduction to RDDs

A Resilient Distributed Dataset (RDD) is an immutable, distributed collection of objects. Spark RDDs are resilient or fault tolerant, which enables Spark to recover the RDD in the face of failures. Immutability makes the RDDs read-only once created. Transformations allow operations on the RDD to create a new RDD but the original RDD is never modified once created. This makes RDDs immune to race conditions and other synchronization problems.

The distributed nature of the RDDs works because an RDD only contains a reference to the data, whereas the actual data is contained within partitions across the nodes in the cluster.

Conceptually, an RDD is a distributed collection of elements spread across multiple nodes in the cluster. To build intuition, you can think of an RDD as a large array of integers whose chunks live on different machines.
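A minimal sketch of this mental model (assuming the Spark shell, where sc is predefined):

// A "large array of integers" split into 8 partitions across the cluster.
val bigArray = sc.parallelize(1 to 1000000, numSlices = 8)
bigArray.getNumPartitions // 8
bigArray.sum()            // each node aggregates its own partitions; results are combined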

An RDD is...

Using the Spark shell

The Spark shell provides a simple way to perform interactive analysis of data. It also enables you to learn the Spark APIs by quickly trying them out. In addition, the similarity to the Scala shell and the support for the Scala APIs let you adapt quickly to Scala language constructs and make better use of the Spark APIs.

The Spark shell implements the concept of a read-evaluate-print loop (REPL), which allows you to interact with it by typing in code that is evaluated as you go. The result is then printed on the console, without the code having to be compiled into an executable first.

Start it by running the following in the directory where you installed Spark:

./bin/spark-shell

When the Spark shell launches, it automatically creates the SparkSession and SparkContext objects. The SparkSession is available as spark and the SparkContext is available as sc.

spark...
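As a quick sanity check (a minimal sketch; typed at the scala> prompt, output omitted):

spark.version                     // the SparkSession created at startup
sc.appName                        // the SparkContext created at startup
sc.parallelize(1 to 5).collect()  // returns Array(1, 2, 3, 4, 5)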

Actions and Transformations

RDDs are immutable, and every operation creates a new RDD. The two main kinds of operations that you can perform on an RDD are transformations and actions.

Transformations create new elements from the elements of an RDD, for example by splitting an input element, filtering out elements, or performing calculations of some sort. Several transformations can be specified in sequence; however, no execution takes place during this planning.

For transformations, Spark adds them to a DAG of computation, and only when the driver requests some data does this DAG actually get executed. This is called lazy evaluation.

The reasoning behind lazy evaluation is that Spark can look at all the transformations and plan the execution, making use of the understanding the driver has of all the operations. For instance, if a filter transformation is applied immediately after some other transformation...
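A minimal sketch of lazy evaluation (assuming the shell's sc; data.txt is a placeholder path):

val lines = sc.textFile("data.txt")             // transformation: the file is not read yet
val errors = lines.filter(_.contains("ERROR"))  // still nothing runs: the DAG just grows
val firstTen = errors.take(10)                  // action: only now is the DAG executed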

Caching

Caching enables Spark to persist data across computations and operations. In fact, this is one of the most important techniques in Spark for speeding up computations, particularly when dealing with iterative workloads.

Caching works by storing as much of the RDD as possible in memory. If there is not enough memory, currently stored data is evicted according to an LRU (least recently used) policy. If the data being cached is larger than the available memory, performance degrades because disk is used instead of memory.

You can mark an RDD as cached using either persist() or cache().

cache() is simply a synonym for persist(MEMORY_ONLY).

persist() can use memory, disk, or both:

persist(newLevel: StorageLevel) 

The following are the possible values for the storage level:

Storage Level    Meaning
MEMORY_ONLY      Stores RDD as deserialized Java objects in the JVM. If the RDD does not...
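As a minimal sketch of both calls (logs.txt is a placeholder path; assumes the shell's sc):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("logs.txt")
logs.persist(StorageLevel.MEMORY_AND_DISK) // keep in memory, spill to disk if it does not fit
logs.count() // the first action materializes and caches the partitions
logs.count() // later actions reuse the cached data instead of re-reading the file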

Loading and saving data

Both loading data into an RDD and saving an RDD to an output system support several different methods. We will cover the most common ones in this section.

Loading data

Loading data into an RDD can be done using the SparkContext. Some of the most common methods are:

  • textFile
  • wholeTextFiles
  • load from a JDBC datasource

textFile

textFile() can be used to load text files into an RDD; each line becomes an element in the RDD.

textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

The following is an example of loading a text file into an RDD using...
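A minimal sketch (README.md is a placeholder path):

val lines = sc.textFile("README.md") // one RDD element per line of the file
lines.count() // number of lines in the file
lines.first() // the first line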

Summary

In this chapter, we discussed the internals of Apache Spark: what RDDs are, the DAGs and lineages of RDDs, and transformations and actions. We also looked at the deployment modes of Apache Spark: standalone, YARN, and Mesos. We then installed Spark on our local machine and looked at the Spark shell and how it can be used to interact with Spark.

In addition, we looked at loading data into RDDs and saving RDDs to external systems, as well as the secret sauce of Spark's phenomenal performance: the caching functionality, and how we can use memory and/or disk to optimize performance.

In the next chapter, Chapter 7, Special RDD Operations, we will dig deeper into the RDD API and how it all works.
