Packt+ | Advance your knowledge in tech

You're reading from Apache Spark 2.x Cookbook

Product typeBook

Published inMay 2017

Reading LevelIntermediate

Publisher

ISBN-139781787127265

Edition1st Edition

Languages

Scala

Tools

Apache Spark

Concepts

Data Processing

Author (1)

Rishi Yadav

Launching Spark on Amazon EC2

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute instances in the cloud. Amazon EC2 provides the following features:

On-demand delivery of IT resources via the Internet
Provisioning of as many instances as you like
Payment for the hours during which you use instances, such as your utility bill
No setup cost, no installation, and no overhead at all
Shutting down or terminating instances when you no longer need them
Making such instances available on familiar operating systems

EC2 provides different types of instances to meet all your compute needs, such as general-purpose instances, microinstances, memory-optimized instances, storage-optimized instances, and others. They also have a Free Tier of microinstances for trial purposes.

Getting ready

The spark-ec2 script comes bundled with Spark and makes it easy to launch, manage, and shut down clusters on Amazon EC2.

Before you start, do the following things: log in to the Amazon AWS account via http://aws.amazon.com.

Click on Security Credentials under your account name in the top-right corner.
Click on Access Keys and Create New Access Key:

Download the key file (let's save it in the /home/hduser/kp folder as spark-kp1.pem).
Set permissions on the key file to 600.
Set environment variables to reflect access key ID and secret access key (replace the sample values with your own values):

$ echo "export AWS_ACCESS_KEY_ID="AKIAOD7M2LOWATFXFKQ"" >> /home/hduser/.bashrc
$ echo "export 
       AWS_SECRET_ACCESS_KEY="+Xr4UroVYJxiLiY8DLT4DLT4D4sxc3ijZGMx1D3pfZ2q"" >> 
         /home/hduser/.bashrc
$ echo "export PATH=$PATH:/opt/infoobjects/spark/ec2" >> /home/hduser/.bashrc

How to do it…

Spark comes bundled with scripts to launch the Spark cluster on Amazon EC2. Let's launch the cluster using the following command:

$ cd /home/hduser
$ spark-ec2 -k <key-pair> -i <key-file> -s <num-slaves> launch <cluster-name>
<key-pair> - name of EC2 keypair created in AWS
<key-file> the private key file you downloaded
<num-slaves> number of slave nodes to launch
<cluster-name> name of the cluster

Launch the cluster with the example value:

$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-
          version 2  -s 3 launch spark-cluster

Sometimes, the default availability zones are not available; in that case, retry sending the request by specifying the specific availability zone you are requesting:

$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem -z us-east-1b --
          hadoop-major-version 2  -s 3 launch spark-cluster

If your application needs to retain data after the instance shuts down, attach EBS volume to it (for example, 10 GB space):

$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-
          version 2 -ebs-vol-size 10 -s 3 launch spark-cluster

If you use Amazon's spot instances, here is the way to do it:

$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem -spot-price=0.15 
          --hadoop-major-version 2  -s 3 launch spark-cluster

Note

Spot instances allow you to name your own price for Amazon EC2's computing capacity. You simply bid for spare Amazon EC2 instances and run them whenever your bid exceeds the current spot price, which varies in real time and is based on supply and demand (source: www.amazon.com).

After completing the preceding launch process, check the status of the cluster by going to the webUI URL that will be printed at the end:

Check the status of the cluster:

Now, to access the Spark cluster on EC2, connect to the master node using secure shell protocol (SSH):

$ spark-ec2 -k spark-kp1 -i /home/hduser/kp/spark-kp1.pem  login spark-cluster

The following image illustrates the result you'll get:

Check the directories in the master node and see what they do:

Directory	Description
`ephemeral-hdfs`	This is the Hadoop instance for which data is ephemeral and gets deleted when you stop or restart the machine.
`persistent-hdfs`	Each node has a very small amount of persistent storage (approximately 3 GB). If you use this instance, data will be retained in that space.
`hadoop-native`	This refers to the native libraries that support Hadoop, such as snappy compression libraries.
`scala`	This refers to the Scala installation.
`shark`	This refers to the Shark installation (Shark is no longer supported and is replaced by Spark SQL).
`spark`	This refers to the Spark installation.
`spark-ec2`	This refers to the files that support this cluster deployment.
`tachyon`	This refers to the Tachyon installation.

Check the HDFS version in an ephemeral instance:

$ ephemeral-hdfs/bin/hadoop version
Hadoop 2.0.0-chd4.2.0

Check the HDFS version in a persistent instance with the following command:

$ persistent-hdfs/bin/hadoop version
Hadoop 2.0.0-chd4.2.0

Change the configuration level of the logs:

$ cd spark/conf

The default log level information is too verbose, so let's change it to Error:

Create the log4.properties file by renaming the template:

$ mv log4j.properties.template log4j.properties

Open log4j.properties in vi or your favorite editor:

$ vi log4j.properties

Change the second line from | log4j.rootCategory=INFO, console to | log4j.rootCategory=ERROR, console.

Copy the configuration to all the slave nodes after the change:

$ spark-ec2/copydir spark/conf

You should get something like this:

Destroy the Spark cluster:

$ spark-ec2 destroy spark-cluster

Author (1)

Rishi Yadav

Rishi Yadav has 19 years of experience in designing and developing enterprise applications. He is an open source software expert and advises American companies on big data and public cloud trends. Rishi was honored as one of Silicon Valley's 40 under 40 in 2014. He earned his bachelor's degree from the prestigious Indian Institute of Technology, Delhi, in 1998. About 12 years ago, Rishi started InfoObjects, a company that helps data-driven businesses gain new insights into data. InfoObjects combines the power of open source and big data to solve business challenges for its clients and has a special focus on Apache Spark. The company has been on the Inc. 5000 list of the fastest growing companies for 6 years in a row. InfoObjects has also been named the best place to work in the Bay Area in 2014 and 2015. Rishi is an open source contributor and active blogger. This book is dedicated to my parents, Ganesh and Bhagwati Yadav; I would not be where I am without their unconditional support, trust, and providing me the freedom to choose a path of my own. Special thanks go to my life partner, Anjali, for providing immense support and putting up with my long, arduous hours (yet again).Our 9-year-old son, Vedant, and niece, Kashmira, were the unrelenting force behind keeping me and the book on track. Big thanks to InfoObjects' CTO and my business partner, Sudhir Jangir, for providing valuable feedback and also contributing with recipes on enterprise security, a topic he is passionate about; to our SVP, Bart Hickenlooper, for taking the charge in leading the company to the next level; to Tanmoy Chowdhury and Neeraj Gupta for their valuable advice; to Yogesh Chandani, Animesh Chauhan, and Katie Nelson for running operations skillfully so that I could focus on this book; and to our internal review team (especially Rakesh Chandran) for ironing out the kinks. I would also like to thank Marcel Izumi for, as always, providing creative visuals. I cannot miss thanking our dog, Sparky, for giving me company on my long nights out. Last but not least, special thanks to our valuable clients, partners, and employees, who have made InfoObjects the best place to work at and, needless to say, an immensely successful organization.
Read more about Rishi Yadav

Other recommended products

Related to this chapter

Learning Spark SQL

In the past year, Apache Spark has been increasingly adopted for development of distributed applications. Spark SQL APIs provides an optimized interface that helps developers build such applications quickly and easily. However, designing web-scale production applications using Spark SQL APIs can be a complex task. Understanding the design and implementation best practices for Spark SQL API based applications before you start your project will help you avoid these problems and ensure that your project is a success. Learning Spark SQL gives an insight into the engineering practices used to design and build real-world Spark-based applications. The hands-on examples will give you the required confidence to work on any future projects you encounter in Spark SQL.

BookSep 2017452 pages

Apache Spark Quick Start Guide

Apache Spark is a ?exible in-memory framework that allows processing of both batch and real-time data. Its unified engine has made it quite popular for big data use cases. This book will help you to quickly get started with Apache Spark 2.0 and write efficient big data applications for a variety of use cases.

BookJan 2019154 pages

Hands-On Data Analysis with Scala

This book will help you perform effective data analysis with Scala using practical examples. You will come across different challenges and their effective solutions for a variety of data processing tasks - be it data exploration, data manipulation, or real-time data analysis using Apache Spark.

BookMay 2019298 pages

Scala and Spark for Big Data Analytics

Over the last few years, Scala has been adopted increasingly, especially in the field of data science and analytics, along with Apache Spark, which is built on Scala and is widely used in the field of analytics. With this book, you’ll learn how to leverage the power of both Scala and Spark to make sense of big data.

BookJul 2017796 pages

Learning Apache Spark 2

Apache Spark is one of the most popular Big Data processing frameworks today, delivering speed, accuracy and real-time results – all in one solution. With this book, you will delve into the world of Apache Spark and learn about the new features introduced in Spark 2, along with the architecture and the associated concepts. A comprehensive guide to Apache Spark 2 for beginners, this book covers everything you need to know to get up and running with Big Data processing, machine learning and stream processing with Apache Spark, and allows you to easily understand each of these concepts through real-world examples.

BookMar 2017356 pages

Machine Learning with Scala Quick Start Guide

Scala as a programming language is a highly scalable integration of object-oriented and functional programming, which makes it easy to build scalable and complex big data applications. This book is a handy guide for machine learning developers and data scientists who want to train effective machine learning models using this popular language.

BookApr 2019220 pages

Mastering Apache Spark 2.x

Apache Spark is an in-memory cluster-based parallel processing system that provides a wide range of functionality like graph processing, machine learning, stream processing and more. This book will familiarize you with the newest features in Apache Spark 2.x, and take you through an exciting journey of complex Big Data processing, analytics, streaming analytics as well as advanced machine learning with Apache Spark. During the course of the book, you will leverage different functionalities and modules of Apache Spark such as Spark SQL, Spark MLlib, Spark Streaming, SparkML and more, to build efficient data processing solutions. By the end of this book, you will have all the necessary knowledge to use Apache Spark effectively in your day to day tasks.

BookJul 2017354 pages

Hands-On Big Data Analytics with PySpark

In this book, you'll learn to implement some practical and proven techniques to improve aspects of programming and administration in Apache Spark. Techniques are demonstrated using practical examples and best practices. You will also learn how to use Spark and its Python API to create performant analytics with large-scale data.

BookMar 2019182 pages

Apache Spark 2.x for Java Developers

Apache Spark is the buzzword in the big data industry right now, especially with the increasing need for real-time streaming and data processing. While Spark is built on Scala, the Spark Java API exposes all the Spark features available in the Scala version for Java developers. This book will show you how you can implement various functionalities of the Apache Spark framework in Java, without stepping out of your comfort zone.

BookJul 2017350 pages

Learning Apache Flink

BookFeb 2017280 pages

PySpark Cookbook

This cookbook presents recipes on leveraging the power of Python and putting it to use in the Apache Spark ecosystem. By the end of this book, you will be able to solve any problem associated with building effective, data-intensive applications and performing machine learning and structured streaming using PySpark.

BookJun 2018330 pages

Learning PySpark

This book will get you to grips with the Spark Python API. You’ll explore how Python can be used with Spark to build scalable and reliable data-intensive applications.

BookFeb 2017274 pages

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages

You're reading from Apache Spark 2.x Cookbook

Launching Spark on Amazon EC2

Getting ready

How to do it…

Note

See also

Author (1)

Learning Spark SQL

Apache Spark Quick Start Guide

Hands-On Data Analysis with Scala

This book will help you perform effective data analysis with Scala using practical examples. You will come across different challenges and their effective solutions for a variety of data processing tasks - be it data exploration, data manipulation, or real-time data analysis using Apache Spark.

Scala and Spark for Big Data Analytics

Learning Apache Spark 2

Machine Learning with Scala Quick Start Guide

Mastering Apache Spark 2.x

Hands-On Big Data Analytics with PySpark

Apache Spark 2.x for Java Developers

Learning Apache Flink

PySpark Cookbook

Learning PySpark

This book will get you to grips with the Spark Python API. You’ll explore how Python can be used with Spark to build scalable and reliable data-intensive applications.

Et al.

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

Mastering Tableau 2023

Building AI Applications with ChatGPT APIs

Building AI Applications with ChatGPT APIs

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

Modern Data Architecture on AWS

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

TinyML Cookbook