Chapter 4. Working with External Data Sources

Apache Spark depends on a big data pipeline to get data, and that pipeline starts with source systems. Ingesting data from source systems can be arbitrarily complex for the following reasons:

  • Nature of the data (relational, non-relational)
  • Data being dirty (this is the rule rather than the exception)
  • Source data being at a different level of normalization (SAP data, for example, has an extremely high degree of normalization)
  • Lack of consistency in the data (data needs to be harmonized so that it speaks the same language)

In this chapter, we will explore how Apache Spark connects to various data sources. This chapter is divided into the following recipes:

  • Loading data from the local filesystem
  • Loading data from HDFS
  • Loading data from Amazon S3
  • Loading data from Apache Cassandra

Introduction


Spark provides a unified runtime for big data. Hadoop Distributed File System (HDFS) has traditionally been the most widely used storage platform for Spark, as it provides the most cost-effective storage for unstructured and semi-structured data on commodity hardware. This has been upended by public cloud storage systems, especially Amazon S3, and this edition of the book reflects that reality with a special emphasis on connectivity to S3.

That being said, Spark leans heavily on Hadoop's InputFormat and OutputFormat interfaces. InputFormat is responsible for creating InputSplits from the input data and dividing each split further into records; OutputFormat is responsible for writing to storage.
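
To make this concrete, here is a minimal sketch of reading a file through Hadoop's TextInputFormat directly, via Spark's newAPIHadoopFile in the Spark shell (where sc is predefined). The input path is a placeholder; in practice, the higher-level APIs used later in this chapter do this for you:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// key = byte offset of each line, value = the line itself,
// exactly as TextInputFormat defines its records
// ("/path/to/input" is a placeholder path)
val records = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("/path/to/input")

// Discard the offsets to recover plain lines of text
val lines = records.map { case (_, text) => text.toString }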

We will start by loading data from the local filesystem and then move on to loading data from HDFS. In the Loading data from HDFS recipe, we will cover the most common file format: regular text files. We will also explore loading data stored in Amazon S3...

Loading data from the local filesystem


Though the local filesystem is not a good fit for storing big data, due to disk size limitations and its lack of a distributed nature, you can technically load data into a distributed system from the local filesystem. The file or directory being accessed, however, has to be available at the same path on each node.

Note

Please note that loading side data this way is not a good idea. For side data, Spark provides the broadcast variable feature. Enriching data with side data is a very common use case, and we will cover how to do it in subsequent chapters.
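
As a quick, hedged illustration of the broadcast approach (the tiny dataset and mapping below are invented for this sketch; the real recipes appear in later chapters):

import spark.implicits._

// Small side-data table, broadcast once to every executor
val bNames = spark.sparkContext.broadcast(Map("IN" -> "India", "US" -> "United States"))

// Toy dataset of country codes to enrich with the side data
val codes = spark.createDataset(Seq("IN", "US", "BR"))
val named = codes.map(c => bNames.value.getOrElse(c, "Unknown"))
named.show()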

In this recipe, we will look at how to load data in Spark from the local filesystem.

How to do it...

Let's start with the example of Shakespeare's "to be or not to be":

  1. Create the words directory by using the following command:
$ mkdir words
  2. Get into the words directory:
$ cd words
  3. Create the sh.txt text file and enter "to be or not to be" in it:
$ echo "to be or not to be" > sh.txt

Loading data from HDFS


HDFS is the second most widely used big data storage system after Amazon S3. One of the reasons for HDFS's wide adoption is schema on read: HDFS puts no restrictions on data when it is being written. Any and all kinds of data are welcome and can be stored in a raw format. This feature makes it the ideal storage for raw unstructured and semi-structured data.

When it comes to reading data, even unstructured data needs to be given some structure to make sense. Hadoop uses InputFormat to determine how to read the data. Spark provides complete support for Hadoop's InputFormat, so anything that can be read by Hadoop can be read by Spark as well.

The default InputFormat is TextInputFormat. TextInputFormat takes the byte offset of a line as a key and the content of a line as a value. Spark uses the spark.read.textFile method to read using TextInputFormat. It ignores the byte offset and creates a dataset of strings.
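
As an illustrative sketch, reading from HDFS differs from the local filesystem only in the URI. The NameNode host and port (localhost:9000) and the paths below are placeholder assumptions:

$ hdfs dfs -put words /user/hduser/words
$ spark-shell

// spark.read.textFile drops the byte-offset keys and yields a Dataset[String]
val words = spark.read.textFile("hdfs://localhost:9000/user/hduser/words")
words.first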

In this recipe, we...

Loading data from Amazon S3


If Spark is a MapReduce killer, Amazon S3 is an HDFS killer. S3 can be thought of as the ultimate dream of cloud storage. It is a foundational service on Amazon Web Services (AWS), and almost every application running on AWS uses S3 for storage. Not only end-user applications but also other AWS services use S3 extensively; the following are a few examples:

  • Amazon Kinesis uses S3 as target storage
  • Amazon Elastic MapReduce (EMR) offers S3 as one of its storage options
  • Amazon Elastic Block Store (EBS) uses S3 to store snapshots
  • Amazon Relational Database Service (RDS) uses S3 to store snapshots
  • Amazon Redshift uses S3 for data staging
  • Amazon DynamoDB uses S3 for data staging

The following are some of the salient features of S3:

  • Eleven nines (99.999999999%) of durability
  • Four nines (99.99%) of availability
  • A typical cost of around $30 per TB per month, with even lower-cost options available
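
A hedged sketch of loading data from S3 via the s3a connector follows. The bucket and path are placeholders, and it assumes a hadoop-aws package matching your Hadoop version and AWS credentials exported in the environment:

$ spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3

// Pass credentials to the s3a filesystem (assumes the standard
// AWS_* environment variables are set)
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hc.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// "my-bucket/words" is a placeholder bucket and key prefix
val data = spark.read.textFile("s3a://my-bucket/words")
data.count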

Amazon Simple Storage Service (S3) provides developers and IT teams with a secure, durable, and scalable storage platform. The biggest advantage of...

Loading data from Apache Cassandra


Apache Cassandra is a NoSQL database with a masterless ring cluster architecture. While HDFS is a good fit for streaming data access, it does not work well with random access. For example, HDFS works well when your average file size is 100 MB and you want to read the whole file; if you frequently access the nth line in a file, or some other part of it as a record, HDFS is too slow.

Relational databases have traditionally provided a solution to this, offering low-latency random access, but they do not work well with big data. NoSQL databases such as Cassandra fill the gap by providing relational-database-like access, but in a distributed architecture on commodity servers.

In this recipe, we will load data from Cassandra as a Spark DataFrame. To make that happen, DataStax, the company behind Cassandra, has contributed the spark-cassandra-connector package. This package lets you load Cassandra tables as DataFrames, write them back to Cassandra, and execute CQL queries.
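
A minimal sketch of what this looks like (the keyspace test and table words are placeholders, the Cassandra host is assumed to be localhost, and the connector version must match your Spark and Scala versions):

$ spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.5 --conf spark.cassandra.connection.host=localhost

// Load a Cassandra table as a DataFrame through the connector's data source
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "words"))
  .load()
df.show()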

How...
