You're reading from  Apache Spark 2.x for Java Developers

Product type: Book
Published in: Jul 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781787126497
Edition: 1st
Authors (2):

Sourav Gulati

Sourav Gulati has been associated with the software industry for more than 7 years. He started his career with Unix/Linux and Java and then moved toward the big data and NoSQL world. He has worked on various big data projects and has recently started a technical blog called Technical Learning. Apart from the IT world, he loves to read about mythology.
Read more about Sourav Gulati

Sumit Kumar

Sumit Kumar is a developer with industry insights in telecom and banking. At different junctures, he has worked as a Java and SQL developer, but it is shell scripting that he finds both challenging and satisfying at the same time. Currently, he delivers big data projects focused on batch/near-real-time analytics and distributed indexed querying systems. Besides IT, he takes a keen interest in human and ecological issues.
Read more about Sumit Kumar


Chapter 5. Working with Data and Storage

In the previous chapter, we discussed the various transformations and actions that we can execute on data in Spark. To load data for processing, Spark needs to interact with external storage systems. In this chapter, we will learn how to read and store data in Spark from and to different storage systems. We will also discuss the libraries that help process different varieties of structured and unstructured data in Spark.

Interaction with external storage systems


As we know, Spark is a processing engine that can handle humongous amounts of data; however, that data must first be read from external systems. In this section, we will learn how to read and store data in Spark from and to different storage systems.

We will start with the local filesystem and then integrate Spark with some popular storage systems used in the big data world.

Interaction with local filesystem

Reading data from a local filesystem in Spark is very straightforward. Let's walk through an example:

First, create (or reuse) the Maven project described in the previous chapter and create a Java class with a main method for our application. We will start by creating a JavaSparkContext:

SparkConf conf = new SparkConf().setMaster("local").setAppName("Local File system Example");
JavaSparkContext jsc = new JavaSparkContext(conf);

To read a text file in Spark, the textFile method...
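As a sketch of where this is heading, a minimal complete application might look as follows. The class name, file path, and the "Spark" filter keyword are placeholders for illustration, not from the book:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalFileExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setMaster("local").setAppName("Local File system Example");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    // textFile returns an RDD with one element per line of the file
    JavaRDD<String> lines = jsc.textFile("src/main/resources/input.txt");

    // A simple action: count all lines, and lines mentioning "Spark"
    long total = lines.count();
    long matching = lines.filter(line -> line.contains("Spark")).count();
    System.out.println("Total lines: " + total + ", matching: " + matching);

    jsc.stop();
  }
}
```

Since the master is set to local, this runs entirely on one machine with no cluster required; the path is resolved against the local filesystem.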

Working with different data formats


Apache Spark supports a wide variety of file formats, either natively or through libraries written in Java or other programming languages. Compressed file formats, as well as Hadoop's file formats, are very well integrated with Spark. Some of the file formats widely used with Spark are as follows:

Plain and specially formatted text

Plain text can be read in Spark by calling the textFile() function on SparkContext. However, for specially formatted text, such as files delimited by whitespace, tabs, tildes (~), and so on, users need to iterate over each line of the text using the map() function and split it on the specific delimiter, such as the tilde (~) in the case of tilde-separated files.

Consider that we have a tilde-separated file containing data about people in the following format:

name~age~occupation 

Let's load this file as an RDD of Person objects, as follows:

Person POJO:

public class Person implements Serializable {
  private String Name...
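A fuller sketch of this approach might look as follows. The field names mirror the name~age~occupation format above; the static parse helper and the file path are illustrative choices, not from the book:

```java
import java.io.Serializable;

public class Person implements Serializable {
  private String name;
  private Integer age;
  private String occupation;

  // Parse one tilde-separated line, e.g. "John~28~Engineer", into a Person
  public static Person parse(String line) {
    String[] fields = line.split("~");
    Person p = new Person();
    p.name = fields[0];
    p.age = Integer.parseInt(fields[1]);
    p.occupation = fields[2];
    return p;
  }

  public String getName() { return name; }
  public Integer getAge() { return age; }
  public String getOccupation() { return occupation; }
}
```

With a POJO like this in place, the RDD of Person objects can be built by mapping over each line of the file, for example: `JavaRDD<Person> people = jsc.textFile("people.txt").map(Person::parse);`. The POJO implements Serializable so that Spark can ship instances between the driver and executors.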

Summary


In the first part of this chapter, we talked about how to load data into Spark from various data sources, and we saw code examples for connecting to some popular ones, such as HDFS and S3. Later, we discussed processing data in some widely used structured formats, along with code examples.

In the next chapter, we will discuss Spark clusters in detail, covering the cluster setup process and some popular cluster managers available with Spark. We will also look at how to debug Spark applications in cluster mode.

