In the previous chapter, we discussed the various transformations and actions that we can execute on data in Spark. Before Spark can process data, however, it must load that data from external storage systems. In this chapter, we will learn how to read data into Spark from, and write it back to, different storage systems. We will also discuss the libraries that help to process different varieties of structured and unstructured data in Spark.
You're reading from Apache Spark 2.x for Java Developers
As we know, Spark is a processing engine that can process humongous amounts of data; however, before it can process that data, Spark must read it from external systems. In this section, we will learn how to read data into Spark from, and write it back to, different storage systems.
We will start with the local filesystem and then will implement Spark with some popular storage systems used in the big data world.
Reading data from a local filesystem in Spark is straightforward. Let's discuss this with examples, as follows:
First, create (or reuse) the Maven project described in the previous chapter and create a Java class (with a main method) for our application. We will start by creating a JavaSparkContext:

SparkConf conf = new SparkConf().setMaster("local").setAppName("Local File system Example");
JavaSparkContext jsc = new JavaSparkContext(conf);
To read a text file in Spark, use the textFile() method of JavaSparkContext. It returns a JavaRDD<String> in which each element is one line of the file.
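As a minimal sketch of the above (the class name and the file path here are placeholders, not from the original text), reading a local text file and counting its lines might look like this:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalFileExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setMaster("local")
                .setAppName("Local File system Example");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // textFile() returns a JavaRDD<String>, one element per line of the file.
        // The path below is a placeholder; point it at any text file on your machine.
        JavaRDD<String> lines = jsc.textFile("/path/to/sample.txt");

        System.out.println("Number of lines: " + lines.count());
        jsc.close();
    }
}
```

Because the master is set to local, this runs entirely inside the JVM that launches it, which is convenient for experimenting before moving to a cluster.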
Apache Spark extensively supports various file formats either natively or with the support of libraries written in Java or other programming languages. Compressed file formats, as well as Hadoop's file format, are very well integrated with Spark. Some of the common file formats widely used in Spark are as follows:
Plain text can be read in Spark by calling the textFile() function on SparkContext. However, for specially formatted text, such as files separated by whitespace, tabs, or tildes (~), users need to iterate over each line of the text using the map() function and then split each line on the specific separator character, such as the tilde (~) in the case of tilde-separated files.
Consider that we have a tilde-separated file containing data about people in the following format:
name~age~occupation
Let's load this file as an RDD of Person objects, as follows:
Person POJO:

public class Person implements Serializable {
    private String name;
    private Integer age;
    private String occupation;
    // Getters and setters omitted for brevity
    ...
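A sketch of the load itself, assuming the file lives at a placeholder path and that the Person POJO exposes the usual setters (setName, setAge, setOccupation — an assumption; adapt the mapping to whatever accessors your class actually has):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TildeFileExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setMaster("local")
                .setAppName("Tilde Separated File Example");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // The path is a placeholder for a file with lines like: John~28~Engineer
        JavaRDD<String> lines = jsc.textFile("/path/to/people.txt");

        // Split each line on the tilde and map the parts onto a Person object
        JavaRDD<Person> people = lines.map(line -> {
            String[] parts = line.split("~");
            Person p = new Person();
            p.setName(parts[0]);
            p.setAge(Integer.parseInt(parts[1]));
            p.setOccupation(parts[2]);
            return p;
        });

        System.out.println("Loaded " + people.count() + " person records");
        jsc.close();
    }
}
```

Note that the lambda passed to map() is shipped to executors, which is why Person must implement Serializable.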
In the first part of this chapter, we talked about how to load data into Spark from various data sources, with code examples for connecting to popular sources such as HDFS and S3. In the later part, we discussed processing data in some widely used structured formats, again with code examples.
In the next chapter, we will discuss Spark clusters in detail: the cluster setup process and some popular cluster managers available for Spark. We will also look at how to debug Spark applications in cluster mode.