You're reading from Scala Data Analysis Cookbook
Apache Spark is a cluster computing platform that claims to run about 10 times faster than Hadoop. In general terms, we can consider it a means to run our complex logic over massive amounts of data at blazing speed. The other good thing about Spark is that the programs we write are much smaller than the typical MapReduce classes we write for Hadoop. So, not only do our programs run faster, but they also take less time to write.
Spark has four major higher level tools built on top of the Spark Core: Spark Streaming, Spark MLlib (machine learning), Spark SQL (an SQL interface for accessing the data), and GraphX (for graph processing). The Spark Core is the heart of Spark. Spark provides higher level abstractions in Scala, Java, and Python for data representation, serialization, scheduling, metrics, and so on.
At the risk of stating the obvious, a DataFrame is one of the primary data structures used in data analysis. It is just like an RDBMS table that...
In this recipe, we'll take a look at how to bring Spark into our project (using SBT) and how Spark works internally.
Note
The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter1-spark-csv/build.sbt.
Let's now throw some Spark dependencies into our build.sbt file so that we can start playing with them in subsequent recipes. For now, we'll just focus on three of them: Spark Core, Spark SQL, and Spark MLlib. We'll take a look at a host of other Spark dependencies as we proceed further in this book:
1. Under a brand new folder (which will be your project root), create a new file called build.sbt.
2. Next, let's add the Spark libraries to the project dependencies. Note that Spark 1.4.x requires Scala 2.10.x. This becomes the first section of our build.sbt:

organization := "com.packt"
name := "chapter1-spark-csv"
scalaVersion := "2.10.4"
val sparkVersion="1.4.1"
libraryDependencies ++= Seq(
  "org.apache.spark...
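The dependency list above is truncated; a complete build.sbt along these lines might look like the following sketch. The exact module list and version pins are assumptions based on the Spark 1.4.x line, not the book's verbatim file:

```scala
organization := "com.packt"

name := "chapter1-spark-csv"

scalaVersion := "2.10.4"

val sparkVersion = "1.4.1"

// Spark Core is mandatory; Spark SQL and MLlib back the later recipes.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion,
  "org.apache.spark" %% "spark-sql"   % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)
```

The `%%` operator tells sbt to append the Scala binary version (here, 2.10) to the artifact name, which is why the Scala version declared above matters.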
In this recipe, we'll look at how to create a new DataFrame from a delimiter-separated values file.
Note
The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter1-spark-csv/src/main/scala/com/packt/scaladata/spark/csv/DataFrameCSV.scala.
This recipe involves four steps:

1. Add the spark-csv support to our project.
2. Create a Spark Config object that gives information on the environment that we are running Spark in.
3. Create a Spark context that serves as an entry point into Spark. Then, we proceed to create an SQLContext from the Spark context.
4. Load the CSV using the SQLContext.

CSV support isn't first-class in Spark, but it is available through an external library from Databricks. So, let's go ahead and add that to our build.sbt. After adding the spark-csv dependency, our complete build.sbt looks like this:

organization := "com.packt"
name := "chapter1-spark-csv"
scalaVersion := "2.10.4"
val sparkVersion...
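The four steps can be sketched in code as follows. The app name, file name, and CSV options here are assumptions for illustration; the reader API shown is the Databricks spark-csv data source as exposed through Spark 1.4's DataFrameReader:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameCSV extends App {
  // Step 2: a Spark Config describing the environment (local mode, 2 threads)
  val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")

  // Step 3: the SparkContext is the entry point into Spark;
  // the SQLContext wraps it for DataFrame operations
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  // Step 4: load the CSV through the spark-csv data source
  val students = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")   // treat the first line as column names
    .load("StudentData.csv")    // hypothetical input file
}
```

With `header` set to `true`, the resulting DataFrame's columns are named from the file's first row instead of the default `C0`, `C1`, and so on.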
In the previous recipe, we saw how to create a DataFrame. The next natural step, after creating DataFrames, is to play with the data inside them. Besides the numerous functions that help us do that, we also find other interesting functions that help us sample the data, print the schema of the data, and so on. We'll take a look at them one by one in this recipe.
Note
The code and the sample file for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter1-spark-csv/src/main/scala/com/packt/scaladata/spark/csv/DataFrameCSV.scala.
Now, let's see how we can manipulate DataFrames using the following subrecipes:
Printing the schema of the DataFrame
Sampling data in the DataFrame
Selecting specific columns in the DataFrame
Filtering data by condition
Sorting data in the DataFrame
Renaming columns
Treating the DataFrame as a relational table to execute SQL queries
Saving the DataFrame as a file
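Assuming a DataFrame `students` created as in the previous recipe, the subrecipes above map to calls like the ones below. The column names (`id`, `studentName`, `email`) are hypothetical and would depend on your input file:

```scala
// Print the schema of the DataFrame as a tree
students.printSchema()

// Sample roughly 20% of the rows, without replacement
students.sample(false, 0.2).show()

// Select specific columns
students.select("id", "studentName").show()

// Filter data by a condition
students.filter(students("id") > 5).show()

// Sort data by a column
students.sort(students("studentName").asc).show()

// Rename a column
val renamed = students.withColumnRenamed("studentName", "name")

// Treat the DataFrame as a relational table and run SQL against it
students.registerTempTable("students")
sqlContext.sql("select email from students where id > 5").show()

// Save the DataFrame as a file, using the spark-csv writer
students.write.format("com.databricks.spark.csv").save("out.csv")
```

Note that transformations such as `select` and `filter` return new DataFrames; `show()` is what actually triggers evaluation and prints the rows.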
In this recipe, we'll see how to create a new DataFrame from Scala case classes.
Note
The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter1-spark-csv/src/main/scala/com/packt/scaladata/spark/csv/DataFrameFromCaseClasses.scala.
We create a new entity called Employee with the id and name fields, like this:

case class Employee(id:Int, name:String)

Similar to the previous recipe, we create SparkContext and SQLContext:

val conf = new SparkConf().setAppName("colRowDataFrame").setMaster("local[2]")

//Initialize Spark context with Spark configuration. This is the core entry point to do anything with Spark
val sc = new SparkContext(conf)

//The easiest way to query data in Spark is to use SQL queries.
val sqlContext = new SQLContext(sc)
We can source these employee objects from a variety of sources, such as an RDBMS data source, but for the sake of this example, we construct a list...
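A sketch of turning such a list into a DataFrame is shown below. The sample employees are made up for illustration; the `toDF()` conversion comes from the SQLContext implicits available in Spark 1.4:

```scala
// Bring the implicit conversions (including toDF) into scope
import sqlContext.implicits._

// Hypothetical in-memory source standing in for an RDBMS table
val employees = List(Employee(1, "Arun"), Employee(2, "Jason"), Employee(3, "Abhi"))

// Convert the local collection of case class instances into a DataFrame;
// column names and types are inferred from the case class fields
val empFrame = employees.toDF()

empFrame.printSchema()
empFrame.show()
```

Because the schema is inferred from `Employee`'s fields, the DataFrame automatically gets an `id` column of type integer and a `name` column of type string.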