Getting Started with Apache Spark DataFrames

Arun Manivannan

September 2015

 In this article, drawn from Arun Manivannan’s book Scala Data Analysis Cookbook, we will cover the following recipes:

  • Getting Apache Spark ML – a framework for large-scale machine learning
  • Creating a data frame from CSV


Getting started with Apache Spark

Breeze is the building block of Spark MLlib, the machine learning library for Apache Spark. In this recipe, we'll see how to bring Spark into our project (using SBT) and look at how it works internally.

The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter1....

How to do it...

Pulling Spark ML into our project is just a matter of adding a few dependencies to our build.sbt file: spark-core, spark-sql, and spark-mllib:

  1. Under a brand new folder (which will be our project root), we create a new file called build.sbt.
  2. Next, let's add the Spark libraries to the project dependencies:
    organization := "com.packt"
    
    name := "chapter1-spark-csv"
    
    scalaVersion := "2.10.4"
    
    val sparkVersion="1.3.0"
    
    libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion,
    "org.apache.spark" %% "spark-sql" % sparkVersion,
    "org.apache.spark" %% "spark-mllib" % sparkVersion
    )
    
    resolvers ++= Seq(
    "Apache HBase" at "https://repository.apache.org/content/repositories/releases",
    "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/"
    )

How it works...

Spark has four major higher-level tools built on top of the Spark Core: Spark Streaming, Spark MLlib (machine learning), Spark SQL (an SQL interface for accessing data), and GraphX (for graph processing). The Spark Core is the heart of Spark, providing higher-level abstractions in various languages for data representation, serialization, scheduling, metrics, and so on.

For this recipe, we skipped streaming and GraphX and added the remaining three libraries.
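
If a later project does need the two libraries skipped here, they could be pulled in the same way. The following build.sbt fragment is only a sketch and is not part of this recipe; it assumes the sparkVersion value defined above and uses the spark-streaming and spark-graphx artifacts that Spark publishes alongside its other modules:

```scala
// Optional additions to build.sbt (not needed for this recipe)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % sparkVersion, // Spark Streaming
  "org.apache.spark" %% "spark-graphx" % sparkVersion     // GraphX
)
```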

There’s more…

Apache Spark is a cluster computing platform that claims to run up to 100 times faster than Hadoop MapReduce for in-memory workloads. In our terms, we can think of it as a means to run our complex logic over a massive amount of data at very high speed. The other good thing about Spark is that the programs we write are much smaller than the typical MapReduce classes that we write for Hadoop. So, not only do our programs run faster, but they also take less time to write in the first place.

Creating a data frame from CSV

In this recipe, we'll look at how to create a new data frame from a delimiter-separated values (DSV) file.

The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/tree/master/chapter1-spark-csv in the DataFrameCSV class.

How to do it...

CSV support isn't first-class in Spark but is available through an external library from Databricks. So, let's go ahead and add it to build.sbt:

    1. After adding the spark-csv dependency, our complete build.sbt looks as follows:
      organization := "com.packt"
      
      name := "chapter1-spark-csv"
      
      scalaVersion := "2.10.4"
      
      val sparkVersion="1.3.0"
      
      libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion,
      "org.apache.spark" %% "spark-sql" % sparkVersion,
      "org.apache.spark" %% "spark-mllib" % sparkVersion,
      "com.databricks" %% "spark-csv" % "1.0.3"
      )
      
      resolvers ++= Seq(
      "Apache HBase" at "https://repository.apache.org/content/repositories/releases",
      "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/"
      )
      
      fork := true
    2. Before we create the actual data frame, there are three things that we need to do: create the Spark configuration, create the Spark context, and create the SQL context. SparkConf holds the settings for running the Spark application. For this recipe, we are running locally and intend to use only two cores of the machine, which is what local[2] specifies:
      val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")

      For this recipe, we'll be running Spark in local mode, inside a single JVM on our machine, rather than on a cluster.

    3. Now let's load our pipe-separated file:
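      // csvFile is added to SQLContext by the implicit conversions in spark-csv
      import com.databricks.spark.csv._
      import org.apache.spark.sql.DataFrame

      val students: DataFrame = sqlContext.csvFile(filePath = "StudentData.csv", useHeader = true, delimiter = '|')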
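
Putting the three preparation steps and the load call together, a minimal sketch of the whole program might look like the following. The object name CsvDataFrameSketch is hypothetical and is not the DataFrameCSV class from the book's repository; the sketch assumes the build.sbt shown above and a StudentData.csv file in the working directory, and the printSchema and show calls are added only to inspect the result:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
// Brings the csvFile method into scope on SQLContext (from spark-csv)
import com.databricks.spark.csv._

// Hypothetical object name, used only for this sketch
object CsvDataFrameSketch {
  def main(args: Array[String]): Unit = {
    // 1. Spark configuration: application name, local mode with two cores
    val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")
    // 2. Spark context, created from the configuration
    val sc = new SparkContext(conf)
    // 3. SQL context, created from the Spark context
    val sqlContext = new SQLContext(sc)

    // Load the pipe-separated file (assumed to be in the working directory)
    val students: DataFrame =
      sqlContext.csvFile(filePath = "StudentData.csv", useHeader = true, delimiter = '|')

    students.printSchema() // column names come from the header row
    students.show(5)       // print the first five rows

    sc.stop()
  }
}
```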

How it works...

The csvFile function of sqlContext accepts the full filePath of the file to be loaded. If the file has a header row, setting the useHeader flag to true makes the first row the column names. The delimiter flag, as expected, defaults to a comma, but you can override the character as needed.
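
To make the effect of the two flags concrete, here is a small plain-Scala sketch that mimics them on a few in-memory lines. It is only an illustration and not how spark-csv is implemented; the object and method names, the sample rows, and the positional names C0, C1 used when there is no header are all assumptions made for this example:

```scala
// Illustration only: mimics the useHeader and delimiter flags on in-memory lines
object DelimitedHeaderSketch {

  // Split one line on a single-character delimiter; -1 keeps trailing empty fields
  def splitLine(line: String, delimiter: Char): Vector[String] =
    line.split(java.util.regex.Pattern.quote(delimiter.toString), -1).toVector

  // Returns (column names, data rows).
  // With useHeader = true the first line supplies the column names;
  // otherwise every line is data and columns get placeholder names C0, C1, ...
  def parse(lines: Seq[String],
            useHeader: Boolean,
            delimiter: Char = ','): (Vector[String], Seq[Vector[String]]) = {
    val rows = lines.map(splitLine(_, delimiter))
    if (rows.isEmpty) (Vector.empty, Seq.empty)
    else if (useHeader) (rows.head, rows.tail)
    else (rows.head.indices.map(i => s"C$i").toVector, rows)
  }
}
```

For example, parsing the made-up lines "id|name", "1|Ann", "2|Bob" with useHeader = true and delimiter = '|' yields the column names id and name plus two data rows.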

Instead of using the csvFile function, you can also use the load function available in the SQL context. The load function accepts the format of the file (in our case, it is CSV) and options as a map. We can specify the same parameters that we specified earlier using Map, like this:

val options=Map("header"->"true", "path"->"ModifiedStudent.csv")

val newStudents=sqlContext.load("com.databricks.spark.csv",options)
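
Because the options map holds only strings, the delimiter from the earlier example would also have to be passed as a string. The helper below is a hypothetical convenience, not part of Spark or spark-csv, that builds such a map; it assumes the path, header, and delimiter option keys understood by spark-csv:

```scala
// Hypothetical helper: builds the string-valued options map for sqlContext.load
object CsvOptionsSketch {
  def csvOptions(path: String,
                 header: Boolean = true,
                 delimiter: Char = ','): Map[String, String] =
    Map(
      "path" -> path,                      // file to load
      "header" -> header.toString,         // "true" if the first row holds column names
      "delimiter" -> delimiter.toString    // single-character field separator
    )
}
```

With a SQLContext in scope, the result could then be passed along as sqlContext.load("com.databricks.spark.csv", CsvOptionsSketch.csvOptions("StudentData.csv", delimiter = '|')).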

Summary

In this article, we learned how to pull Apache Spark ML, a framework for large-scale machine learning, into our project.

Then we saw the creation of a data frame from CSV with the help of example code.
