Chapter 2. Getting Started with Apache Spark DataFrames

In this chapter, we will cover the following recipes:

  • Getting Apache Spark

  • Creating a DataFrame from CSV

  • Manipulating DataFrames

  • Creating a DataFrame from Scala case classes

Introduction


Apache Spark is a cluster computing platform that claims to run about 10 times faster than Hadoop MapReduce. In general terms, we could consider it a means to run our complex logic over massive amounts of data at blazing speed. The other good thing about Spark is that the programs we write are much smaller than the typical MapReduce classes we write for Hadoop. So, not only do our programs run faster, but they also take less time to write.

Spark Core is the heart of Spark, and four major higher-level tools are built on top of it: Spark Streaming, Spark MLlib (machine learning), Spark SQL (an SQL interface for accessing the data), and GraphX (for graph processing). Spark provides higher-level abstractions in Scala, Java, and Python for data representation, serialization, scheduling, metrics, and so on.

At the risk of stating the obvious, a DataFrame is one of the primary data structures used in data analysis. It is just like an RDBMS table that...

Getting Apache Spark


In this recipe, we'll take a look at how to bring Spark into our project (using SBT) and how Spark works internally.

How to do it...

Let's now throw some Spark dependencies into our build.sbt file so that we can start playing with them in subsequent recipes. For now, we'll just focus on three of them: Spark Core, Spark SQL, and Spark MLlib. We'll take a look at a host of other Spark dependencies as we proceed further in this book:

  1. Under a brand new folder (which will be your project root), create a new file called build.sbt.

  2. Next, let's add the Spark libraries to the project dependencies. Note that the Spark 1.4.x artifacts we use here are built against Scala 2.10.x, so this becomes the first section of our build.sbt:

    organization := "com.packt"
    
    name := "chapter1-spark-csv"
    
    scalaVersion := "2.10.4"
    
    val sparkVersion = "1.4.1"
    
    libraryDependencies ++= Seq(
      "org.apache.spark...

Creating a DataFrame from CSV


In this recipe, we'll look at how to create a new DataFrame from a delimiter-separated values file.

How to do it...

This recipe involves four steps:

  1. Add the spark-csv support to our project.

  2. Create a SparkConf object that gives information on the environment in which we are running Spark.

  3. Create a SparkContext that serves as an entry point into Spark. Then, we proceed to create an SQLContext from the SparkContext.

  4. Load the CSV using the SQLContext.

CSV support isn't first-class in Spark, but it is available through an external library from Databricks. So, let's go ahead and add that to our build.sbt.

    After adding the spark-csv dependency, our complete build.sbt looks like this:

    organization := "com.packt"
    
    name := "chapter1-spark-csv"
    
    scalaVersion := "2.10.4"
    
    val sparkVersion...
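
    That listing is truncated as well; the key addition relative to the previous recipe is the spark-csv artifact. Here is a minimal sketch of the new dependency and of steps 2 to 4 — the spark-csv version, application name, and file path are placeholders, not the book's exact values:

    // In build.sbt: the external CSV library from Databricks (version is an assumption)
    libraryDependencies += "com.databricks" %% "spark-csv" % "1.0.3"

    And in the program itself:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    
    //Step 2: configuration for a local run with two threads
    val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")
    
    //Step 3: the SparkContext and, from it, the SQLContext
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    
    //Step 4: load the CSV through the Databricks data source
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true") //treat the first line as column names
      .load("src/main/resources/profiles.csv") //placeholder path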

Manipulating DataFrames


In the previous recipe, we saw how to create a DataFrame. The natural next step after creating DataFrames is to play with the data inside them. Besides the numerous functions that help us do that, we also find interesting functions that help us sample the data, print the schema of the data, and so on. We'll take a look at them one by one in this recipe.

How to do it...

Now, let's see how we can manipulate DataFrames using the following subrecipes; a combined sketch of these operations appears at the end of this recipe:

  • Printing the schema of the DataFrame

  • Sampling data in the DataFrame

  • Selecting specific columns in the DataFrame

  • Filtering data by condition

  • Sorting data in the DataFrame

  • Renaming columns

  • Treating the DataFrame as a relational table to execute SQL queries

  • Saving the DataFrame as a file

Printing the schema of the...
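
The individual subrecipes are truncated here, but as a compact sketch, the operations in the preceding list map onto DataFrame methods roughly as follows. The DataFrame df, its columns id and name, the table name, and the output path are all illustrative, not taken from the book:

    df.printSchema() //print the schema of the DataFrame
    
    val sampled = df.sample(false, 0.2) //sample ~20% of the rows, without replacement
    
    val names = df.select("name") //select specific columns
    
    val filtered = df.filter(df("id") > 100) //filter data by condition
    
    val sorted = df.sort(df("name").desc) //sort the data
    
    val renamed = df.withColumnRenamed("id", "empId") //rename a column
    
    //Treat the DataFrame as a relational table and run SQL against it
    renamed.registerTempTable("employees")
    val viaSql = sqlContext.sql("select name from employees where empId > 100")
    
    //Save the DataFrame as a file, here via the spark-csv data source
    renamed.write.format("com.databricks.spark.csv").save("out/employees") //placeholder path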

Creating a DataFrame from Scala case classes


In this recipe, we'll see how to create a new DataFrame from Scala case classes.

How to do it...

  1. We create a new entity called Employee with the id and name fields, like this:

    case class Employee(id: Int, name: String)
    

    Similar to the previous recipe, we create a SparkContext and an SQLContext:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    
    val conf = new SparkConf().setAppName("colRowDataFrame").setMaster("local[2]")
    
    //Initialize the Spark context with the Spark configuration. This is the core entry point for doing anything with Spark.
    val sc = new SparkContext(conf)
    
    //The easiest way to query data in Spark is to use SQL queries.
    val sqlContext = new SQLContext(sc)
    
    
  2. We can source these employee objects from a variety of sources, such as an RDBMS data source, but for the sake of this example, we construct a list...
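
    The listing stops just short of the list itself. A minimal sketch of the rest of this step, with placeholder employee data, might look like this (toDF() comes from the implicit conversions on the SQLContext):

    //Placeholder data; an RDBMS or file source could supply the same objects
    val employees = List(Employee(1, "Arun"), Employee(2, "Jason"), Employee(3, "Abhi"))
    
    //Bring in the implicit conversions that add toDF() to RDDs of case classes
    import sqlContext.implicits._
    
    val empFrame = sc.parallelize(employees).toDF()
    empFrame.printSchema()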
