Reader small image

You're reading from  Scala Data Analysis Cookbook

Product typeBook
Published inOct 2015
Reading LevelIntermediate
Publisher
ISBN-139781784396749
Edition1st Edition
Languages
Right arrow
Author (1)
Arun Manivannan
Arun Manivannan
author image
Arun Manivannan

Arun Manivannan has been an engineer in various multinational companies, tier-1 financial institutions, and start-ups, primarily focusing on developing distributed applications that manage and mine data. His languages of choice are Scala and Java, but he also meddles around with various others for kicks. He blogs at http://rerun.me. Arun holds a master's degree in software engineering from the National University of Singapore. He also holds degrees in commerce, computer applications, and HR management. His interests and education could probably be a good dataset for clustering.
Read more about Arun Manivannan

Right arrow

Chapter 7. Going Further

In this chapter, we will cover the following recipes:

  • Using Spark Streaming to subscribe to a Twitter stream

  • Using Spark as an ETL tool (pulling data from ElasticSearch and publishing it to Kafka)

  • Using StreamingLogisticRegression to classify a Twitter stream using Kafka as a training stream

  • Using GraphX to analyze Twitter data

  • Watching other Scala libraries of interest

Introduction


So far, the entire book has concentrated a little around Breeze and a lot around Spark, specifically DataFrames and machine learning. However, there are a whole lot of other libraries, both in Java and Scala that could be leveraged while analyzing data from Scala. This chapter goes a little more into Spark's other components, streaming and GraphX. Note that each recipe in this chapter feeds into the next recipe.

Note

All the code related to this chapter can be downloaded from https://github.com/arunma/ScalaDataAnalysisCookbook/tree/master/chapter7-goingfurther.

Using Spark Streaming to subscribe to a Twitter stream


Just like all the other components of Spark, Spark Streaming is also scalable and fault-tolerant, it's just that it manages a stream of data instead of a large amount of data that Spark generally does. The way that Spark Streaming approaches streaming is unique in the sense that it accumulates streams into small batches called DStreams and then processes them as mini-batches, an approach usually called micro-batching. The component that receives the stream of data and splits it into time-bound windows of batches is called the receiver.

Once these batches are received, Spark takes these batches up, converts them into RDDs, and processes the RDDs in the same way as static datasets. The regular framework components such as the driver and executor stay the same. However, in terms of Spark Streaming, a DStream or Discretized stream is just a continuous stream of RDDs. Also, just like SQLContext served as an entry point to use SQL in Spark...

Using Spark as an ETL tool


In the previous recipe, we subscribed to a Twitter stream and stored it in ElasticSearch. Another common source of streaming is Kafka, a distributed message broker. In fact, it's a distributed log of messages, which in simple terms means that there can be multiple brokers that has the messages partitioned among them.

In this recipe, we'll be subscribing the data that we ingested into ElasticSearch in the previous recipe and publishing the messages into Kafka. Soon after we publish the data to Kafka, we'll be subscribing to Kafka using the Spark Stream API. While this is a recipe that demonstrates treating ElasticSearch data as an RDD and publishing to Kafka using a KryoSerializer, the true intent of this recipe is to run a streaming classification algorithm against Twitter, which is our next recipe.

How to do it...

Let's look at the various steps involved in doing this.

  1. Setting up Kafka: This recipe uses Kafka version 0.8.2.1 for Spark 2.10, which can be downloaded...

Using StreamingLogisticRegression to classify a Twitter stream using Kafka as a training stream


In the previous recipe, we published all the tweets that were stored in ElasticSearch to a Kafka topic. In this recipe, we'll subscribe to the Kafka stream and train a classification model out of it. We will later use this trained model to classify a live Twitter stream.

How to do it...

This is a really small recipe that is composed of 3 steps:

  1. Subscribing to a Kafka stream: There are two ways to subscribe to a Kafka stream and we'll be using the DirectStream method, which is faster. Just like Twitter streaming, Spark has first-class support for subscribing to a Kafka stream. This is achieved by adding the spark-streaming-kafka dependency. Let's add it to our build.sbt file:

    "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion

    The subscription process is more or less the reverse of the publishing process even in terms of the properties that we pass to Kafka:

    val topics = Set("twtopic")
    val kafkaParams...

Using GraphX to analyze Twitter data


GraphX is Spark's approach to graphs and computation against graphs. In this recipe, we will see a preview of what is possible with the GraphX component in Spark.

How to do it...

Now that we have the Twitter data stored in the ElasticSearch index, we will perform the following tasks on this data using a graph:

  1. Convert the ElasticSearch data into a Spark Graph.

  2. Sample vertices, edges, and triplets in the graph.

  3. Find the top group of connected hashtags (connected component).

  4. List all the hashtags in that component.

  1. Converting the ElasticSearch data into a graph: This involves two steps:

    1. Converting ElasticSearch data into a DataFrame: This step, like we saw in an earlier recipe, is just a one-liner:

      def convertElasticSearchDataToDataFrame(sqlContext: SQLContext) = {
          val twStatusDf = sqlContext.esDF("spark/twstatus")
          twStatusDf
      }
    2. Converting DataFrame to a graph: Spark Graph construction requires an RDD for a vertex and an RDD of edges. Let's construct them...

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Scala Data Analysis Cookbook
Published in: Oct 2015Publisher: ISBN-13: 9781784396749
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Arun Manivannan

Arun Manivannan has been an engineer in various multinational companies, tier-1 financial institutions, and start-ups, primarily focusing on developing distributed applications that manage and mine data. His languages of choice are Scala and Java, but he also meddles around with various others for kicks. He blogs at http://rerun.me. Arun holds a master's degree in software engineering from the National University of Singapore. He also holds degrees in commerce, computer applications, and HR management. His interests and education could probably be a good dataset for clustering.
Read more about Arun Manivannan