Chapter 8. Study Case 2 - Connectors
In this chapter, we analyze the connectors, that is, the software pieces that enable the SMACK stack technologies to communicate with each other. The relationship between Spark and Kafka was covered in the Kafka chapter, and we dealt with the relationship between Spark and Cassandra in the previous study case, Chapter 7, Study Case 1 - Spark and Cassandra.
This chapter covers the remaining relationships in the following sections:
- Akka and Cassandra
- Akka and Spark
- Kafka and Akka
- Kafka and Cassandra
For this example, we will use the DataStax Cassandra driver and Akka to build an application that downloads tweets and then stores their ID, text, name, and date in a Cassandra table. Here we will see:
- How to build a simple Akka application with just a few actors
- How to use Akka IO to make HTTP requests
- How to store the data in Cassandra
The first step is to build our core example. It contains three actors: two interact with the database and one downloads the tweets. The TweetReadActor reads from Cassandra, the TweetWriterActor writes to Cassandra, and the TweetScanActor downloads the tweets and passes them to the TweetWriterActor to be written to Cassandra:
class TweetReadActor(cluster: Cluster) extends Actor { ... }
class TweetWriterActor(cluster: Cluster) extends Actor { ... }
class TweetScanActor(tweetWrite: ActorRef, queryUrl: String => String) extends Actor { ... }
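To make the writer concrete, here is a minimal sketch of how TweetWriterActor could persist tweets with the DataStax driver. The keyspace name (twitter), table name (tweets), and the fields of the Tweet case class are illustrative assumptions, not definitions from the book:

```scala
import akka.actor.Actor
import com.datastax.driver.core.Cluster

// Hypothetical minimal tweet representation; field names are assumptions.
case class Tweet(id: String, user: String, text: String, createdAt: java.util.Date)

class TweetWriterActor(cluster: Cluster) extends Actor {
  // Keyspace and table names are assumed for illustration.
  val session = cluster.connect("twitter")
  val preparedStatement = session.prepare(
    "INSERT INTO tweets (id, user, text, createdat) VALUES (?, ?, ?, ?);")

  def receive: Receive = {
    case tweets: List[Tweet] => tweets.foreach(saveTweet)
    case tweet: Tweet        => saveTweet(tweet)
  }

  // Bind the prepared statement and write asynchronously, so the
  // actor never blocks on Cassandra I/O.
  private def saveTweet(tweet: Tweet): Unit =
    session.executeAsync(preparedStatement.bind(
      tweet.id, tweet.user, tweet.text, tweet.createdAt))
}
```

Preparing the statement once in the constructor and binding it per message avoids re-parsing the CQL on every write.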
Figure 8-1 shows the relationship between the Twitter downloader...
We start developing the Spark Streaming application by creating a SparkConf followed by a StreamingContext:
val conf = new SparkConf(false)
.setMaster("local[*]")
.setAppName("Spark Streaming with Akka")
.set("spark.logConf", "true")
.set("spark.driver.port", s"$driverPort")
.set("spark.driver.host", s"$driverHost")
.set("spark.akka.logLifecycleEvents", "true")
val ssc = new StreamingContext(conf, Seconds(1))
This context lets us create an input stream backed by an actor; the stream is of type ReceiverInputDStream:
val actorName = "salutator"
val actorStream: ReceiverInputDStream[String] = ssc.actorStream[String](Props[Salutator], actorName)
Now that we have a DStream, let's define a high-level processing pipeline in Spark Streaming:
actorStream.print()
In the preceding case, the print() method prints the first 10 elements of each RDD generated in this DStream. Nothing happens until start() is executed...
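The Salutator actor referenced in Props[Salutator] can be sketched as follows. In the Spark 1.x API, a receiver actor mixes in the ActorHelper trait, whose store() method hands each message to Spark Streaming; everything else about this actor is an illustrative assumption:

```scala
import akka.actor.Actor
import org.apache.spark.streaming.receiver.ActorHelper

// A sketch of a receiver actor for ssc.actorStream (Spark 1.x API).
// ActorHelper.store() pushes each received message into Spark Streaming,
// where it becomes an element of the ReceiverInputDStream.
class Salutator extends Actor with ActorHelper {
  def receive: Receive = {
    case s: String => store(s) // forward every string into the stream
  }
}
```

Once start() is called on the StreamingContext, every string this actor stores shows up in the one-second batches that print() displays.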
The connector is available on Maven Central for Scala 2.11 at the following coordinates:
libraryDependencies += "com.typesafe.akka" %% "akka-stream-kafka" % "0.11-M4"
If you remember, a producer publishes messages to Kafka topics. The message itself contains information about which topic and partition to publish to, so one can publish to different topics with the same producer. The underlying implementation uses the KafkaProducer.
When creating a producer stream, we need to pass ProducerSettings defining:
- Kafka cluster bootstrap servers
- Serializers for the keys and values
- Tuning parameters
Here is a ProducerSettings example:
import akka.kafka._
import akka.kafka.scaladsl._
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.kafka.common.serialization.ByteArraySerializer
val producerSettings = ProducerSettings(system, new ByteArraySerializer, new StringSerializer)
  .withBootstrapServers("localhost:9092")
The easiest way to publish...
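With the settings above, a stream of elements can be wired to Kafka through Producer.plainSink. The following sketch defines, but does not run, such a stream; the topic name topic1 is an assumption:

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, Source}
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}

implicit val system = ActorSystem("publisher")
implicit val materializer = ActorMaterializer()

val producerSettings = ProducerSettings(system, new ByteArraySerializer, new StringSerializer)
  .withBootstrapServers("localhost:9092")

// Map each number to a ProducerRecord for the assumed topic "topic1",
// then attach the Kafka sink. Calling publish.run() would start sending
// once a broker is reachable at the configured address.
val publish = Source(1 to 100)
  .map(n => new ProducerRecord[Array[Byte], String]("topic1", n.toString))
  .toMat(Producer.plainSink(producerSettings))(Keep.right)
```

Each ProducerRecord carries its own topic, which is what lets a single producer stream publish to several topics.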
We need to use the kafka-connect-cassandra connector, which is published on Maven Central by Tuplejump.
It can be defined as a dependency in the build file. For example, with SBT:
libraryDependencies += "com.tuplejump" %% "kafka-connect-cassandra" % "0.0.7"
This code polls Cassandra with a specific query. Using this, data can be fetched from Cassandra in two modes: bulk and timestamp based. The mode is selected automatically based on the query. For example:
Bulk:
SELECT * FROM userlog;
Timestamp based:
SELECT * FROM userlog WHERE ts > previousTime();
SELECT * FROM userlog WHERE ts = currentTime();
SELECT * FROM userlog WHERE ts >= previousTime() AND ts <= currentTime();
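For reference, a userlog table compatible with the queries above could be defined as follows; every column except ts is an illustrative assumption:

```sql
-- Hypothetical schema for the userlog table polled by the connector.
CREATE TABLE userlog (
  username text,
  ts timestamp,
  action text,
  PRIMARY KEY (username, ts)
);
```

Clustering on ts is what makes the timestamp-range queries above efficient within a partition.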
Here, previousTime() and currentTime() are replaced with actual timestamps before fetching the data.
Table...
We have reviewed the connectors between all the SMACK stack technologies. The Spark and Kafka connection was explained in Chapter 5, The Broker - Apache Kafka, and the Spark and Cassandra connector was explained in Chapter 7, Study Case 1 - Spark and Cassandra.
In this chapter, we reviewed the connectors between:
- Akka and Cassandra
- Akka and Spark
- Kafka and Akka
- Kafka and Cassandra
In the following chapter, we will review container technologies.