Chapter 8. Study Case 2 - Connectors
In this chapter, we analyze the connectors, that is, the software pieces that enable the SMACK stack technologies to communicate with each other. The relationship between Spark and Kafka was covered in the Kafka chapter, and we dealt with the relationship between Spark and Cassandra in the previous study case, Chapter 7, Study Case 1 - Spark and Cassandra.
This chapter covers the remaining relationships in the following sections:
- Akka and Cassandra
- Akka and Spark
- Kafka and Akka
- Kafka and Cassandra
For this example, we will use the DataStax Cassandra driver and Akka to build an application that downloads tweets and then stores their ID, text, name, and date in a Cassandra table. Here we will see:
- How to build a simple Akka application with just a few actors
- How to use Akka IO to make HTTP requests
- How to store the data in Cassandra
The first step is to build our core example. It contains three actors: two interact with the database and one downloads the tweets. The TweetReadActor reads from Cassandra, the TweetWriterActor writes to Cassandra, and the TweetScanActor downloads the tweets and passes them to the TweetWriterActor to be written to Cassandra:
class TweetReadActor(cluster: Cluster) extends Actor { ... }
class TweetWriterActor(cluster: Cluster) extends Actor { ... }
class TweetScanActor(tweetWrite: ActorRef, queryUrl: String => String) extends Actor { ... }
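To make the writer concrete, here is a minimal sketch of how TweetWriterActor could persist tweets with the DataStax driver. The keyspace name (twitter), table name (tweets), and the fields of the Tweet case class are illustrative assumptions, not definitions from the book:

```scala
import akka.actor.Actor
import com.datastax.driver.core.Cluster

// Hypothetical minimal tweet representation; field names are assumptions.
case class Tweet(id: String, user: String, text: String, createdAt: java.util.Date)

class TweetWriterActor(cluster: Cluster) extends Actor {
  // Keyspace and table names are assumed for illustration.
  val session = cluster.connect("twitter")
  val preparedStatement = session.prepare(
    "INSERT INTO tweets (id, user, text, createdat) VALUES (?, ?, ?, ?);")

  def receive: Receive = {
    case tweets: List[Tweet] => tweets.foreach(saveTweet)
    case tweet: Tweet        => saveTweet(tweet)
  }

  // Bind the prepared statement and write asynchronously, so the
  // actor never blocks on Cassandra I/O.
  private def saveTweet(tweet: Tweet): Unit =
    session.executeAsync(preparedStatement.bind(
      tweet.id, tweet.user, tweet.text, tweet.createdAt))
}
```

Preparing the statement once in the constructor and binding it per message avoids re-parsing the CQL on every write.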
Figure 8-1 shows the relationship between the Twitter downloader...
We start developing the Spark Streaming application by creating a SparkConf followed by a StreamingContext:
val conf = new SparkConf(false)
.setMaster("local[*]")
.setAppName("Spark Streaming with Akka")
.set("spark.logConf", "true")
.set("spark.driver.port", s"$driverPort")
.set("spark.driver.host", s"$driverHost")
.set("spark.akka.logLifecycleEvents", "true")
val ssc = new StreamingContext(conf, Seconds(1))
This context lets us create an input stream backed by an actor; the stream is of type ReceiverInputDStream:
val actorName = "salutator"
val actorStream: ReceiverInputDStream[String] = ssc.actorStream[String](Props[Salutator], actorName)
Now that we have a DStream, let's define a high-level processing pipeline in Spark Streaming:
actorStream.print()
In the preceding case, the print() method prints the first 10 elements of each RDD generated in this DStream. Nothing happens until start() is executed...
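The Salutator actor referenced in Props[Salutator] can be sketched as follows. In the Spark 1.x API, a receiver actor mixes in the ActorHelper trait, whose store() method hands each message to Spark Streaming; everything else about this actor is an illustrative assumption:

```scala
import akka.actor.Actor
import org.apache.spark.streaming.receiver.ActorHelper

// A sketch of a receiver actor for ssc.actorStream (Spark 1.x API).
// ActorHelper.store() pushes each received message into Spark Streaming,
// where it becomes an element of the ReceiverInputDStream.
class Salutator extends Actor with ActorHelper {
  def receive: Receive = {
    case s: String => store(s) // forward every string into the stream
  }
}
```

Once start() is called on the StreamingContext, every string this actor stores shows up in the one-second batches that print() displays.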
The connector is available on Maven Central for Scala 2.11 at the following coordinates:
libraryDependencies += "com.typesafe.akka" %% "akka-stream-kafka" % "0.11-M4"
If you remember, a producer publishes messages to Kafka topics. The message itself contains information about which topic and partition to publish to, so one can publish to different topics with the same producer. The underlying implementation uses the KafkaProducer.
When creating a producer stream, we need to pass ProducerSettings defining:
- Kafka cluster bootstrap servers
- Serializers for the keys and values
- Tuning parameters
Here is a ProducerSettings example:
import akka.kafka._
import akka.kafka.scaladsl._
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.kafka.common.serialization.ByteArraySerializer
val producerSettings = ProducerSettings(system, new ByteArraySerializer, new StringSerializer)
  .withBootstrapServers("localhost:9092")
The easiest way to publish...
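With the settings above, a stream of elements can be wired to Kafka through Producer.plainSink. The following sketch defines, but does not run, such a stream; the topic name topic1 is an assumption:

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, Source}
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}

implicit val system = ActorSystem("publisher")
implicit val materializer = ActorMaterializer()

val producerSettings = ProducerSettings(system, new ByteArraySerializer, new StringSerializer)
  .withBootstrapServers("localhost:9092")

// Map each number to a ProducerRecord for the assumed topic "topic1",
// then attach the Kafka sink. Calling publish.run() would start sending
// once a broker is reachable at the configured address.
val publish = Source(1 to 100)
  .map(n => new ProducerRecord[Array[Byte], String]("topic1", n.toString))
  .toMat(Producer.plainSink(producerSettings))(Keep.right)
```

Each ProducerRecord carries its own topic, which is what lets a single producer stream publish to several topics.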
We need to use the kafka-connect-cassandra connector, which is published on Maven Central by Tuplejump.
It can be defined as a dependency in the build file. For example, with SBT:
libraryDependencies += "com.tuplejump" %% "kafka-connect-cassandra" % "0.0.7"
This code polls Cassandra with a specific query. Using this, data can be fetched from Cassandra in two modes: bulk and timestamp based. The mode is selected automatically based on the query. For example:
Bulk:
SELECT * FROM userlog;
Timestamp based:
SELECT * FROM userlog WHERE ts > previousTime();
SELECT * FROM userlog WHERE ts = currentTime();
SELECT * FROM userlog WHERE ts >= previousTime() AND ts <= currentTime();
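For reference, a userlog table compatible with the queries above could be defined as follows; every column except ts is an illustrative assumption:

```sql
-- Hypothetical schema for the userlog table polled by the connector.
CREATE TABLE userlog (
  username text,
  ts timestamp,
  action text,
  PRIMARY KEY (username, ts)
);
```

Clustering on ts is what makes the timestamp-range queries above efficient within a partition.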
Here, previousTime() and currentTime() are replaced with actual timestamps before fetching the data.
Table...
We have reviewed the connectors between all the SMACK stack technologies. The Spark and Kafka connection was explained in Chapter 5, The Broker - Apache Kafka, and the Spark and Cassandra connector was explained in Chapter 7, Study Case 1 - Spark and Cassandra.
In this chapter, we reviewed the connectors between:
- Akka and Cassandra
- Akka and Spark
- Kafka and Akka
- Kafka and Cassandra
In the following chapter, we will review container technologies.