You're reading from Mastering Apache Storm
1st Edition, published Aug 2017
ISBN-13: 9781787125636

Author: Ankit Jain

Ankit Jain holds a bachelor's degree in computer science and engineering. He has 6 years' experience in designing and architecting solutions for the big data domain and has been involved in several complex engagements. His technical strengths include Hadoop, Storm, S4, HBase, Hive, Sqoop, Flume, Elasticsearch, machine learning, Kafka, Spring, Java, and J2EE. He also shares his thoughts on his personal blog. You can follow him on Twitter at @mynameisanky. He spends most of his time reading books and playing with different technologies. When not at work, he spends time with his family and friends watching movies and playing games.
Chapter 12. Twitter Tweet Collection and Machine Learning

In the previous chapter, we covered how we can create a log processing application with Storm and Kafka.

In this chapter, we are covering another important use case of Storm: machine learning.

The following are the major topics covered in this chapter:

  • Exploring machine learning
  • Using Kafka producer to store the tweets in a Kafka cluster
  • Using Kafka Spout to read the data from Kafka
  • Using Storm Bolt to filter the tweets
  • Using Storm Bolt to calculate the sentiments of tweets
  • Deployment of topologies
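Two of the bullet items above (filtering tweets and calculating sentiment) ultimately reduce to plain-Java logic that a bolt's execute() method wraps. As a hedged sketch of the filtering step — the class name and keyword set are assumptions for illustration, not the book's code — a keyword-based tweet filter might look like:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical keyword filter: a filter bolt would call accept() on each
// incoming tweet and emit only the tweets that match a tracked keyword.
public class TweetFilter {
    private final Set<String> keywords = new HashSet<>();

    public TweetFilter(Collection<String> tracked) {
        for (String k : tracked) {
            keywords.add(k.toLowerCase());
        }
    }

    // Keep the tweet only if it mentions at least one tracked keyword.
    public boolean accept(String tweetText) {
        for (String word : tweetText.toLowerCase().split("\\W+")) {
            if (keywords.contains(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TweetFilter filter = new TweetFilter(Arrays.asList("storm", "kafka"));
        System.out.println(filter.accept("Apache Storm is fast")); // true
        System.out.println(filter.accept("nice weather today"));   // false
    }
}
```

Inside a real Storm bolt, a tuple that fails accept() would simply be acked without being emitted downstream.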

Exploring machine learning


Machine learning is a branch of applied computer science in which we build models of real-world phenomena from existing data, and then use those models to predict characteristics of data the model has never seen before. Machine learning has become a very important component of real-time applications, as decisions need to be made in real time.

Graphically, the process of machine learning can be represented by the following figure:

The process of building the model from data is called training in machine learning terminology. Training can happen in real time on a stream of data, or it can be done on historical data. When the training happens in real time, the model evolves over time as the data changes; this kind of learning is referred to as online learning. When the model is instead updated every once in a while, by running the training algorithm on a new batch of data, it is called offline learning.
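The online/offline distinction can be illustrated with the simplest possible "model": a running mean of a numeric feature. This is an assumed toy example for illustration only, not the book's code — real models are far richer, but the update pattern is the same:

```java
import java.util.Arrays;

// Toy illustration of online vs. offline training.
public class MeanModel {
    private double mean = 0.0;
    private long n = 0;

    // Online learning: update the model incrementally as each observation
    // streams in, so the model evolves with the data.
    public void observe(double x) {
        n++;
        mean += (x - mean) / n; // incremental mean update
    }

    // Offline learning: rebuild the model from scratch over a historical batch.
    public static MeanModel trainOffline(double[] batch) {
        MeanModel m = new MeanModel();
        m.mean = Arrays.stream(batch).average().orElse(0.0);
        m.n = batch.length;
        return m;
    }

    public double predict() {
        return mean;
    }

    public static void main(String[] args) {
        MeanModel online = new MeanModel();
        for (double x : new double[]{1, 2, 3, 4}) {
            online.observe(x);
        }
        System.out.println(online.predict());                              // 2.5
        System.out.println(trainOffline(new double[]{1, 2, 3, 4}).predict()); // 2.5
    }
}
```

Both routes reach the same model here; the difference is that the online version is always up to date, at the cost of doing work on every incoming tuple.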

When we talk about machine learning...

Twitter sentiment analysis


We will be dividing the sentiment analysis use case into two parts:

  • Collecting tweets from Twitter and storing them in Kafka
  • Reading the data from Kafka, calculating the sentiments, and storing them in HDFS

Using Kafka producer to store the tweets in a Kafka cluster

In this section, we are going to cover how we can stream the tweets from Twitter using the Twitter streaming API. We are also going to cover how we can store the fetched tweets in Kafka for later processing through Storm.

We are assuming you already have a Twitter account, and that the consumer key and access token are generated for your application. You can refer to: https://bdthemes.com/support/knowledge-base/generate-api-key-consumer-token-access-key-twitter-oauth/ to generate a consumer key and access token. Take the following steps:

  1. Create a new maven project with groupId com.stormadvance and artifactId kafka_producer_twitter.
  2. Add the following dependencies to the pom.xml file. We are adding the Kafka and...
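The chapter's actual dependency list is truncated above. As an assumed sketch only (the artifact versions are guesses; pick the ones matching your Kafka cluster and Java version), a Kafka producer plus a Twitter streaming client would typically declare something like:

```xml
<!-- Kafka client library, used to publish fetched tweets to a Kafka topic.
     Version is an assumption; match it to your cluster. -->
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.10</artifactId>
    <version>0.9.0.1</version>
</dependency>
<!-- Twitter4J streaming client, used to consume the Twitter streaming API. -->
<dependency>
    <groupId>org.twitter4j</groupId>
    <artifactId>twitter4j-stream</artifactId>
    <version>4.0.6</version>
</dependency>
```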

Kafka spout, sentiments bolt, and HDFS bolt


In this section, we are going to write/configure a Kafka spout to consume the tweets from the Kafka cluster. We are going to use the open source Storm spout connectors for consuming the data from Kafka:

  1. Create a new maven project with the groupId as com.stormadvance and artifactId as Kafka_twitter_topology.
  2. Add the following maven dependencies to the pom.xml file:
   <dependencies> 
         <dependency> 
               <groupId>org.codehaus.jackson</groupId> 
               <artifactId>jackson-mapper-asl</artifactId> 
               <version>1.9.13</version> 
         </dependency> 
 
         <dependency> 
               <groupId>org.apache.hadoop</groupId> 
               <artifactId>hadoop-client</artifactId> 
               <version>2.2.0</version> 
               <exclusions> 
                     <exclusion> 
                        ...
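The sentiment bolt described in this section ultimately wraps scoring logic applied to each tweet's text. The following is an assumed word-list sketch, not the book's implementation — a real bolt would load a full lexicon (for example, AFINN) rather than these hypothetical hard-coded words:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sentiment scorer: the sentiment bolt's execute() method would
// call score() on the tweet text extracted from the input JSON record.
public class SentimentScorer {
    // Toy word lists for illustration; a real implementation loads a lexicon.
    private static final Set<String> POSITIVE =
            new HashSet<>(Arrays.asList("good", "great", "happy", "love"));
    private static final Set<String> NEGATIVE =
            new HashSet<>(Arrays.asList("bad", "sad", "hate", "terrible"));

    // Returns a score in [-1, 1]: (positive - negative) over matched words,
    // or 0.0 when the tweet contains no sentiment-bearing words.
    public static double score(String tweetText) {
        int pos = 0, neg = 0;
        for (String word : tweetText.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(word)) pos++;
            else if (NEGATIVE.contains(word)) neg++;
        }
        int total = pos + neg;
        return total == 0 ? 0.0 : (double) (pos - neg) / total;
    }

    public static void main(String[] args) {
        System.out.println(score("I love this great day"));    // 1.0
        System.out.println(score("bad and terrible service")); // -1.0
    }
}
```

The bolt would then emit the tweet together with its score, and the HDFS bolt downstream would persist the pair.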

Summary


In this chapter, we covered how we can read tweets using the Twitter streaming API, process them to extract the tweet text from the input JSON records, calculate the sentiment of each tweet, and store the final output in HDFS.

With this, we come to the end of this book. Over the course of this book, we have come a long way from taking our first steps with Apache Storm to developing real-world applications with it. Here, we would like to summarize everything that we have learned.

We introduced you to the basic concepts and components of Storm, and covered how we can write and deploy/run the topology in both local and clustered mode. We also walked through the basic commands of Storm, and covered how we can modify the parallelism of the Storm topology at runtime. We also dedicated an entire chapter to monitoring Storm, which is an area often neglected during development, but is a critical part of any production setting. You also learned about Trident, which...

