Processing Tweets with Apache Hive

by Bahaaldine Azarmi | February 2014 | Open Source

In this article by Bahaaldine Azarmi, the author of Talend for Big Data, we'll learn how to use tweets to highlight sentiments by performing the following actions:

  • Extracting hashtags and emoticons from tweets
  • Joining the newly created hashtags and emoticon tables to create sentiments


Extracting hashtags

In this part and the following one, we'll see how to efficiently extract data such as hashtags and emoticons from tweets. We need this because we want to know what the most discussed topics are and to gauge the mood across the tweets. We'll then join that information to derive people's sentiments.

We'll start with hashtags; to do so, we need to do the following:

  1. Create a new hashtag table.
  2. Use a function that will extract the hashtags from the tweet string.
  3. Feed the hashtag table with the result of the extracted function.

So, I have some bad news and good news:

  • Bad news: Hive provides a lot of built-in user-defined functions (UDFs), but unfortunately none that extracts all matches of a regex pattern, so we need a custom UDF to do that. This is not such bad news, though, as you will learn how to do it.
  • Good news: Hive provides an extremely efficient way to build a Hive table from an array. We'll use a lateral view together with the explode Hive UDF to do that; a minimal sketch follows this list.
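
To make the good news concrete, here is a minimal sketch of what a lateral view combined with explode does. It uses a hardcoded array instead of real tweet content, and demo_tweets is a hypothetical table with a single id column:

    -- explode() turns the array into one row per element; the lateral view joins
    -- those rows back to each source row of demo_tweets (hypothetical table)
    select id, tag
    from demo_tweets
    LATERAL VIEW explode(array('bigdata', 'hive', 'talend')) tagTable as tag;
    -- each row of demo_tweets now yields three rows: (id, 'bigdata'), (id, 'hive'), (id, 'talend')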

The following is the Hive-processing workflow that we are going to apply to our tweets:

Hive-processing workflow

The preceding diagram describes the workflow to be followed to extract the hashtags. The steps are basically as follows:

  1. Receive the tweets.
  2. Detect all the hashtags using the custom Hive user-defined function.
  3. Obtain an array of hashtags.
  4. Explode it and obtain a lateral view to feed our hashtags table.

This kind of processing is really useful when we want to get a feel for what the top tweeted topics are, and the result is most often represented by a word cloud chart like the one shown in the following diagram:

Topic word cloud sample

Let's do this by creating a new CH04_01_HIVE_PROCESSING_HASH_TAGS job under a new Chapter4 folder. This job will contain six components:

  • One to connect to Hive; you can easily copy and paste the connection component from the CH03_03_HIVE_FORMAT_DATA job
  • One tHiveRow to add the custom Hive UDF JAR to the Hive runtime classpath
  • One tHiveRow to register the custom UDF in the Hive UDF catalog
  • One tHiveRow to drop the hashtags table if it exists
  • One tHiveRow to create the hashtags table
  • One tHiveRow to feed the hashtags table

The following are the steps to create the new job:

  1. First, we will add the following context variable to our PacktContext group:

    Name     custom_udf_jar
    Value    PATH_TO_THE_JAR (for example, /Users/bahaaldine/here/is/the/jar/extractHashTags.jar)

    This new context variable is just the path to the Hive UDF JAR file provided in the source files.

  2. Now, we can add the "add jar "+context.custom_udf_jar Hive query to our tHiveRow component to load the JAR file into the classpath when the job runs.
  3. We use the add jar query so that Hive loads all the classes in the JAR file when the job starts, as shown in the following screenshot:

    Adding a Custom UDF JAR to Hive classpath.

  4. After the JAR file is loaded by the previous component, we need a tHiveRow to register the custom UDF in the catalog of available UDFs. The custom UDF is a Java class with a bunch of methods that can be invoked from HiveQL code. The one we need is located in the org.talend.demo package of the JAR file and is named ExtractPattern. So we will simply add the "create temporary function extract_patterns as 'org.talend.demo.ExtractPattern'" query to the component.

    We use the create temporary function query to register a new extract_patterns function in the Hive UDF catalog and point it to the implementation class contained in our package. A consolidated view of these setup statements, as they would be typed directly in the Hive shell, is sketched right after this list.

  5. We need one tHiveRow to drop the hashtags table if it exists. As we have done in the CH03_02_HIVE_CREATE_TWEET_TABLE job, just add the "DROP TABLE IF EXISTS hash_tags" drop statement to be sure that the table is removed when we relaunch the job.
  6. We need one tHiveRow to create the hashtags table. We are going to create a new table to store the hashtags. For the purpose of simplicity, we'll only store the minimum time and description information as shown in the following table:

    Name              Type
    hash_tags_id      String
    day_of_week       String
    day_of_month      String
    time              String
    month             String
    hash_tags_label   String

  7. The essential information here is the hash_tags_label column, which contains the hashtag name. With this knowledge, the following is our create table query:

    CREATE EXTERNAL TABLE hash_tags (
      hash_tags_id string,
      day_of_week string,
      day_of_month string,
      time string,
      month string,
      hash_tags_label string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
    LOCATION '/user/"+context.hive_user+"/packt/chp04/hashtags'

  8. Finally, we need a tHiveRow component to feed the hashtags table. Here, we are going to use all the assets provided by the previous components, as shown in the following query:

    insert into table hash_tags
    select
      concat(formatted_tweets.day_of_week, formatted_tweets.day_of_month,
             formatted_tweets.time, formatted_tweets.month) as hash_id,
      formatted_tweets.day_of_week,
      formatted_tweets.day_of_month,
      formatted_tweets.time,
      formatted_tweets.month,
      hash_tags_label
    from formatted_tweets
    LATERAL VIEW explode(
      extract_patterns(formatted_tweets.content, '#(\\\\w+)')
    ) hashTable as hash_tags_label
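
Before we dissect that insert query, here is a consolidated sketch of the setup statements from steps 2 and 4 as they would look if typed directly in the Hive shell rather than sent from the tHiveRow components; the JAR path is only illustrative and comes from the custom_udf_jar context variable:

    -- load the custom UDF JAR into the Hive classpath (illustrative path)
    add jar /Users/bahaaldine/here/is/the/jar/extractHashTags.jar;

    -- register the UDF class under the name extract_patterns
    create temporary function extract_patterns as 'org.talend.demo.ExtractPattern';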

Let's analyze the query from the end to the beginning. The last part of the query uses the extract_patterns function to parse formatted_tweets.content and extract all hashtags based on the regex #(\w+).

In Talend, all strings are Java string objects, so every backslash has to be escaped once for Java. Hive then needs its own escaping of special characters, which is how we end up with four backslashes.
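
To see what each layer of escaping does, here is a sketch of the same call issued directly from the Hive shell, where only Hive's escaping is needed. It assumes the extract_patterns function has been registered as in step 4 and returns every match of the capture group; the sample text and the from ... limit 1 clause are only there to give Hive something to evaluate:

    -- in the Talend job the Java string holds four backslashes: '#(\\\\w+)'
    -- typed directly in the Hive shell, two are enough: '#(\\w+)'
    -- the regex actually applied to the text is #(\w+)
    select extract_patterns('Loving #bigdata with #hive', '#(\\w+)')
    from formatted_tweets limit 1;
    -- expected result: an array such as ["bigdata","hive"]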

The extract_patterns function returns an array, which we pass to the explode Hive UDF to obtain one row per element. We then use the lateral view statement, which creates a new on-the-fly view called hashTable with a single column, hash_tags_label. Take a breath. We are almost done.

If we go one level up, we see that we select all the required columns for our new hash_tags table, concatenate data to build hash_id, and dynamically select the runtime-built hash_tags_label column provided by the lateral view.

Finally, all the selected data is inserted into the hash_tags table.

We just need to run the job, and then check in Hive that the new table contains our hashtags, using the following query:

select * from hash_tags;
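
Once the table is populated, a simple aggregation gives the counts behind a word cloud like the one shown earlier. This query is not part of the job, just a possible next step:

    -- top 50 hashtags by number of occurrences
    select hash_tags_label, count(*) as occurrences
    from hash_tags
    group by hash_tags_label
    order by occurrences desc
    limit 50;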

The following diagram shows the complete hashtags-extracting job structure:

Hive processing job

Summary

By now, you should have a good overview of how to use Apache Hive features with Talend, from the ELT mode to the lateral view, by way of custom Hive user-defined functions. From a use-case point of view, we have now reached the step where we need to reveal some added-value data from our Hive-processed data.

About the Author


Bahaaldine Azarmi

Bahaaldine Azarmi is the cofounder of reach5.co. Having previously worked at Oracle and Talend, he specializes in real-time architectures using service-oriented architecture products, Big Data projects, and web technologies.
