You're reading from Big Data Analytics with Java

Product type: Book
Published in: Jul 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781787288980
Edition: 1st Edition
Author (1)
RAJAT MEHTA

The author is a VP (Technical Architect) in technology at JP Morgan Chase in New York. He is a Sun Certified Java Developer and has worked on Java-related technologies for more than 16 years. For the past few years, his role has heavily involved using the big data stack and running analytics on it. He also contributes to various open source projects that are available on his GitHub repository and is a frequent writer for dev magazines.

Chapter 6. Naive Bayes and Sentiment Analysis

A few years ago, a friend and I built a forum where developers could post useful tips about the technologies they were using. I wish I had known about the Naive Bayes machine learning algorithm then; it could have helped me filter out objectionable content posted on that forum. In the previous chapter, we saw two algorithms that can be used to predict continuous values or to classify between discrete sets of values. Both approaches predicted a definite value (whether continuous or discrete), but neither gave us the probability of its best guess. Naive Bayes attaches a probability to each predicted result, so among the candidate categories we can pick the one with the highest probability.

In this chapter, we will cover:

  • General concepts about probability and conditional probability. This section is basic, and readers who already know the material can skip it.

  • We will cover...

Conditional probability


Conditional probability in simple terms is the probability of occurrence of an event given that another event has already occurred. It is given by the following formula:

P(B|A)= P(A and B)/P(A)

The terms in this formula stand for:

Probability value    Description
P(B|A)               This is the probability of occurrence of event B given that event A has already occurred.
P(A and B)           The probability that both event A and B occur.
P(A)                 This is the probability of occurrence of an event A.

Now let's try to understand this using an example. Suppose we have a set of seven figures as follows:

As seen in the preceding figure, we have three triangles and four rectangles. So if we randomly pull one figure from this set, the probability of drawing each kind of figure is:

P(triangle) = Number of triangles / Total number of figures = 3 / 7

P(rectangle) = Number of rectangles / Total number of figures = 4 / 7
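
The counting above, together with the conditional probability formula, can be sketched in a few lines of Java. This is only an illustration: the three triangles and four rectangles come from the example, but the colors assigned to the figures, and the class and method names, are assumptions made up for this sketch.

```java
import java.util.Arrays;
import java.util.List;

final class ConditionalProbabilityDemo {

    // A figure with a shape and a color. The colors are hypothetical.
    static final class Figure {
        final String shape;
        final String color;

        Figure(String shape, String color) {
            this.shape = shape;
            this.color = color;
        }
    }

    // Three triangles and four rectangles, as in the text; the color split is assumed.
    static List<Figure> sampleFigures() {
        return Arrays.asList(
                new Figure("triangle", "green"),
                new Figure("triangle", "green"),
                new Figure("triangle", "red"),
                new Figure("rectangle", "green"),
                new Figure("rectangle", "red"),
                new Figure("rectangle", "red"),
                new Figure("rectangle", "red"));
    }

    // P(shape) = number of figures with that shape / total number of figures.
    static double probabilityOfShape(List<Figure> figures, String shape) {
        long matches = figures.stream().filter(f -> f.shape.equals(shape)).count();
        return (double) matches / figures.size();
    }

    // P(color | shape) = P(shape and color) / P(shape).
    static double probabilityOfColorGivenShape(List<Figure> figures, String color, String shape) {
        double pShape = probabilityOfShape(figures, shape);
        if (pShape == 0.0) {
            throw new IllegalArgumentException("No figure has shape: " + shape);
        }
        long both = figures.stream()
                .filter(f -> f.shape.equals(shape) && f.color.equals(color))
                .count();
        double pShapeAndColor = (double) both / figures.size();
        return pShapeAndColor / pShape;
    }

    public static void main(String[] args) {
        List<Figure> figures = sampleFigures();
        System.out.println("P(triangle) = " + probabilityOfShape(figures, "triangle"));
        System.out.println("P(rectangle) = " + probabilityOfShape(figures, "rectangle"));
        System.out.println("P(green | triangle) = "
                + probabilityOfColorGivenShape(figures, "green", "triangle"));
    }
}
```

With the assumed colors, two of the three triangles are green, so P(green | triangle) works out to (2 / 7) / (3 / 7) = 2 / 3.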

Now suppose we break the figure into two individual sets...

Bayes theorem


The Bayes theorem is based on the concept of learning from experience, that is, using a sequence of steps to arrive at a prediction. It calculates the probability of an event based on prior knowledge of occurrences that might have led to it. Bayes theorem is given by the following formula:

P(A | B) = P(B | A) * P(A) / P(B)

Where:

Probability value    Description
P(A | B)             Conditional probability of event A given that event B has occurred.
P(B | A)             Conditional probability of event B given that event A has occurred.
P(A)                 Individual probability of event A without regard to event B.
P(B)                 Individual probability of event B without regard to event A.
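
As a rough sketch, the four quantities in the table combine as follows in Java. The numbers in main are hypothetical and are not the values from the book's figure: they assume two equally likely sets, with a green triangle making up half of Set-1 and a quarter of Set-2. The class and method names are likewise invented for illustration.

```java
final class BayesTheoremDemo {

    // P(A | B) = P(B | A) * P(A) / P(B)
    static double posterior(double pBGivenA, double pA, double pB) {
        if (pB <= 0.0) {
            throw new IllegalArgumentException("P(B) must be positive");
        }
        return pBGivenA * pA / pB;
    }

    // P(B) when A either happens or does not happen:
    // P(B) = P(B | A) * P(A) + P(B | not A) * (1 - P(A))
    static double totalProbability(double pBGivenA, double pA, double pBGivenNotA) {
        return pBGivenA * pA + pBGivenNotA * (1.0 - pA);
    }

    public static void main(String[] args) {
        double pSet1 = 0.5;            // assumed: both sets are equally likely to be chosen
        double pGreenGivenSet1 = 0.5;  // assumed share of green triangles in Set-1
        double pGreenGivenSet2 = 0.25; // assumed share of green triangles in Set-2

        double pGreen = totalProbability(pGreenGivenSet1, pSet1, pGreenGivenSet2);
        double pSet1GivenGreen = posterior(pGreenGivenSet1, pSet1, pGreen);

        System.out.println("P(green triangle) = " + pGreen);
        System.out.println("P(Set-1 | green triangle) = " + pSet1GivenGreen);
    }
}
```

Under these assumed numbers, P(green triangle) is 0.375 and the probability that the figure came from Set-1 is 0.25 / 0.375 = 2 / 3.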

Let's understand this using the same example we used previously. Suppose we randomly picked a green triangle from one of the sets; what is the probability that it came from Set-1?

Before we apply the Bayes theorem formula, we will first calculate the individual probabilities:

  • Probability of randomly picking a set from one of the two sets, Set-1 and Set-2

    Since there...

Naive Bayes algorithm


Have you ever wondered how your Gmail application automatically figures out that a message you have received is spam and puts it in the spam folder? Behind the email spam detector runs a machine learning algorithm that automatically detects whether a particular email is spam or useful. A classic algorithm for this kind of filtering, one that works behind the scenes and saves you the hours you would otherwise waste checking or deleting spam emails, is Naive Bayes. As the name suggests, the algorithm is based on the Bayes theorem. It is simple yet powerful: from the perspective of classification, it figures out the probability of each discrete class and picks the class with the highest probability.

You might have wondered why the algorithm carries the word Naive in its name. It's because the algorithm makes the naive assumption that the features present in a dataset are independent of each other. Suppose...
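
To make the idea concrete, here is a minimal sketch of a word-count Naive Bayes classifier in plain Java. It is not the implementation used in the book or in Spark: the class name, the tiny training set, the use of log probabilities, and the add-one (Laplace) smoothing are all assumptions added for this illustration. The independence assumption shows up in logScore, where the per-word probabilities are simply added (in log space) without considering how words interact.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TinyNaiveBayes {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWordsPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Record one labeled document by counting its words under the given label.
    void train(String label, String text) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            counts.merge(word, 1, Integer::sum);
            totalWordsPerClass.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // log P(class) + sum of log P(word | class), with add-one smoothing.
    // The label must be one that was seen during training.
    double logScore(String label, String text) {
        double score = Math.log((double) docsPerClass.get(label) / totalDocs);
        Map<String, Integer> counts = wordCounts.get(label);
        int totalWords = totalWordsPerClass.getOrDefault(label, 0);
        for (String word : tokenize(text)) {
            int count = counts.getOrDefault(word, 0);
            score += Math.log((count + 1.0) / (totalWords + vocabulary.size()));
        }
        return score;
    }

    // Pick the class with the highest score, that is, the highest probability.
    String predict(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            double score = logScore(label, text);
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    // Lowercase and split on whitespace; a real tokenizer would do more.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().toLowerCase().split("\\s+"));
    }

    // A made-up four-message training set, for illustration only.
    static TinyNaiveBayes trainedOnSample() {
        TinyNaiveBayes model = new TinyNaiveBayes();
        model.train("spam", "win money now");
        model.train("spam", "free money offer");
        model.train("ham", "meeting at noon");
        model.train("ham", "project status meeting");
        return model;
    }

    public static void main(String[] args) {
        TinyNaiveBayes model = trainedOnSample();
        System.out.println("free money -> " + model.predict("free money"));
        System.out.println("status meeting -> " + model.predict("status meeting"));
    }
}
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a word that never appeared in a class from forcing that class's probability to zero.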

Sentiment analysis


As we showed in the previous examples, Naive Bayes has extensive usage in text analysis.

One form of text analysis is sentiment analysis. As the name suggests, this technique is used to figure out the sentiment or emotion associated with the underlying text. So if you have a piece of text and you want to understand what kind of emotion it conveys (for example, anger, love, hate, positive, negative, and so on), you can use sentiment analysis. Sentiment analysis is used in various places, for example:

  • To analyze the reviews of a product whether they are positive or negative

  • This can be especially useful to predict how successful your new product is by analyzing user feedback

  • To analyze the reviews of a movie to check if it's a hit or a flop

  • Detecting the use of bad language (such as heated language, negative remarks, and so on) in forums, emails, and social media

  • To analyze the content of tweets or information on other social media to check if a political...

SVM or Support Vector Machine


This is another popular algorithm that is used in many real-life applications such as text categorization, image classification, sentiment analysis, and handwritten digit recognition. The support vector machine algorithm can be used for both classification and regression. Spark has an implementation of linear SVM, which is a binary classifier. If the datapoints are plotted on a chart, the SVM algorithm creates a hyperplane between them. The algorithm finds the closest points with different labels within the dataset and places the hyperplane between those points. The hyperplane is positioned so that it is at the maximum distance from these closest points; this way, it cleanly separates the data. To figure out this maximum distance for the location of the hyperplane, the SVM algorithm uses a kernel function (a mathematical function).
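
The role of the hyperplane can be sketched in plain Java. This is not Spark's API and it does not train anything: the weight vector w and offset b below are hypothetical values standing in for what a trained linear SVM would produce, and the class and method names are made up for this illustration. The sketch only shows how a hyperplane w . x + b = 0 splits points into two classes and how far a point lies from it.

```java
final class LinearHyperplaneDemo {

    // Signed value of w . x + b; zero means the point lies on the hyperplane.
    static double decisionValue(double[] w, double b, double[] x) {
        if (w.length != x.length) {
            throw new IllegalArgumentException("w and x must have the same length");
        }
        double sum = b;
        for (int i = 0; i < w.length; i++) {
            sum += w[i] * x[i];
        }
        return sum;
    }

    // Binary label: +1 on one side of the hyperplane (or on it), -1 on the other.
    static int classify(double[] w, double b, double[] x) {
        return decisionValue(w, b, x) >= 0 ? 1 : -1;
    }

    // Perpendicular distance from x to the hyperplane: |w . x + b| / ||w||.
    static double distanceToHyperplane(double[] w, double b, double[] x) {
        double squaredNorm = 0.0;
        for (double wi : w) {
            squaredNorm += wi * wi;
        }
        return Math.abs(decisionValue(w, b, x)) / Math.sqrt(squaredNorm);
    }

    public static void main(String[] args) {
        // Hypothetical hyperplane: the line x1 = x2 in two dimensions.
        double[] w = {1.0, -1.0};
        double b = 0.0;

        double[] first = {3.0, 1.0};
        double[] second = {1.0, 3.0};

        System.out.println("label of (3, 1): " + classify(w, b, first));
        System.out.println("label of (1, 3): " + classify(w, b, second));
        System.out.println("distance of (3, 1): " + distanceToHyperplane(w, b, first));
    }
}
```

In a trained SVM, the closest points of each class sit at the same minimal distance from the hyperplane, and training chooses w and b so that this distance is as large as possible.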

As you can see in the image, we have two different types of datapoints, one clustered on the X2 axis...

Summary


This chapter covered a lot of ground on two important topics. First, we covered a popular probabilistic algorithm, Naive Bayes, explained its concepts, and showed how it uses the Bayes rule and conditional probability to make predictions about new data using a pre-trained model. We also explained why Naive Bayes is called naive: it makes the naive assumption that all its features are completely independent of each other, so that the occurrence of one feature does not affect any other. Despite this, it performs well, as we saw in our sample application. In that application, we learned a technique called sentiment analysis for figuring out the opinion, whether positive or negative, expressed in a piece of text.

In the next chapter, we will study another popular machine learning algorithm called decision tree. We will show how it is very similar to a flowchart and we will explain it using a sample loan approval application.
