You're reading from Big Data Analytics with Java

Product type: Book
Published in: Jul 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781787288980
Edition: 1st Edition
Languages: Java
Tools: Apache Spark, Hadoop
Concepts: Big Data
Author: RAJAT MEHTA

Chapter 6. Naive Bayes and Sentiment Analysis

A few years back, a friend and I built a forum where developers could post useful tips about the technologies they were using. I wish I had known about the Naive Bayes machine learning algorithm then; it could have helped me filter out objectionable content posted on that forum. In the previous chapter, we saw two algorithms that can be used to predict continuous values or to classify between discrete sets of values. Both approaches predicted a definite value (whether continuous or discrete), but they did not give us the probability of our best guesses. Naive Bayes attaches a probability to each predicted result, so from a set of candidate results we can pick the one with the highest probability.

In this chapter, we will cover:

General concepts about probability and conditional probability. This section is basic, and readers who already know the material can skip it.
We will cover...

Conditional probability

Conditional probability in simple terms is the probability of occurrence of an event given that another event has already occurred. It is given by the following formula:

P(B|A) = P(A and B) / P(A)

In this formula, the terms stand for:

Probability value	Description
P(B|A)	This is the probability of occurrence of event B given that event A has already occurred.
P(A and B)	The probability that both events A and B occur.
P(A)	This is the probability of occurrence of event A.

Now let's try to understand this using an example. Suppose we have a set of seven figures as follows:

As seen in the preceding figure, we have three triangles and four rectangles. So if we randomly pull one figure from this set, the probability of drawing each kind of shape is:

P(triangle) = Number of triangles / Total number of figures = 3 / 7

P(rectangle) = Number of rectangles / Total number of figures = 4 / 7

Now suppose we break the figure into two individual sets...
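
The conditional probability formula above can be sketched in a few lines of Java. This is only an illustration: the counts of seven figures and three triangles come from the text, while the number of green triangles is a hypothetical value chosen for the example, and the class and method names are made up here.

```java
// A minimal sketch of P(B|A) = P(A and B) / P(A) using counts of shapes.
class ConditionalProbability {

    /** Returns P(B|A) given P(A and B) and P(A). */
    static double conditional(double pAandB, double pA) {
        if (pA <= 0.0) {
            throw new IllegalArgumentException("P(A) must be positive");
        }
        return pAandB / pA;
    }

    public static void main(String[] args) {
        int total = 7;          // from the text: seven figures
        int triangles = 3;      // from the text: three triangles
        int greenTriangles = 2; // ASSUMPTION: hypothetical count, not from the text

        // A = "the figure is a triangle", B = "the figure is green"
        double pTriangle = (double) triangles / total;              // P(A) = 3/7
        double pGreenAndTriangle = (double) greenTriangles / total; // P(A and B) = 2/7

        double pGreenGivenTriangle = conditional(pGreenAndTriangle, pTriangle);
        System.out.println("P(triangle) = " + pTriangle);
        System.out.println("P(green | triangle) = " + pGreenGivenTriangle);
    }
}
```

With these assumed counts, the division reduces to 2 / 3: once we know the figure is a triangle, only the three triangles remain as possible outcomes.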

Bayes theorem

Bayes' theorem is based on the concept of learning from experience, that is, using a sequence of steps to arrive at a prediction. It calculates a probability based on prior knowledge of occurrences that might have led to the event. Bayes' theorem is given by the following formula:

P(A|B) = P(B|A) * P(A) / P(B)

Where:

Probability value	Description
P(A|B)	Conditional probability of event A given that event B has occurred.
P(B|A)	Conditional probability of event B given that event A has occurred.
P(A)	Individual probability of event A without regard to event B.
P(B)	Individual probability of event B without regard to event A.

Let's understand this using the same example as before. Suppose we randomly picked one green triangle; what is the probability that it came from Set-1?

Before we apply the Bayes' theorem formula, we will first calculate the individual probabilities:

Probability of randomly picking a set from one of the two sets, Set-1 and Set-2
Since there...
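
As a rough sketch of how such a calculation proceeds, the Java snippet below applies Bayes' theorem to the two-set question. The contents of Set-1 and Set-2 are not reproduced here, so every number in the code is a hypothetical stand-in: it assumes both sets are equally likely to be picked and uses made-up chances of drawing a green triangle from each set. The class and method names are also illustrative.

```java
// A minimal sketch of Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
class BayesTheorem {

    /** Returns P(A|B) from P(B|A), P(A) and P(B). */
    static double posterior(double pBGivenA, double pA, double pB) {
        if (pB <= 0.0) {
            throw new IllegalArgumentException("P(B) must be positive");
        }
        return pBGivenA * pA / pB;
    }

    public static void main(String[] args) {
        // ASSUMPTION: the two sets are equally likely to be chosen.
        double pSet1 = 0.5;
        double pSet2 = 0.5;

        // ASSUMPTION: hypothetical chances of drawing a green triangle
        // from each set; these are not the values from the book's figure.
        double pGreenGivenSet1 = 0.4;
        double pGreenGivenSet2 = 0.1;

        // P(green triangle) via the law of total probability.
        double pGreen = pGreenGivenSet1 * pSet1 + pGreenGivenSet2 * pSet2;

        // P(Set-1 | green triangle)
        double pSet1GivenGreen = posterior(pGreenGivenSet1, pSet1, pGreen);
        System.out.println("P(green triangle) = " + pGreen);
        System.out.println("P(Set-1 | green triangle) = " + pSet1GivenGreen);
    }
}
```

Under these assumed numbers the posterior works out to 0.4 * 0.5 / 0.25 = 0.8, so observing a green triangle makes Set-1 considerably more likely than the initial 0.5.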

Naive Bayes algorithm

Have you ever wondered how Gmail automatically figures out that a message you have received is spam and puts it in the spam folder? Behind the spam detector runs a powerful machine learning algorithm that detects whether a particular email is spam or useful. A classic algorithm for this kind of spam detection, one that saves you hours of checking and deleting spam emails, is Naive Bayes. As the name suggests, the algorithm is based on Bayes' theorem. The algorithm is simple yet powerful. For classification, it computes the probability of each discrete class and picks the class with the highest probability.

You might have wondered why the algorithm carries the word Naive in its name. It's because the algorithm makes the naive assumption that the features present in a dataset are independent of each other. Suppose...
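
To make the independence assumption concrete, here is a small Java sketch of how a Naive Bayes score can be formed: the class prior is multiplied by the probability of each feature given the class, as if the features had nothing to do with each other. The spam and non-spam priors and the per-word probabilities below are invented for illustration, as are the class and method names.

```java
// A minimal sketch of Naive Bayes scoring under the independence assumption:
// score(class) = P(class) * P(f1|class) * P(f2|class) * ...
class NaiveBayesScore {

    /** Unnormalized score: prior times the product of per-feature likelihoods. */
    static double score(double prior, double[] featureLikelihoods) {
        double result = prior;
        for (double p : featureLikelihoods) {
            result *= p;
        }
        return result;
    }

    /** Rescales the scores so that they sum to 1. */
    static double[] normalize(double[] scores) {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] / sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // ASSUMPTION: hypothetical priors and word likelihoods for an email
        // containing the words "free" and "offer".
        double spamScore = score(0.4, new double[] {0.5, 0.4});  // 0.4 * 0.5 * 0.4
        double hamScore = score(0.6, new double[] {0.05, 0.1});  // 0.6 * 0.05 * 0.1

        double[] probs = normalize(new double[] {spamScore, hamScore});
        System.out.println("P(spam | words) = " + probs[0]);
        System.out.println("P(not spam | words) = " + probs[1]);
        System.out.println("Prediction: " + (probs[0] > probs[1] ? "spam" : "not spam"));
    }
}
```

The class with the highest normalized score is picked as the prediction, which matches the "highest probability wins" rule described above.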

Sentiment analysis

As we showed in the previous examples, Naive Bayes is used extensively in text analysis.

One form of text analysis is sentiment analysis. As the name suggests, this technique is used to figure out the sentiment or emotion associated with a piece of text. So if you have a piece of text and want to understand what kind of emotion it conveys (for example, anger, love, hate, positive, negative, and so on), you can use sentiment analysis. Sentiment analysis is used in various places, for example:

To analyze whether the reviews of a product are positive or negative
This can be especially useful for predicting how successful your new product will be by analyzing user feedback
To analyze the reviews of a movie to check whether it's a hit or a flop
To detect the use of bad language (such as heated language, negative remarks, and so on) in forums, emails, and social media
To analyze the content of tweets or information on other social media to check if a political...
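
As an illustration of the idea, the following plain Java sketch trains a tiny Naive Bayes text classifier on a handful of labeled sentences and predicts the sentiment of new text. It is not the chapter's sample application: the training sentences are invented, the class name is hypothetical, and it uses add-one (Laplace) smoothing and log probabilities as common, assumed design choices.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A minimal word-count Naive Bayes sentiment classifier (illustrative only).
class TinySentimentClassifier {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // label -> word -> count
    private final Map<String, Integer> totalWords = new HashMap<>();              // label -> token count
    private final Map<String, Integer> docCounts = new HashMap<>();               // label -> document count
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    private static String[] tokenize(String text) {
        return text.toLowerCase().split("\\W+");
    }

    /** Adds one labeled document to the model's counts. */
    void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            if (word.isEmpty()) {
                continue;
            }
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    /** Returns the label with the highest log probability, or null if untrained. */
    String predict(String text) {
        String best = null;
        double bestLogProb = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // Start from the log prior of the class.
            double logProb = Math.log((double) docCounts.get(label) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String word : tokenize(text)) {
                if (word.isEmpty()) {
                    continue;
                }
                // Add-one smoothing so unseen words do not zero out the score.
                int count = counts.getOrDefault(word, 0);
                logProb += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (logProb > bestLogProb) {
                bestLogProb = logProb;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinySentimentClassifier classifier = new TinySentimentClassifier();
        // ASSUMPTION: hypothetical training reviews, invented for this sketch.
        classifier.train("positive", "great movie loved it");
        classifier.train("positive", "wonderful acting great story");
        classifier.train("negative", "terrible movie hated it");
        classifier.train("negative", "boring plot awful acting");

        System.out.println(classifier.predict("great story"));
        System.out.println(classifier.predict("awful boring plot"));
    }
}
```

Summing log probabilities instead of multiplying raw probabilities avoids numeric underflow on longer texts; on a real dataset, a library implementation such as the one in Spark would replace this hand-rolled counting.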

SVM or Support Vector Machine

This is another popular algorithm used in many real-life applications such as text categorization, image classification, sentiment analysis, and handwritten digit recognition. The support vector machine algorithm can be used for both classification and regression. Spark has an implementation of linear SVM, which is a binary classifier. If the datapoints are plotted on a chart, the SVM algorithm creates a hyperplane between them. The algorithm finds the closest points with different labels within the dataset and places the hyperplane between those points, at the maximum distance from these closest points, so that the hyperplane cleanly separates the data. When the data cannot be separated by a straight line or plane in its original space, the SVM algorithm uses a kernel function (a mathematical function) to implicitly map the datapoints into a space where such a hyperplane can be found.

As you can see in the image, we have two different types of datapoints, one clustered on the X2 axis...
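
The role of the hyperplane can be sketched in Java as follows. This snippet does not train an SVM; it only shows how an already chosen hyperplane, written as w·x + b = 0, classifies a point by the sign of w·x + b and how far that point lies from the hyperplane. The weights, bias, and sample points are hypothetical values picked for the example, and the class name is illustrative.

```java
// A minimal sketch of a linear decision boundary w.x + b = 0 (no training).
class LinearHyperplane {
    private final double[] weights;
    private final double bias;

    LinearHyperplane(double[] weights, double bias) {
        this.weights = weights.clone();
        this.bias = bias;
    }

    /** Computes w.x + b for a point x. */
    double decision(double[] x) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum;
    }

    /** Returns +1 or -1 depending on which side of the hyperplane x lies. */
    int classify(double[] x) {
        return decision(x) >= 0 ? 1 : -1;
    }

    /** Geometric distance from x to the hyperplane: |w.x + b| / ||w||. */
    double distance(double[] x) {
        double normSquared = 0.0;
        for (double w : weights) {
            normSquared += w * w;
        }
        return Math.abs(decision(x)) / Math.sqrt(normSquared);
    }

    public static void main(String[] args) {
        // ASSUMPTION: a hypothetical hyperplane x1 + x2 - 3 = 0 in two dimensions.
        LinearHyperplane h = new LinearHyperplane(new double[] {1.0, 1.0}, -3.0);
        double[][] points = { {3.0, 3.0}, {1.0, 1.0} };
        for (double[] p : points) {
            System.out.println("class = " + h.classify(p) + ", distance = " + h.distance(p));
        }
    }
}
```

Training an SVM amounts to choosing the weights and bias so that the smallest of these distances, taken over the closest points of each label, is as large as possible.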

Summary

This chapter covered a lot of ground on two important topics. First, we covered a popular probabilistic algorithm, Naive Bayes, explained its concepts, and showed how it uses Bayes' rule and conditional probability to make predictions about new data using a pre-trained model. We also explained why Naive Bayes is called naive: it makes the naive assumption that all its features are completely independent of each other, so that the occurrence of one feature does not affect the others in any way. Despite this, it performs well, as we saw in our sample application. Second, in our sample application we learned a technique called sentiment analysis for figuring out the opinion, whether positive or negative, expressed in a piece of text.

In the next chapter, we will study another popular machine learning algorithm called decision tree. We will show how it is very similar to a flowchart and we will explain it using a sample loan approval application.

Author (1)

RAJAT MEHTA

The author is a VP (Technical Architect) in technology at JP Morgan Chase in New York. The author is a Sun-certified Java developer and has worked on Java-related technologies for more than 16 years. The author's current role for the past few years has heavily involved using the big data stack and running analytics on it. The author is also a contributor to various open source projects that are available on his GitHub repository and is a frequent writer for dev magazines.