You're reading from Scala and Spark for Big Data Analytics

Product type: Book
Published in: Jul 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781785280849
Edition: 1st Edition
Authors (2):
Md. Rezaul Karim

Md. Rezaul Karim is a researcher, author, and data science enthusiast with a strong computer science background, coupled with 10 years of research and development experience in machine learning, deep learning, and data mining algorithms to solve emerging bioinformatics research problems by making them explainable. He is passionate about applied machine learning, knowledge graphs, and explainable artificial intelligence (XAI). Currently, he is working as a research scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Germany. Before joining FIT, he worked as a researcher at the Insight Centre for Data Analytics, Ireland. Previously, he worked as a lead software engineer at Samsung Electronics, Korea.

Sridhar Alla

Sridhar Alla is the co-founder and CTO of Blue Whale Consulting and is an expert at helping companies (big and small) define their vision for systems and capabilities that will allow them to establish a strategic execution plan to deal with the ever-growing data collected to support analytics and product teams. He is very experienced at dealing with all aspects of data collection, security, governance, and processing as part of end-to-end big data analytics and machine learning initiatives (including predictive modeling, deep learning, and ML automation). Sridhar is a published book author and an avid presenter at numerous conferences, including Strata, Hadoop World, and Spark Summit. He also has several patents filed with the US PTO on large-scale computing and distributed systems. He has over 18 years' experience writing code in Scala, Java, C, C++, Python, R, and Go, and has extensive hands-on knowledge of Spark, Flink, TensorFlow, Keras, Hadoop, Cassandra, HBase, MongoDB, Riak, Redis, Zeppelin, Mesos, Docker, Kafka, ElasticSearch, Solr, H2O, machine learning, text analytics, distributed computing, and high-performance computing. Sridhar lives with his wife and daughter in New Jersey and in his spare time loves blogging and coaching organizations on next-generation advancements in technology and their alignment with business goals.

My Name is Bayes, Naive Bayes

"Prediction is very difficult, especially if it's about the future"

-Niels Bohr

Machine learning (ML) combined with big data is a powerful pairing that has had a significant impact on research in both academia and industry. Moreover, many research areas are turning to big data, since datasets are being generated at an unprecedented rate from diverse sources and technologies, a phenomenon commonly referred to as the Data Deluge. This imposes great challenges on ML, data analytics tools, and algorithms, which must find the real value in big data characterized by volume, velocity, and variety. However, making predictions from such huge datasets has never been easy.

Considering this challenge, in this chapter we will dig deeper into ML and find out how to use a simple yet powerful method to build a scalable classification...

Multinomial classification

In ML, multinomial (also known as multiclass) classification is the task of classifying data objects or instances into one of more than two classes. Classifying instances into exactly two classes is called binary classification. More technically, in multinomial classification, each training instance belongs to exactly one of N different classes, where N > 2. The goal is then to construct a model that correctly predicts the classes to which new instances belong. There are many scenarios in which data points can fall into multiple categories. However, if a given point can belong to several categories at once, the problem decomposes trivially into a set of unlinked binary problems, each of which can be solved naturally using a binary classification algorithm.
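The decomposition described above can be illustrated with a minimal sketch in plain Python (not the book's Spark code). It assumes each instance is a pair of a feature list and a set of labels. The function name `to_binary_problems` and the toy data are hypothetical, chosen only for illustration.

```python
def to_binary_problems(instances):
    """Split a multi-label dataset into one independent binary dataset per label.

    `instances` is a list of (features, labels) pairs, where `labels` is a set.
    Each returned dataset marks an instance 1 if it carries the label, else 0.
    """
    all_labels = sorted({label for _, labels in instances for label in labels})
    return {
        label: [(features, 1 if label in labels else 0)
                for features, labels in instances]
        for label in all_labels
    }


# Hypothetical toy data: each document may belong to several categories at once.
docs = [
    ([1.0, 0.0], {"sports"}),
    ([0.0, 1.0], {"politics", "economy"}),
    ([1.0, 1.0], {"sports", "economy"}),
]

problems = to_binary_problems(docs)
for label, dataset in problems.items():
    print(label, dataset)
```

Each resulting dataset can then be handed to any binary classifier, and the per-label predictions are simply collected, since the problems do not depend on one another.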

Readers should take care not to be confused when distinguishing...

Bayesian inference

In this section, we will briefly discuss Bayesian inference (BI) and its underlying theory. By the end, readers will be familiar with this concept from both the theoretical and computational viewpoints.

An overview of Bayesian inference

Bayesian inference is a statistical method based on Bayes' theorem. It is used to update the probability of a hypothesis as new evidence arrives, so that statistical models can be refined repeatedly toward more accurate learning. In other words, all types of uncertainty are expressed in terms of statistical probability in the Bayesian inference approach. This is an important technique in theoretical as well as mathematical statistics. We will discuss Bayes' theorem broadly in a later...
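The idea of repeatedly updating a hypothesis can be made concrete with a small sketch in plain Python. It assumes a discrete set of hypotheses and a hypothetical coin-flip scenario. The names `bayes_update`, `belief`, and `p_heads`, and the numbers used, are illustrative assumptions and not taken from the book.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over hypotheses: prior times likelihood, renormalized."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    return {h: p / evidence for h, p in unnormalized.items()}


# Two competing hypotheses about a coin (hypothetical numbers).
belief = {"fair": 0.5, "biased": 0.5}      # prior belief before any data
p_heads = {"fair": 0.5, "biased": 0.8}     # P(heads | hypothesis)

# Update the belief after each observed flip; yesterday's posterior is today's prior.
for flip in "HHTH":
    likelihood = {h: (p if flip == "H" else 1.0 - p) for h, p in p_heads.items()}
    belief = bayes_update(belief, likelihood)

print(belief)
```

After three heads and one tail, the belief shifts toward the biased hypothesis, which is the kind of evidence-driven refinement the paragraph describes.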

Naive Bayes

In ML, Naive Bayes (NB) is an example of a probabilistic classifier based on the well-known Bayes' theorem with strong independence assumptions between the features. We will discuss Naive Bayes in detail in this section.
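To show what the independence assumption means in practice, here is a minimal sketch of a categorical Naive Bayes classifier in plain Python. It is not the book's Spark implementation. The add-one (Laplace) smoothing, the function names, and the toy weather data are all assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (label, feature index) -> value counts
    vocab = defaultdict(set)              # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def predict(model, row):
    """Pick the class with the highest log posterior score."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # log P(class) + sum of log P(feature_i | class):
        # treating features as independent given the class is the "naive" step.
        score = math.log(count / total)
        for i, value in enumerate(row):
            seen = value_counts[(label, i)][value]
            # Add-one smoothing (an assumption of this sketch) avoids log(0).
            score += math.log((seen + 1) / (count + len(vocab[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: (outlook, temperature) -> play outside?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot")]
labels = ["no", "no", "yes", "yes", "yes"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("rainy", "cool")))
print(predict(model, ("sunny", "hot")))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once the number of features grows.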

An overview of Bayes' theorem

In probability theory, Bayes' theorem describes the probability of an event based on prior knowledge of conditions that are related to that event. This is a theorem of probability originally stated by the Reverend Thomas Bayes. In other words, it can be seen as a way of understanding how the probability that a theory is true is affected by a new piece of information. For example, if cancer is related to age, the information about age can be used...
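For reference, the standard statement of the theorem can be written as follows, where A and B are generic events with P(B) > 0. The symbols are this note's notation and not necessarily the book's. In the cancer and age setting mentioned above, A would stand for having cancer and B for the observed age.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

Here P(A) is the prior, P(B | A) is the likelihood, and P(A | B) is the posterior probability of A once B has been observed.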

The decision trees

In this section, we will discuss the decision tree (DT) algorithm in detail. A comparative analysis of Naive Bayes and DT will be discussed too. DTs are commonly considered a supervised learning technique used for solving classification and regression tasks. A DT is simply a decision support tool that uses a tree-like graph (or model) of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. More technically, each branch in a DT represents a possible decision, occurrence, or reaction in terms of statistical probability.
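The tree-like structure can be sketched in plain Python as follows. This is a hand-built tree used only to show how a prediction follows branches from the root to a leaf. It does not learn splits from data, and the feature names, thresholds, and labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str                      # predicted class at the end of a path


@dataclass
class Node:
    feature: str                    # feature tested at this node
    threshold: float                # split point for the test
    left: Union["Node", Leaf]       # branch taken when value <= threshold
    right: Union["Node", Leaf]      # branch taken when value > threshold


def classify(tree, sample):
    """Follow one branch per decision node until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.left if sample[tree.feature] <= tree.threshold else tree.right
    return tree.label


# Hypothetical hand-built tree: first test age, then income.
tree = Node("age", 30.0,
            Leaf("decline"),
            Node("income", 50000.0, Leaf("decline"), Leaf("approve")))

print(classify(tree, {"age": 25, "income": 80000}))
print(classify(tree, {"age": 40, "income": 60000}))
```

A learning algorithm would choose each feature and threshold from training data; here they are fixed by hand so that the branch-following logic stays visible.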

Compared to Naive Bayes, DT is a far more robust classification technique. In a typical workflow, the data is first split into training and test sets; the DT trained on the training set then produces a good generalization for inferring the predicted labels or classes. Most interestingly, the DT algorithm can handle both binary and multiclass...

Summary

In this chapter, we discussed some advanced algorithms in ML and found out how to use a simple yet powerful method, Bayesian inference, to build another kind of classification model: multinomial classification algorithms. Moreover, the Naive Bayes algorithm was discussed broadly from theoretical and technical perspectives. Finally, a comparative analysis of the DT and Naive Bayes algorithms was presented, and a few guidelines were provided.

In the next chapter, we will dig even deeper into ML and find out how we can take advantage of ML to cluster records belonging to a dataset of unsupervised observations.

