You're reading from  Scala and Spark for Big Data Analytics

Product type: Book
Published in: Jul 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781785280849
Edition: 1st Edition
Authors (2):
Md. Rezaul Karim

Md. Rezaul Karim is a researcher, author, and data science enthusiast with a strong computer science background, coupled with 10 years of research and development experience in machine learning, deep learning, and data mining algorithms to solve emerging bioinformatics research problems by making them explainable. He is passionate about applied machine learning, knowledge graphs, and explainable artificial intelligence (XAI). Currently, he is working as a research scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Germany. Before joining FIT, he worked as a researcher at the Insight Centre for Data Analytics, Ireland. Previously, he worked as a lead software engineer at Samsung Electronics, Korea.

Sridhar Alla

Sridhar Alla is the co-founder and CTO of Blue Whale Consulting and is an expert at helping companies (big and small) define their vision for systems and capabilities that will allow them to establish a strategic execution plan to deal with the ever-growing data collected to support analytics and product teams. He is very experienced at dealing with all aspects of data collection, security, governance, and processing as part of end-to-end big data analytics and machine learning initiatives (including predictive modeling, deep learning, and ML automation). Sridhar is a published book author and an avid presenter at numerous conferences, including Strata, Hadoop World, and Spark Summit. He also has several patents filed with the US PTO on large-scale computing and distributed systems. He has over 18 years' experience writing code in Scala, Java, C, C++, Python, R, and Go, and has extensive hands-on knowledge of Spark, Flink, TensorFlow, Keras, Hadoop, Cassandra, HBase, MongoDB, Riak, Redis, Zeppelin, Mesos, Docker, Kafka, ElasticSearch, Solr, H2O, machine learning, text analytics, distributed computing, and high-performance computing. Sridhar lives with his wife and daughter in New Jersey and in his spare time loves blogging and coaching organizations on next-generation advancements in technology and their alignment with business goals.

Learning Machine Learning - Spark MLlib and Spark ML

"Each of us, actually every animal, is a data scientist. We collect data from our sensors, and then we process the data to get abstract rules to perceive our environment and control our actions in that environment to minimize pain and/or maximize pleasure. We have memory to store those rules in our brains, and then we recall and use them when needed. Learning is lifelong; we forget rules when they no longer apply or revise them when the environment changes."

- Ethem Alpaydin, Machine Learning: The New AI

The purpose of this chapter is to provide a conceptual introduction to statistical machine learning (ML) techniques for those who might not normally be exposed to such approaches during their typical required statistical training. This chapter also aims to take a newcomer from having minimal knowledge of machine learning...

Introduction to machine learning

In this section, we will try to define machine learning from computer science, statistics, and data analytics perspectives. Machine learning (ML) is the branch of computer science that gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). The field evolved from the study of pattern recognition and computational learning theory in artificial intelligence.

More specifically, ML explores the study and construction of algorithms that can learn from heuristics and make predictions on data. Such algorithms overcome strictly static program instructions by making data-driven predictions or decisions, building a model from sample inputs. Now let's look at a more explicit and versatile definition from Prof. Tom M. Mitchell, who explained what machine learning really means...

Spark machine learning APIs

In this section, we will describe two key concepts introduced by the Spark machine learning libraries (Spark MLlib and Spark ML) and the most widely used implemented algorithms that align with the supervised and unsupervised learning techniques we discussed in the previous sections.

Spark machine learning libraries

As already stated, in the pre-Spark era, big data modelers typically built their ML models using statistical languages such as R, STATA, and SAS. However, this kind of workflow (that is, the execution flow of these ML algorithms) lacks efficiency, scalability, throughput, and accuracy, and of course suffers from extended execution times.

Then, data engineers used to reimplement...

Feature extraction and transformation

Suppose you are going to build a machine learning model that will predict whether a credit card transaction is fraudulent or not. Based on the available background knowledge and data analysis, you might decide which data fields (also known as features) are important for training your model. For example, the amount, customer name, buying company name, and address of the credit card owner are worth providing to the learning process. This choice matters: a randomly generated transaction ID, for instance, carries no information and so would not be useful at all. Once you have decided which features to include in your training set, you then need to transform those features so that the model learns better. The feature transformations help you add additional background information to the training...
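The selection and transformation steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Spark API and not the book's own code: the field names (`transaction_id`, `amount`, `company`), the sample values, the log scaling of the amount, and the helper `build_feature_vectors` are all hypothetical. The one-hot encoding of the company name stands in for the kind of transformation the chapter goes on to cover.

```python
import math

# Hypothetical transactions; the values are invented for illustration only.
transactions = [
    {"transaction_id": "a91f", "amount": 25.0, "company": "grocer"},
    {"transaction_id": "07c2", "amount": 1200.0, "company": "electronics"},
    {"transaction_id": "e55d", "amount": 60.0, "company": "grocer"},
]


def build_feature_vectors(rows, categorical_field, numeric_field):
    """Builds one numeric vector per row from the two selected fields.

    The transaction ID is never read: it is random, so it carries no signal.
    """
    # Fix a stable ordering of the categories seen in the data.
    categories = sorted({row[categorical_field] for row in rows})
    vectors = []
    for row in rows:
        # One-hot encoding: 1.0 in the slot of the row's category, 0.0 elsewhere.
        one_hot = [1.0 if row[categorical_field] == c else 0.0 for c in categories]
        # log1p compresses the wide range of amounts (an assumed, optional choice).
        vectors.append([math.log1p(row[numeric_field])] + one_hot)
    return categories, vectors


categories, vectors = build_feature_vectors(transactions, "company", "amount")
for row, vector in zip(transactions, vectors):
    print(row["company"], vector)
```

Each output vector has one slot for the scaled amount followed by one slot per distinct company, so the model receives only numeric inputs.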

Creating a simple pipeline

Spark provides pipeline APIs under Spark ML. A pipeline comprises a sequence of stages, and there are two basic types of pipeline stage, the transformer and the estimator:

  • A transformer takes a dataset as an input and produces an augmented dataset as the output so that the output can be fed to the next step. For example, Tokenizer and HashingTF are two transformers. Tokenizer transforms a dataset with text into a dataset with tokenized words. A HashingTF, on the other hand, produces the term frequencies. Tokenization and HashingTF are commonly used in text mining and text analytics.
  • By contrast, an estimator must first be fit on the input dataset to produce a model. In this case, the model itself will be used as the transformer for transforming the input dataset into the augmented output dataset...
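The difference between the two stage types can be sketched in plain Python, without Spark. Everything here is a hypothetical stand-in rather than the Spark ML API: `SimpleTokenizer` and `SimpleHashingTF` only mimic what Tokenizer and HashingTF are described as doing, `MajorityLabelEstimator` is a deliberately trivial estimator, and `fit_pipeline` is an assumed helper that shows how an estimator is replaced by the model it produces.

```python
import zlib


class SimpleTokenizer:
    """Transformer: adds a 'words' column with lowercase whitespace tokens."""

    def transform(self, rows):
        return [{**row, "words": row["text"].lower().split()} for row in rows]


class SimpleHashingTF:
    """Transformer: hashes each word into a bucket and counts occurrences."""

    def __init__(self, num_features=16):
        self.num_features = num_features

    def transform(self, rows):
        out = []
        for row in rows:
            counts = [0] * self.num_features
            for word in row["words"]:
                # crc32 is stable across runs, unlike Python's built-in hash().
                counts[zlib.crc32(word.encode("utf-8")) % self.num_features] += 1
            out.append({**row, "features": counts})
        return out


class MajorityLabelModel:
    """Model returned by the estimator; it is itself a transformer."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [{**row, "prediction": self.label} for row in rows]


class MajorityLabelEstimator:
    """Estimator: fit() learns from the data and returns a model."""

    def fit(self, rows):
        labels = [row["label"] for row in rows]
        return MajorityLabelModel(max(set(labels), key=labels.count))


def fit_pipeline(stages, rows):
    """Runs stages in order; each estimator is fit and replaced by its model."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(rows)
        rows = stage.transform(rows)
        fitted.append(stage)
    return fitted, rows


# Toy training data, invented for illustration.
training = [
    {"text": "Spark is fast", "label": 1},
    {"text": "slow batch job", "label": 0},
    {"text": "Spark ML pipeline", "label": 1},
]
stages = [SimpleTokenizer(), SimpleHashingTF(num_features=16), MajorityLabelEstimator()]
fitted_stages, result = fit_pipeline(stages, training)
```

After fitting, every entry in `fitted_stages` is a transformer, so the same list can be applied to new rows by calling `transform` on each stage in turn.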

Unsupervised machine learning

In this section, to keep the discussion concrete, we discuss only dimensionality reduction using PCA and LDA for topic modeling, applied to text clustering. Other algorithms for unsupervised learning will be discussed in Chapter 13, My Name is Bayes, Naive Bayes, with some practical examples.

Dimensionality reduction

Dimensionality reduction is the process of reducing the number of variables under consideration. It can be used to extract latent features from raw and noisy features or to compress data while maintaining the structure. Spark MLlib provides support for dimensionality reduction on the RowMatrix class. The most commonly used algorithms for reducing the dimensionality of data are...
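To make the idea of compressing data while keeping its structure concrete, here is a minimal plain-Python sketch of PCA restricted to the first principal component. It does not use Spark's RowMatrix; the power-iteration approach, the function names, the starting vector, and the sample points are all assumptions chosen to keep the example dependency-free.

```python
import math


def first_principal_component(rows, iterations=100):
    """Returns (column means, unit vector along the direction of greatest variance)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Assumed starting vector; power iteration fails only if the start is
    # exactly orthogonal to the leading eigenvector of the covariance matrix.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return means, v


def project(rows, means, component):
    """Reduces each row to a single coordinate along the component."""
    return [
        sum((r[j] - means[j]) * component[j] for j in range(len(component)))
        for r in rows
    ]


# Invented 2D points lying close to the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
means, component = first_principal_component(points)
scores = project(points, means, component)
print(component, scores)
```

Because the points lie near a line, the single score per point retains most of the variation of the original two columns, which is the sense in which the structure is maintained.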

Binary and multiclass classification

Binary classifiers are used to separate the elements of a given dataset into one of two possible groups (for example, fraud or not fraud) and are a special case of multiclass classification. Most binary classification metrics can be generalized to multiclass classification metrics. Multiclass classification describes a classification problem in which there are M>2 possible labels for each data point (the case where M=2 is the binary classification problem).

For multiclass metrics, the notion of positives and negatives is slightly different. Predictions and labels can still be positive or negative, but they must be considered in the context of a particular class. Each label and prediction takes on the value of one of the multiple classes and so they are said to be positive for their particular class and negative for all other classes. So...
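The per-class view of positives and negatives can be illustrated with a short plain-Python sketch. It is not Spark's metrics API: the function name `per_class_precision_recall`, the sample labels, and the convention of returning 0.0 when a denominator is zero are assumptions made for the example.

```python
from collections import Counter


def per_class_precision_recall(labels, predictions):
    """Treats each class in turn as 'positive' and all other classes as 'negative'."""
    classes = sorted(set(labels) | set(predictions))
    tp, fp, fn = Counter(), Counter(), Counter()
    for y, p in zip(labels, predictions):
        if y == p:
            tp[y] += 1  # positive for class y, correctly predicted
        else:
            fp[p] += 1  # predicted class p, but the true class differs
            fn[y] += 1  # true class y was missed
    metrics = {}
    for c in classes:
        predicted_positive = tp[c] + fp[c]
        actual_positive = tp[c] + fn[c]
        # Assumed convention: report 0.0 when a denominator is zero.
        precision = tp[c] / predicted_positive if predicted_positive else 0.0
        recall = tp[c] / actual_positive if actual_positive else 0.0
        metrics[c] = (precision, recall)
    return metrics


# Invented labels and predictions for three classes.
labels = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]
metrics = per_class_precision_recall(labels, predictions)
for c, (precision, recall) in metrics.items():
    print(c, round(precision, 3), round(recall, 3))
```

In this toy data, the second point is a false negative for class 0 and at the same time a false positive for class 1, which shows why each prediction must be judged in the context of a particular class.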

Summary

In this chapter, we had a brief introduction to the topic and got a grasp of simple, yet powerful and common ML techniques. Finally, you saw how to build your own predictive model using Spark. You learned how to build a classification model, how to use the model to make predictions, and finally, how to use common ML techniques such as dimensionality reduction and One-Hot Encoding.

In the later sections, you saw how to apply the regression technique to high-dimensional datasets. Then, you saw how to apply a binary and multiclass classification algorithm for predictive analytics. Finally, you saw how to achieve outstanding classification accuracy using a random forest algorithm. However, we have other topics in machine learning that need to be covered too, for example, recommendation systems and model tuning for even more stable performance before you finally deploy the...
