
Chapter 10. Machine Learning in Hadoop

This chapter is about how to design and architect machine learning applications on the Hadoop platform. It addresses some of the common machine learning challenges that you may face in Hadoop and how to solve them. In this chapter, we will walk through different machine learning libraries and processing engines. This chapter also covers some of the common steps involved in machine learning and elaborates on them further with a case study.

In this chapter, we will cover the following topics:

  • Machine learning steps
  • Common machine learning challenges
  • Spark machine learning
  • Hadoop and R
  • Mahout
  • Case study in Spark

Technical requirements


You will be required to have basic knowledge of Linux and Apache Hadoop 3.0.

Check out the following video to see the code in action: http://bit.ly/2VpRc7N

Machine learning steps


We will walk through the typical machine learning workflow in the following steps:

  1. Gathering data: Well, this is a step you have seen and heard of many times. It is about ingesting data from multiple data sources for your subsequent machine learning steps to use. For machine learning, both the quality and the quantity of data matter. Therefore, this step is crucial.
  2. Preparing the data: In this step, after gathering the data, we load it into a suitable place and prepare it for use in our machine learning processes (see the sketch after this list).
  3. Choosing a model: In this step, you decide which algorithm to choose based on the kind of problem you are trying to solve; that is, whether the problem belongs to classification, regression, or forecasting. The specific algorithm you apply is then typically chosen on a trial-and-tuning basis.
  4. Training: In this step, we actually train our models on bulk data. Here, you first perform data sampling (downsample or upsample...
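
Putting these steps together, the following is a minimal Spark sketch of gathering, preparing, and splitting data (steps 1, 2, and 4). The input path, the null handling, and the 80/20 split ratio are illustrative assumptions, not values prescribed by this chapter:

import org.apache.spark.sql.SparkSession

object MlStepsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ml-steps-sketch")
      .getOrCreate()

    // 1. Gathering data: ingest from a hypothetical HDFS location.
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/ml/input.csv")

    // 2. Preparing the data: drop rows containing nulls (see the data
    //    quality discussion in the next section).
    val clean = raw.na.drop()

    // 4. Training: hold out a test set before fitting any model.
    val Array(train, test) = clean.randomSplit(Array(0.8, 0.2), seed = 42L)
    println(s"train rows: ${train.count()}, test rows: ${test.count()}")

    spark.stop()
  }
}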

Common machine learning challenges


The following are some of the common challenges that you will face while running your machine learning application:

  • Data quality: Data from sources is, most of the time, not suitable for machine learning. It first has to be cleaned and checked for data quality, and it has to be in a format that suits the machine learning processes you want to run. One such example is removing nulls: the popular Random Forest algorithm, for instance, does not support null values.
  • Data scaling: Sometimes, your data comprises attributes that vary widely in magnitude or scale. To prevent machine learning algorithms from being biased toward attributes of larger magnitude, rescaling all attributes to a comparable scale is helpful (see the sketch after this list). This helps machine learning optimization algorithms like gradient descent a great deal. Algorithms that iteratively weigh inputs, like regression and neural networks, or algorithms that are based on distance measures, like k-nearest neighbors...
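
The following is a minimal scaling sketch using Spark ML. The tiny inline DataFrame and its two columns, age and income, are hypothetical stand-ins for attributes on very different scales:

import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("scaling-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: age is in tens, income in tens of thousands.
val clean = Seq((25.0, 30000.0), (40.0, 95000.0), (33.0, 61000.0))
  .toDF("age", "income")

// Assemble the raw columns into a single feature vector...
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income"))
  .setOutputCol("rawFeatures")

// ...then standardize each dimension, which keeps large-magnitude
// attributes such as income from dominating distance-based or
// gradient-based algorithms.
val scaler = new StandardScaler()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setWithMean(true)
  .setWithStd(true)

val assembled = assembler.transform(clean)
val scaled = scaler.fit(assembled).transform(assembled)
scaled.show(truncate = false)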

Spark machine learning


Spark is a distributed in-memory processing engine that runs machine learning algorithms in distributed mode through its abstract APIs. Using the Spark machine learning framework, machine learning algorithms can be applied to large volumes of data, represented as resilient distributed datasets. The Spark machine learning libraries come with a rich set of utilities, components, and tools that let you write distributed, in-memory processing code in an efficient and fault-tolerant manner. The following diagram represents the Spark architecture at a high level:

There are three Java virtual machine (JVM)-based components in Spark: the Driver, the Spark executors, and the Cluster Manager. These are explained as follows, with a small configuration sketch after the list:

  • Driver: The Driver Program runs on a logically or physically segregated node as a separate process and is responsible for launching the Spark application, maintaining all relevant information and configuration about the launched Spark application, and executing the application's DAG...
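
As a small illustration of this split, the following sketch configures a Spark application whose driver builds the DAG while YARN-managed executors run the tasks. The resource values here are illustrative assumptions, not recommendations from this chapter:

import org.apache.spark.sql.SparkSession

// The driver process runs this code and builds the DAG; the cluster
// manager (YARN here) launches the executor JVMs that run the tasks.
val spark = SparkSession.builder()
  .appName("architecture-sketch")
  .master("yarn")                          // delegate scheduling to YARN
  .config("spark.executor.instances", "4") // four executor JVMs
  .config("spark.executor.memory", "4g")
  .getOrCreate()

// Transformations are recorded lazily on the driver as a DAG...
val squares = spark.sparkContext
  .parallelize(1L to 1000000L)
  .map(x => x * x)

// ...and only an action such as sum() ships tasks to the executors.
println(s"sum of squares: ${squares.sum()}")

spark.stop()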

Hadoop and R


R is a data science programming tool for analyzing statistical data with models and translating analytical results into colorful graphics. R is, without a doubt, the most preferred programming tool for statisticians, data scientists, data analysts, and data architects, but it falls short when working with large datasets. One major disadvantage of the R programming language is that all objects are loaded into a single machine's main memory; large, petabyte-size datasets cannot be loaded into RAM. Hadoop, when integrated with the R language, is an ideal solution here. To adapt to this single-machine, in-memory limitation of R, data scientists must restrict their analysis to a sample of the large dataset. When dealing with big data, this limitation of the R programming language is a major obstacle. Since R is not very scalable, only limited data can be processed by the core R engine; its data processing capacity is limited to a single node's memory. This limits...

Mahout


Mahout is Apache's open source machine learning library. Mahout mainly implements, but is not limited to, recommender engines (collaborative filtering), clustering, and classification algorithms. Mahout's objective is to provide highly scalable implementations of the usual machine learning algorithms. If the historical data to be used is so large that it is not possible to process it on a single machine, then Mahout is the tool of choice. With big data becoming an important area of focus, Mahout meets the need for a machine learning tool that can scale beyond a single computer. Mahout differs from other tools, such as R, Weka, and so on, in its emphasis on scalability. The Mahout learning implementations are written in Java, and most, but not all, of them are built on the MapReduce paradigm on top of Apache's distributed Hadoop computation project. Going forward, Mahout is built using a Scala DSL on Apache Spark, and programs written...

Machine learning case study in Spark


In this section, we will look at how to implement text classification using Spark ML and the Naive Bayes algorithm. Text classification is one of the most common use cases in NLP. It can be used to detect email spam, identify retail product hierarchies, and analyze sentiment. The process is typically a classification problem in which we try to identify a specific topic from a large volume of natural language text. A body of text can cover several topics within each group of data, so it is important to classify articles or textual information into logical groups, and text classification techniques help us do this. These techniques require a lot of computing power when the data volume is large, so a distributed computing framework is recommended for text classification. For example, if we want to classify legal documents in a knowledge repository on the internet, text classification techniques can be used...
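
As a warm-up for the case study, here is a minimal, self-contained sketch of text classification with Spark ML and Naive Bayes. The tiny inline dataset, its labels, and the pipeline parameters are hypothetical stand-ins for the large corpus discussed in this section:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("nb-text-sketch").getOrCreate()
import spark.implicits._

// Hypothetical labeled documents: 1.0 = spam, 0.0 = not spam.
val docs = Seq(
  ("win a free prize now", 1.0),
  ("limited offer click here", 1.0),
  ("meeting notes for tomorrow", 0.0),
  ("quarterly report attached", 0.0)
).toDF("text", "label")

// Tokenize, hash terms into a fixed-size feature space, reweight by IDF,
// and classify with multinomial Naive Bayes.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("tf").setNumFeatures(4096)
val idf = new IDF().setInputCol("tf").setOutputCol("features")
val nb = new NaiveBayes().setFeaturesCol("features").setLabelCol("label")

val pipeline = new Pipeline().setStages(Array(tokenizer, tf, idf, nb))

// For brevity we fit and evaluate on the same toy data; a real
// application would hold out a separate test set.
val model = pipeline.fit(docs)
val predictions = model.transform(docs)

val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(predictions)
println(s"training-set accuracy: $accuracy")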

Summary


In this chapter, we studied the different machine learning steps and the common ML challenges that we face. We also covered the Spark ML algorithms and how they can be applied to large volumes of data, represented as resilient distributed datasets. This chapter also covered R, the most preferred programming tool for statisticians, data scientists, data analysts, and data architects, and its single-machine memory limitation. We learned about Mahout, which provides highly scalable implementations of the usual machine learning algorithms, and worked through a text classification case study in Spark.

In the next chapter, we will get an overview of Hadoop on the cloud.
