
Chapter 10. Machine Learning in Hadoop

This chapter is about how to design and architect machine learning applications on the Hadoop platform. It addresses some of the common machine learning challenges that you may face in Hadoop and how to solve them. In this chapter, we will walk through different machine learning libraries and processing engines. This chapter also covers some of the common steps involved in machine learning and elaborates on them further with a case study.

In this chapter, we will cover the following topics:

  • Machine learning steps
  • Common machine learning challenges
  • Spark machine learning
  • Hadoop and R
  • Mahout
  • Case study in Spark

Technical requirements


You will be required to have basic knowledge of Linux and Apache Hadoop 3.0.

Check out the following video to see the code in action: http://bit.ly/2VpRc7N

Machine learning steps


We will walk through the typical machine learning workflow in the following steps:

  1. Gathering data: Well, this is a step you have seen and heard of many times. It is about ingesting data from multiple data sources for your subsequent machine learning steps to use. For machine learning, both the quality and the quantity of data matter. Therefore, this step is crucial.
  2. Preparing the data: In this step, after gathering the data, we load it into a suitable place and prepare it for use in our machine learning processes (see the sketch after this list).
  3. Choosing a model: In this step, you decide which algorithm to choose based on the kind of problem you are trying to solve; that is, whether the problem belongs to classification, regression, or forecasting. The specific algorithm you apply is then typically chosen on a trial-and-tuning basis.
  4. Training: In this step, we actually train our models on bulk data. Here, you first perform data sampling (downsample or upsample...
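
Putting these steps together, the following is a minimal Spark sketch of gathering, preparing, and splitting data (steps 1, 2, and 4). The input path, the null handling, and the 80/20 split ratio are illustrative assumptions, not values prescribed by this chapter:

import org.apache.spark.sql.SparkSession

object MlStepsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ml-steps-sketch")
      .getOrCreate()

    // 1. Gathering data: ingest from a hypothetical HDFS location.
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/ml/input.csv")

    // 2. Preparing the data: drop rows containing nulls (see the data
    //    quality discussion in the next section).
    val clean = raw.na.drop()

    // 4. Training: hold out a test set before fitting any model.
    val Array(train, test) = clean.randomSplit(Array(0.8, 0.2), seed = 42L)
    println(s"train rows: ${train.count()}, test rows: ${test.count()}")

    spark.stop()
  }
}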

Common machine learning challenges


The following are some of the common challenges that you will face while running your machine learning application:

  • Data quality: Data from sources is, most of the time, not suitable for machine learning. It first has to be cleaned and checked for data quality, and it has to be in a format that suits the machine learning processes you want to run. One such example is removing nulls: the popular Random Forest algorithm, for instance, does not support null values.
  • Data scaling: Sometimes, your data comprises attributes that vary widely in magnitude or scale. To prevent machine learning algorithms from being biased toward attributes of larger magnitude, rescaling all attributes to a comparable scale is helpful (see the sketch after this list). This helps machine learning optimization algorithms like gradient descent a great deal. Algorithms that iteratively weigh inputs, like regression and neural networks, or algorithms that are based on distance measures, like k-nearest neighbors...
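
The following is a minimal scaling sketch using Spark ML. The tiny inline DataFrame and its two columns, age and income, are hypothetical stand-ins for attributes on very different scales:

import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("scaling-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: age is in tens, income in tens of thousands.
val clean = Seq((25.0, 30000.0), (40.0, 95000.0), (33.0, 61000.0))
  .toDF("age", "income")

// Assemble the raw columns into a single feature vector...
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income"))
  .setOutputCol("rawFeatures")

// ...then standardize each dimension, which keeps large-magnitude
// attributes such as income from dominating distance-based or
// gradient-based algorithms.
val scaler = new StandardScaler()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setWithMean(true)
  .setWithStd(true)

val assembled = assembler.transform(clean)
val scaled = scaler.fit(assembled).transform(assembled)
scaled.show(truncate = false)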

Spark machine learning


Spark is a distributed in-memory processing engine that runs machine learning algorithms in distributed mode through its abstract APIs. Using the Spark machine learning framework, machine learning algorithms can be applied to large volumes of data, represented as resilient distributed datasets. The Spark machine learning libraries come with a rich set of utilities, components, and tools that let you write distributed, in-memory processing code in an efficient and fault-tolerant manner. The following diagram represents the Spark architecture at a high level:

There are three Java virtual machine (JVM)-based components in Spark: the Driver, the Spark executors, and the Cluster Manager. These are explained as follows, with a small configuration sketch after the list:

  • Driver: The Driver Program runs on a logically or physically segregated node as a separate process and is responsible for launching the Spark application, maintaining all relevant information and configuration about the launched Spark application, and executing the application's DAG...
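
As a small illustration of this split, the following sketch configures a Spark application whose driver builds the DAG while YARN-managed executors run the tasks. The resource values here are illustrative assumptions, not recommendations from this chapter:

import org.apache.spark.sql.SparkSession

// The driver process runs this code and builds the DAG; the cluster
// manager (YARN here) launches the executor JVMs that run the tasks.
val spark = SparkSession.builder()
  .appName("architecture-sketch")
  .master("yarn")                          // delegate scheduling to YARN
  .config("spark.executor.instances", "4") // four executor JVMs
  .config("spark.executor.memory", "4g")
  .getOrCreate()

// Transformations are recorded lazily on the driver as a DAG...
val squares = spark.sparkContext
  .parallelize(1L to 1000000L)
  .map(x => x * x)

// ...and only an action such as sum() ships tasks to the executors.
println(s"sum of squares: ${squares.sum()}")

spark.stop()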

Hadoop and R


R is a data science programming tool for analyzing statistical data with models and translating analytical results into colorful graphics. R is, without a doubt, the most preferred programming tool for statisticians, data scientists, data analysts, and data architects, but it falls short when working with large datasets. One major disadvantage of the R programming language is that all objects are loaded into a single machine's main memory; large, petabyte-size datasets cannot be loaded into RAM. Hadoop, when integrated with the R language, is an ideal solution here. To adapt to this single-machine, in-memory limitation of R, data scientists must restrict their analysis to a sample of the large dataset. When dealing with big data, this limitation of the R programming language is a major obstacle. Since R is not very scalable, only limited data can be processed by the core R engine; its data processing capacity is limited to a single node's memory. This limits...

Mahout


Mahout is Apache's open source machine learning library. Mahout mainly implements, but is not limited to, recommender engines (collaborative filtering), clustering, and classification algorithms. Mahout's objective is to provide highly scalable implementations of the usual machine learning algorithms. If the historical data to be used is so large that it is not possible to process it on a single machine, then Mahout is the tool of choice. With big data becoming an important area of focus, Mahout meets the need for a machine learning tool that can scale beyond a single computer. Mahout differs from other tools, such as R, Weka, and so on, in its emphasis on scalability. The Mahout learning implementations are written in Java, and most, but not all, of them are built on the MapReduce paradigm on top of Apache's distributed Hadoop computation project. Going forward, Mahout is built using a Scala DSL on Apache Spark, and programs written...

Machine learning case study in Spark


In this section, we will look at how to implement text classification using Spark ML and the Naive Bayes algorithm. Text classification is one of the most common use cases in NLP. It can be used to detect email spam, identify retail product hierarchies, and analyze sentiment. The process is typically a classification problem in which we try to identify a specific topic from a large volume of natural language text. A body of text can cover several topics within each group of data, so it is important to classify articles or textual information into logical groups, and text classification techniques help us do this. These techniques require a lot of computing power when the data volume is large, so a distributed computing framework is recommended for text classification. For example, if we want to classify legal documents in a knowledge repository on the internet, text classification techniques can be used...
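
As a warm-up for the case study, here is a minimal, self-contained sketch of text classification with Spark ML and Naive Bayes. The tiny inline dataset, its labels, and the pipeline parameters are hypothetical stand-ins for the large corpus discussed in this section:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("nb-text-sketch").getOrCreate()
import spark.implicits._

// Hypothetical labeled documents: 1.0 = spam, 0.0 = not spam.
val docs = Seq(
  ("win a free prize now", 1.0),
  ("limited offer click here", 1.0),
  ("meeting notes for tomorrow", 0.0),
  ("quarterly report attached", 0.0)
).toDF("text", "label")

// Tokenize, hash terms into a fixed-size feature space, reweight by IDF,
// and classify with multinomial Naive Bayes.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("tf").setNumFeatures(4096)
val idf = new IDF().setInputCol("tf").setOutputCol("features")
val nb = new NaiveBayes().setFeaturesCol("features").setLabelCol("label")

val pipeline = new Pipeline().setStages(Array(tokenizer, tf, idf, nb))

// For brevity we fit and evaluate on the same toy data; a real
// application would hold out a separate test set.
val model = pipeline.fit(docs)
val predictions = model.transform(docs)

val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(predictions)
println(s"training-set accuracy: $accuracy")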

Summary


In this chapter, we studied the different machine learning steps and the common ML challenges that we face. We also covered the Spark ML algorithms and how they can be applied to large volumes of data, represented as resilient distributed datasets. This chapter also covered R, the most preferred programming tool for statisticians, data scientists, data analysts, and data architects, and its single-machine memory limitation. We learned about Mahout, which provides highly scalable implementations of the usual machine learning algorithms, and worked through a text classification case study in Spark.

In the next chapter, we will get an overview of Hadoop on the cloud.
