You're reading from  Hands-On Data Analysis with Scala

Product type: Book
Published in: May 2019
Reading level: Expert
Publisher: Packt
ISBN-13: 9781789346114
Edition: 1st Edition
Author (1)
Rajesh Gupta

Rajesh is a hands-on Big Data Tech Lead and Enterprise Architect with extensive experience in the full life cycle of software development. He has successfully architected, developed, and deployed highly scalable data solutions using the Spark, Scala, and Hadoop technology stack for several enterprises. A passionate, hands-on technologist, Rajesh has master's degrees in Mathematics and Computer Science from BITS, Pilani (India).

Traditional Machine Learning for Data Analysis

This chapter provides an overview of machine learning (ML) techniques for data analysis. In the previous chapters, we explored some of the techniques that human beings can use to analyze and understand data. In this chapter, we look at how ML techniques can be used for similar purposes.

At the heart of ML are a number of algorithms that have proven highly effective at solving specific categories of problems. This chapter covers the following popular ML methods:

  • Decision trees
  • Random forests
  • Ridge and lasso regression
  • k-means cluster analysis

It also covers the role of natural language processing (NLP) in effectively analyzing certain types of data problems. The discussion in this chapter is limited to traditional machine learning methods. It does not cover newer methods such as deep...

ML overview

Let's first look at what ML is. In a traditional sense, in order to solve a computational problem, we typically write explicit computer instructions that solve the problem based on all of the possible scenarios. The assumption here is that all of the rules associated with the specific problem being solved are known and well-defined in advance and could be codified into computer instructions. This assumption, however, is not always true. There are times when the rules are not known in advance and it is impractical to define deterministic rules that could be applied to solve the problem.

Let's look at this problem using a concrete example of an app store where a consumer has the option of buying an app from a fairly large catalog of available apps. When the consumer logs into the app store, it displays a set of recommended apps that the consumer is highly...

Decision trees

As the name suggests, decision trees in ML build a tree-like structure with decision conditions on each branch. Conditions define the flow of the decision-making process. We can also think of decision trees as being similar to flow charts.

Decision trees are supervised ML algorithms, which means that they learn from labeled data. They can be used for classification as well as regression.
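The tree-like structure described above can be sketched in a few lines of Scala. This is a minimal illustration rather than the book's implementation: the names `Tree`, `Leaf`, `Node`, and `predict` are assumptions, and the example tree is built by hand with invented thresholds and labels instead of being learned from labeled data.

```scala
object DecisionTreeSketch {
  // Each internal node tests one numeric feature against a threshold;
  // each leaf holds a predicted label.
  sealed trait Tree
  final case class Leaf(label: String) extends Tree
  final case class Node(feature: Int, threshold: Double, left: Tree, right: Tree) extends Tree

  // Follow the branch conditions from the root down to a leaf.
  @annotation.tailrec
  def predict(tree: Tree, x: Vector[Double]): String = tree match {
    case Leaf(label)      => label
    case Node(f, t, l, r) => if (x(f) <= t) predict(l, x) else predict(r, x)
  }

  // Hypothetical hand-built tree over two features (index 0 and index 1).
  // The thresholds and labels are invented for illustration only.
  val example: Tree =
    Node(0, 170.0,
      Leaf("small"),
      Node(1, 195.0, Leaf("medium"), Leaf("large")))
}
```

Calling `DecisionTreeSketch.predict(DecisionTreeSketch.example, Vector(180.0, 200.0))` walks the right branch twice and ends at the leaf labeled `large`. A training algorithm would choose the features and thresholds from data; that step is not shown here.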

Implementing decision trees

Let's look at a simple example to understand and explore this concept. We have the following observations:

Age in Years   Height in Inches   Weight in Pounds   Gender   Shoe Size
25             180                200                M        12
35             165                190                F        9
20             175                195                M        11
70             170                200...

Random forest

Random forest is an easy-to-use and powerful ML algorithm. It is also a supervised algorithm and requires labeled data to learn from. In fact, the decision tree acts as the building block for the random forest algorithm. Just like the decision tree, the random forest ML algorithm can be used for classification as well as regression.

The fundamental motivation behind the random forest algorithm is to combine results from multiple random decision trees into a single model. One very useful outcome of this combination is that it helps reduce overfitting of the model to the training dataset.
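One common way to combine the results of several trees for classification is a majority vote. The sketch below assumes that approach; it models each tree simply as a function from features to a label, and the names `Classifier` and `majorityVote` are illustrative. How the individual trees are trained on random subsets of the data is not shown.

```scala
object ForestVoteSketch {
  // A "tree" is modelled here simply as a function from features to a label.
  type Classifier = Vector[Double] => String

  // Combine the trees by majority vote over their individual predictions.
  // Ties are broken arbitrarily in this sketch.
  def majorityVote(trees: Seq[Classifier], x: Vector[Double]): String = {
    require(trees.nonEmpty, "need at least one tree")
    trees.map(t => t(x))
      .groupBy(identity)
      .map { case (label, votes) => (label, votes.size) }
      .maxBy(_._2)._1
  }
}
```

For regression, the same idea would typically average the numeric predictions of the trees instead of counting votes.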

Random forest algorithms

The random forest algorithm can be summarized as follows:

  • Each decision tree in a random forest...

Ridge and lasso regression

Ridge and lasso regression are supervised linear regression ML algorithms. Both aim to reduce model complexity and prevent overfitting. When a training dataset has a large number of features or variables, the model built by ML generally tends to be complex.

Characteristics of ridge regression

The key characteristics of ridge regression are as follows:

  • Coefficient shrinkage: This helps in reducing model complexity
  • Regularization: This adds information to prevent overfitting
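Coefficient shrinkage is easiest to see in the simplest possible case. The sketch below assumes a single feature and no intercept, where minimizing the squared error plus a penalty `lambda * w * w` has the closed-form solution `w = sum(x*y) / (sum(x*x) + lambda)`. The object and parameter names are illustrative, not taken from any library.

```scala
object RidgeSketch {
  // Closed-form ridge estimate for one feature and no intercept:
  //   w = sum(x * y) / (sum(x * x) + lambda)
  // A larger lambda pulls the coefficient towards zero.
  def fit(xs: Seq[Double], ys: Seq[Double], lambda: Double): Double = {
    require(xs.length == ys.length && xs.nonEmpty, "need paired, non-empty data")
    require(lambda >= 0.0, "lambda must be non-negative")
    val sxy = xs.zip(ys).map { case (x, y) => x * y }.sum
    val sxx = xs.map(x => x * x).sum
    sxy / (sxx + lambda)
  }
}
```

With `xs = Seq(1.0, 2.0, 3.0)` and `ys = Seq(2.0, 4.0, 6.0)`, a `lambda` of 0 recovers the ordinary least squares coefficient 2.0, while a `lambda` of 14 shrinks it to 1.0.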

Characteristics of lasso regression

Lasso...

k-means cluster analysis

k-means is a clustering ML algorithm. It is an unsupervised ML algorithm. Its primary use is for clustering together closely related data and gaining an understanding of the structural properties of the data.

As the name suggests, this algorithm tries to form k clusters around k mean values. The number of clusters to form, that is, the value of k, is something a human being has to determine at the outset. The algorithm relies on the Euclidean distance to measure how far apart two points are. We can think of each observation as a point in n-dimensional space, where n is the number of features. The distance between two observations is the Euclidean distance between these points in n-dimensional space.
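The distance calculation can be sketched directly in Scala. The function `euclidean` follows the definition above; `nearest` is an assumed helper that shows how the distance might be used to find the closest of several mean values. Both names are illustrative.

```scala
object DistanceSketch {
  // Euclidean distance between two observations with the same n features.
  def euclidean(a: Vector[Double], b: Vector[Double]): Double = {
    require(a.length == b.length, "observations must have the same number of features")
    math.sqrt(a.zip(b).map { case (p, q) => (p - q) * (p - q) }.sum)
  }

  // Index of the mean value closest to x (assumed helper, not from the source).
  def nearest(means: Seq[Vector[Double]], x: Vector[Double]): Int = {
    require(means.nonEmpty, "need at least one mean value")
    means.indices.minBy(i => euclidean(means(i), x))
  }
}
```

For example, `DistanceSketch.euclidean(Vector(0.0, 0.0), Vector(3.0, 4.0))` returns `5.0`.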

To begin with, the algorithm picks k random records from the dataset. These are the initial k mean values. In the next step, for each record...

Natural language processing for data analysis

Natural language processing (NLP) is the ability of a machine to analyze and understand human language. Human language has a very high amount of complexity, which makes parsing and understanding it difficult. There is a great deal of context in spoken and written language. Machines work well with precise rules that are within the confines of good context. With that said, it is still possible to gain an insight into text analysis using NLP techniques. An excellent example of this is Twitter sentiment analysis. Based on the contents of tweets, using NLP, it is possible to determine whether the sentiments of the people are generally positive or negative as a group. Another great example is the successful application of NLP techniques in analyzing customer reviews of a product or service.
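As a very rough illustration of group-level sentiment, the sketch below counts positive and negative words in each text and sums the scores. It is a toy under stated assumptions: the word lists are tiny and hypothetical, the names `score` and `groupScore` are invented here, and the approach ignores context such as negation or sarcasm, which is exactly the complexity described above.

```scala
object SentimentSketch {
  // Tiny hypothetical word lists; a real lexicon would be far larger.
  val positive: Set[String] = Set("good", "great", "love", "excellent", "happy")
  val negative: Set[String] = Set("bad", "poor", "hate", "terrible", "sad")

  // Lowercase the text and split on anything that is not a letter or apostrophe.
  def tokens(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z']+").filter(_.nonEmpty).toSeq

  // Score of one text: positive-word count minus negative-word count.
  def score(text: String): Int = {
    val ts = tokens(text)
    ts.count(positive) - ts.count(negative)
  }

  // Overall score of a group of texts: above zero leans positive, below zero negative.
  def groupScore(texts: Seq[String]): Int = texts.map(score).sum
}
```

Here `SentimentSketch.score("I love this great app")` evaluates to 2, because two words appear in the positive list and none in the negative list.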

The ML algorithms explored so far in this chapter...

Algorithm selections

Each ML algorithm has its own strengths and weaknesses. Selecting an appropriate ML algorithm and tuning the model requires a fair amount of experience working with these algorithms. However, the following factors also play a significant role in applying these techniques effectively:

  • Asking the right question: A great deal of effort is generally required in formulating the right question.
  • Understanding the business domain: Having a good understanding of the relevant business domain and context is equally important to build good models.
  • Understanding data: Ultimately, the data is used to train the model. If the data is not understood correctly or the data quality is poor, the built model is unlikely to be effective.

All of the aspects outlined above are somewhat interdependent, and a mastery of all of these is a prerequisite to selecting the appropriate...

Summary

In this chapter, we learned about ML and some of the most popular ML algorithms. The primary goal of ML is to build an analytical model using historical data without much human intervention. ML algorithms can be divided into two categories, namely, supervised learning and unsupervised learning. Supervised learning algorithms rely on labeled data to build models, whereas unsupervised learning algorithms use data that is not labeled. We looked at the k-means cluster analysis algorithm, which is an unsupervised ML algorithm. Of the supervised ML algorithms, we explored decision trees, random forests, and ridge/lasso regression. We also got an overview of using NLP for performing text data analysis.

In the next chapter, we will examine the processing of data in real time and perform data analysis as the data becomes available.
