You're reading from  Practical Big Data Analytics

Product type: Book
Published in: Jan 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783554393
Edition: 1st Edition
Author: Nataraj Dasgupta

Nataraj Dasgupta is the vice president of advanced analytics at RxDataScience Inc. Nataraj has been in the IT industry for more than 19 years, and has worked in the technical and analytics divisions of Philip Morris, IBM, UBS Investment Bank, and Purdue Pharma. At Purdue Pharma, Nataraj led the data science division, where he developed the company's award-winning big data and machine learning platform. Prior to Purdue, at UBS, he held the role of Associate Director, working with high-frequency and algorithmic trading technologies in the foreign exchange trading division of the bank.

Chapter 7. An Introduction to Machine Learning Concepts

Machine learning has become a commonplace topic in our day-to-day lives. Advances in the field have been so dramatic that even cell phones now incorporate machine learning and artificial intelligence features capable of responding to, and acting on, human instructions.

A subject that was once limited to university classrooms has transformed into a full-fledged industry, pervading our daily lives in ways we could not have envisioned even just a few years ago.

The aim of this chapter is to introduce the reader to the underpinnings of machine learning and explain the concepts in simple, lucid terms that will help readers become familiar with the core ideas in the subject. We'll start off with a high-level overview of machine learning, and explain the different categories and how to distinguish them. We'll explain some of the salient concepts in machine learning, such as data pre-processing, feature engineering...

What is machine learning?


Machine learning is not a new subject; it has existed in academia as a formal discipline for well over 70 years, although under different names: first statistics (and, more generally, mathematics), then artificial intelligence (AI), and today machine learning. While the related subject areas of statistics and AI are just as prevalent, machine learning has carved out a separate niche and become an independent discipline in its own right.

In simple terms, machine learning involves predicting future events based on historical data. We see it manifested in our day-to-day lives and indeed we employ, knowingly or otherwise, principles of machine learning on a daily basis.

When we casually comment on whether a movie will succeed at the box office using our understanding of the popularity of the individuals in the lead roles, we are applying machine learning, albeit subconsciously. Our understanding of the characters in the lead roles has been shaped over years of watching...

Factors that led to the success of machine learning


Given that machine learning, as a subject, has existed for many decades, it raises the question: why did it not become as popular as it is today much sooner? Indeed, the theories behind complex machine learning algorithms such as neural networks were well known by the late 1990s, and the theoretical foundation had been established well before that.

The success of machine learning can be attributed to a few factors:

  • The Internet: The web played a critical role in democratizing information and connecting people in an unprecedented way. It made the exchange of information simple in a way that could not have been achieved through the pre-existing methods of print media communication. Not only did the web transform and revolutionize the dissemination of information, it also opened up new opportunities. Google's PageRank, as mentioned earlier, was one of the first large-scale and highly visible successes in the application of statistical...

Machine learning, statistics, and AI


Machine learning is a term that has various synonyms: names that are either the result of marketing activities by corporations or simply terms that have been used interchangeably. Although some may argue that they have different implications, they all ultimately refer to machine learning as a subject that facilitates the prediction of future events using historical information.

The commonly heard terms for machine learning include predictive analysis, predictive analytics, predictive modeling, and many others. As such, unless the entity publishing the material explains its interpretation of the term and, more specifically, how it differs, it is generally safe to assume that it is referring to machine learning. This is often a source of confusion among those new to the subject, largely due to the misuse and overuse of technical verbiage.

Statistics, on the other hand, is a distinct subject area that has been well known for over 200 years. The word...

Categories of machine learning


Arthur Samuel coined the term machine learning in 1959 while at IBM. A popular definition of machine learning is attributed to Samuel, who, it is believed, called machine learning a field of computer science that gives computers the ability to learn without being explicitly programmed.

Tom Mitchell, in 1998, gave a more specific definition of machine learning, calling it the study of algorithms that improve their performance P at some task T with experience E.

A simple explanation would help to illustrate this concept. By now, most of us are familiar with the concept of spam in emails. Most email accounts also contain a separate folder known as Junk, Spam, or a related term. A cursory check of the folders will usually indicate the presence of several emails, many of which were presumably unsolicited and contain meaningless information.
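Mitchell's definition can be made concrete with a small simulation. The sketch below is only an illustration under invented assumptions: each email is reduced to a single made-up feature (a count of suspicious words), the task T is flagging spam, the experience E is the number of labelled emails seen, and the performance P is accuracy on unseen emails. The function names are hypothetical and the "learner" is a deliberately simple threshold rule, not a real spam filter.

```r
set.seed(42)

# Hypothetical data: one feature (count of suspicious words) per email
make_emails <- function(n) {
  spam  <- rep(c(0, 1), length.out = n)               # balanced labels
  words <- rpois(n, lambda = ifelse(spam == 1, 6, 2)) # spam has more such words
  data.frame(words = words, spam = spam)
}

# "Learning": place the threshold halfway between the two class means
learn_threshold <- function(train) {
  mean(tapply(train$words, train$spam, mean))
}

# Performance P: share of unseen emails that are classified correctly
accuracy <- function(threshold, test) {
  mean((test$words > threshold) == (test$spam == 1))
}

test_set    <- make_emails(5000)
experience  <- c(10, 100, 1000)   # E: number of labelled emails available
performance <- sapply(experience, function(n) {
  accuracy(learn_threshold(make_emails(n)), test_set)
})

print(data.frame(experience, performance))
```

With more labelled emails the estimated threshold becomes more stable, so the measured accuracy depends less on the luck of a small sample.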

The mere task of categorizing emails as spam and moving them to a folder involves the application of machine learning. Andrew Ng highlighted...

Subdividing supervised machine learning


Supervised machine learning can be further subdivided into exercises that involve either of the following:

  • Classification
  • Regression

The concepts are quite straightforward.

Classification involves a machine learning task that has a discrete, or categorical, outcome. A categorical variable takes one of a fixed set of labels, such as a type of fruit, a species of tree, a color, or true/false.

The outcome variables in classification exercises are also known as discrete or categorical variables.

Some examples include:

  • Identifying the fruit given size, weight, and shape
  • Identifying numbers given a set of images of numbers (as shown in the earlier chapter)
  • Identifying objects on the streets
  • Identifying playing cards as diamonds, spades, hearts and clubs
  • Identifying the class rank of a student based on the student's grade

The last example might not seem obvious, but a rank (1st, 2nd, 3rd, and so on) denotes a fixed category: a student could rank 1st or 2nd, but could not have a rank of 1.5.
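The difference between a categorical and a continuous outcome can be sketched in R with the built-in mtcars dataset. The choice of columns is only illustrative: am (transmission type) is treated as the categorical outcome and mpg as the continuous one, and glm and lm stand in for whichever algorithm would actually be used.

```r
data(mtcars)
cars <- mtcars

# Classification: the outcome is a category (automatic vs. manual transmission)
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))
clf <- glm(am ~ wt + hp, data = cars, family = binomial)
pred_class <- ifelse(predict(clf, type = "response") > 0.5, "manual", "automatic")

# Regression: the outcome is a continuous number (miles per gallon)
reg <- lm(mpg ~ wt + hp, data = cars)
pred_mpg <- predict(reg)

# A class rank is categorical too: an ordered factor has no value between levels
class_rank <- factor(c("2nd", "1st", "3rd"),
                     levels = c("1st", "2nd", "3rd"), ordered = TRUE)

print(table(predicted = pred_class, actual = cars$am))
print(head(round(pred_mpg, 1)))
print(class_rank)
```

The classifier returns labels, the regression returns numbers, and the ordered factor can be compared (1st comes before 2nd) but cannot hold an in-between value.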

Images of some atypical...

Common terminologies in machine learning


In machine learning, you'll often hear the terms features, predictors, and independent variables. They are all one and the same: they refer to the variables that are used to predict an outcome. The outcome, in turn, is also called the dependent variable. In our previous example of cars, the variables cyl (Cylinder), hp (Horsepower), wt (Weight), and gear (Gear) are the predictors and mpg (Miles Per Gallon) is the outcome.
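In R, this split between predictors and outcome maps directly onto a model formula. The following is a minimal sketch using the built-in mtcars dataset and the column names from the cars example; the linear model is only a placeholder for whatever algorithm is eventually chosen.

```r
data(mtcars)

predictors <- c("cyl", "hp", "wt", "gear")   # features / predictors
outcome    <- "mpg"                          # the variable being predicted

# Build the formula mpg ~ cyl + hp + wt + gear from the two groups of names
model_formula <- reformulate(predictors, response = outcome)
print(model_formula)

X <- mtcars[, predictors]   # what we predict from
y <- mtcars[[outcome]]      # what we predict

fit <- lm(model_formula, data = mtcars)
print(coef(fit))
```

Everything to the right of the tilde is a predictor, and the single name to its left is the outcome.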

In simpler terms, taking the example of a spreadsheet, the names of the columns used as inputs are, in essence, the features, predictors, or independent variables. As an example, if we were given a dataset of toll booth charges and were tasked with predicting the amount charged based on the time of day and other factors, a hypothetical example could be as follows:

In this spreadsheet, the columns date, time, agency, type, prepaid, and rate are the features or predictors, whereas, the column amount is our outcome or dependent variable (what we are predicting).

The value of amount depends on the value of...

The core concepts in machine learning


There are many important concepts in machine learning; we'll go over some of the more common topics. Machine learning involves a multi-step process that starts with data acquisition and data mining and eventually leads to building predictive models.

The key aspects of the model-building process involve:

  • Data pre-processing: Pre-processing and feature selection (for example, centering and scaling, class imbalances, and variable importance)
  • Train, test splits and cross-validation:
    • Creating the training set (say, 80 percent of the data)
    • Creating the test set (say, the remaining 20 percent of the data)
    • Performing cross-validation
  • Create model, get predictions:
    • Which algorithms should you try?
    • What accuracy measures are you trying to optimize?
    • What tuning parameters should you use?
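The steps listed above can be sketched end to end in base R. This is an illustration under stated assumptions rather than a recommended pipeline: it uses the built-in mtcars dataset, a plain linear model, five folds, and root mean squared error (RMSE) as the accuracy measure, all of which are choices made only for this example.

```r
set.seed(100)
data(mtcars)

# 1. Train/test split: roughly 80 percent of the rows for training
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.8 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# 2. Pre-processing: center and scale predictors using training statistics only
predictors <- c("cyl", "hp", "wt", "gear")
centers <- colMeans(train[, predictors])
scales  <- apply(train[, predictors], 2, sd)
train[, predictors] <- scale(train[, predictors], center = centers, scale = scales)
test[, predictors]  <- scale(test[, predictors],  center = centers, scale = scales)

# 3. Cross-validation: 5 folds on the training set, scored by RMSE
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(train)))
cv_rmse <- sapply(seq_len(k), function(i) {
  fit <- lm(mpg ~ cyl + hp + wt + gear, data = train[fold != i, ])
  err <- train$mpg[fold == i] - predict(fit, newdata = train[fold == i, ])
  sqrt(mean(err^2))
})

# 4. Final model on all training rows, evaluated once on the held-out test set
final <- lm(mpg ~ cyl + hp + wt + gear, data = train)
test_rmse <- sqrt(mean((test$mpg - predict(final, newdata = test))^2))

cat("CV RMSE per fold:", round(cv_rmse, 2), "\n")
cat("Test RMSE:", round(test_rmse, 2), "\n")
```

The test set is scaled with the training means and standard deviations so that no information from the held-out rows leaks into the model.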

Data management steps in machine learning

Pre-processing, or more generally processing the data, is an integral part of most machine learning exercises. A dataset that you start out with is seldom going...

Leveraging multicore processing in the model


The exercise in the previous section is repeated here using the PimaIndiansDiabetes2 dataset instead. This dataset contains several missing values. As a result, we will first impute the missing values and then run the machine learning example.
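How the missing values are imputed is not shown in this excerpt. As one simple possibility, median imputation can be sketched in base R on a small made-up data frame. The column names merely mimic a diabetes dataset, the values are invented for illustration, and impute_median is a hypothetical helper.

```r
# Hypothetical toy data with missing values (NA); the values are invented
toy <- data.frame(
  glucose = c(148, 85, NA, 89, 137),
  insulin = c(NA, NA, 94, 168, 88),
  mass    = c(33.6, 26.6, 23.3, NA, 43.1)
)

# Count the missing values per column before imputing
print(colSums(is.na(toy)))

# Replace each NA with the median of the observed values in that column
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
toy_imputed <- as.data.frame(lapply(toy, impute_median))

print(toy_imputed)
```

Median imputation is easy to reason about but ignores relationships between columns; model-based approaches may be preferable on real data.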

The exercise has been repeated with some additional nuances, such as using multicore/parallel processing in order to make the cross-validations run faster.
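The reason cross-validation benefits from multiple cores is that each fold is fitted and scored independently of the others. As an illustration of the idea that needs no extra packages, the sketch below uses the parallel package that ships with R on the built-in mtcars dataset; the four folds, the linear model, and the helper score_fold are assumptions made only for this example.

```r
library(parallel)

set.seed(100)
data(mtcars)

# Detect the available cores instead of hard-coding a number; fall back to 1
n_cores <- max(1L, detectCores(), na.rm = TRUE)
# mclapply relies on forking, which is not available on Windows
if (.Platform$OS.type == "windows") n_cores <- 1L

k <- 4
fold <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# Each fold is fitted and scored on its own, so folds can run in parallel
score_fold <- function(i) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[fold != i, ])
  err <- mtcars$mpg[fold == i] - predict(fit, newdata = mtcars[fold == i, ])
  sqrt(mean(err^2))
}
cv_rmse <- unlist(mclapply(seq_len(k), score_fold, mc.cores = min(k, n_cores)))

cat("Cores used:", min(k, n_cores), "\n")
cat("RMSE per fold:", round(cv_rmse, 2), "\n")
```

On a dataset this small the parallel overhead outweighs any gain; the benefit appears when each fold takes seconds or minutes to fit.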

To leverage multicore processing, install the package doMC using the following code:

install.packages("doMC")  # Install package for multicore processing
install.packages("nnet")  # Install package for neural networks in R

Now we will run the program as shown in the code here:

# Load the library doMC 
library(doMC) 
 
# Register 8 cores (adjust to the number of cores available on your machine)
registerDoMC(cores = 8)
 
# Set seed to create a reproducible example 
set.seed(100) 
 
# Load the PimaIndiansDiabetes2 dataset 
data("PimaIndiansDiabetes2", package = "mlbench")
diab <- PimaIndiansDiabetes2
 
# This...

Summary


In this chapter, we learned about the fundamentals of machine learning, its different types, such as supervised and unsupervised learning, and major concepts such as data pre-processing, data imputation, and managing imbalanced classes, among other topics.

We also learned about the key distinctions between terms that are used interchangeably today, in particular AI and machine learning. Artificial intelligence deals with a vast array of topics, such as game theory, sociology, constrained optimization, and machine learning; AI is much broader in scope than machine learning.

Machine learning facilitates AI; that is, machine learning algorithms are used to create systems that are artificially intelligent, but the two differ in scope. Regression (finding the line of best fit given a set of points) can be considered a machine learning algorithm, but it is much less likely to be seen as an AI algorithm (conceptually, although it technically could be).

In the...
