
You're reading from Practical Big Data Analytics

Product type: Book

Published in: Jan 2018

Reading level: Intermediate

Publisher: Packt

ISBN-13: 9781783554393

Edition: 1st Edition

Languages: Java

Tools: Hadoop, Apache Spark

Concepts: Big Data

Author (1)

Nataraj Dasgupta

Chapter 7. An Introduction to Machine Learning Concepts

Machine learning has become a commonplace topic in our day-to-day lives. Advances in the field have been so dramatic that today even cell phones incorporate machine learning and artificial intelligence features that can respond to, and act on, human instructions.

A subject that was once limited to university classrooms has transformed into a full-fledged industry, pervading our daily lives in ways we could not have envisioned even just a few years ago.

The aim of this chapter is to introduce the reader to the underpinnings of machine learning and explain the concepts in simple, lucid terms that will help readers become familiar with the core ideas in the subject. We'll start off with a high-level overview of machine learning, and explain the different categories and how to distinguish them. We'll explain some of the salient concepts in machine learning, such as data pre-processing, feature engineering...

What is machine learning?

Machine learning is not a new subject; it has existed in academia for well over 70 years as a formal discipline, though under different names: first statistics (and, more generally, mathematics), then artificial intelligence (AI), and today machine learning. While the related subject areas of statistics and AI are just as prevalent, machine learning has carved out a separate niche and become an independent discipline in and of itself.

In simple terms, machine learning involves predicting future events based on historical data. We see it manifested in our day-to-day lives and indeed we employ, knowingly or otherwise, principles of machine learning on a daily basis.

When we casually comment on whether a movie will succeed at the box office using our understanding of the popularity of the individuals in the lead roles, we are applying machine learning, albeit subconsciously. Our understanding of the characters in the lead roles has been shaped over years of watching...

Factors that led to the success of machine learning

Given that machine learning, as a subject, has existed for many decades, it raises the question: why did it not become as popular as it is today much sooner? Indeed, the theories behind complex machine learning algorithms such as neural networks were well known by the late 1990s, and the theoretical foundation had been established well before that.

There are a few factors that can be attributed to the success of machine learning:

The Internet: The web played a critical role in democratizing information and connecting people in an unprecedented way. It made the exchange of information simple in a way that could not have been achieved through the pre-existing methods of print media communication. Not only did the web transform and revolutionize the dissemination of information, it also opened up new opportunities. Google's PageRank, as mentioned earlier, was one of the first large-scale and highly visible successes in the application of statistical...

Machine learning, statistics, and AI

Machine learning is a term with various synonyms: names that result either from marketing activities by corporations or from terms that have simply been used interchangeably. Although some may argue that they have different implications, they all ultimately refer to machine learning as a subject that facilitates the prediction of future events using historical information.

Commonly heard terms for machine learning include predictive analysis, predictive analytics, predictive modeling, and many others. As such, unless the entity publishing the material explains its interpretation of the term and, more specifically, how it differs, it is generally safe to assume that it is referring to machine learning. This is often a source of confusion among those new to the subject, largely due to the misuse and overuse of technical verbiage.

Statistics, on the other hand, is a distinct subject area that has been well known for over 200 years. The word...

Categories of machine learning

Arthur Samuel coined the term machine learning in 1959 while at IBM. A popular definition of machine learning is due to Samuel, who, it is believed, called machine learning a field of computer science that gives computers the ability to learn without being explicitly programmed.

Tom Mitchell, in 1998, gave a more specific definition, calling machine learning the study of algorithms that improve their performance P at some task T with experience E.

A simple explanation would help to illustrate this concept. By now, most of us are familiar with the concept of spam in emails. Most email accounts also contain a separate folder known as Junk, Spam, or a related term. A cursory check of the folders will usually indicate the presence of several emails, many of which were presumably unsolicited and contain meaningless information.

The mere task of categorizing emails as spam and moving them to a folder involves the application of machine learning. Andrew Ng highlighted...

Subdividing supervised machine learning

Supervised machine learning can be further subdivided into exercises that involve either of the following:

Classification
Regression

The concepts are quite straightforward.

Classification involves a machine learning task that has a discrete, that is, categorical, outcome. A categorical variable takes its value from a fixed set of labels, such as types of fruit, kinds of tree, colors, or true/false.

The outcome variables in classification exercises are also known as discrete or categorical variables.

Some examples include:

Identifying the fruit given size, weight, and shape
Identifying numbers given a set of images of numbers (as shown in the earlier chapter)
Identifying objects on the streets
Identifying playing cards as diamonds, spades, hearts and clubs
Identifying the class rank of a student based on the student's grade
The last one might not seem obvious, but a rank, that is, 1st, 2nd, 3rd, denotes a fixed category. A student could rank, say, 1st or 2nd, but not have a rank of 1.5!
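As a minimal sketch of the distinction, the snippet below inspects outcome variables in R. It assumes the built-in iris and mtcars datasets as stand-ins (iris is not used in this excerpt), and the class_rank vector is a hypothetical example constructed purely for illustration.

```r
# Classification: the outcome is categorical (a factor in R)
data(iris)
species_is_categorical <- is.factor(iris$Species)
species_classes <- levels(iris$Species)   # the fixed set of possible outcomes

# A class rank is an ordered category: only the listed levels can occur,
# so there is no value "between" 1st and 2nd (hypothetical data)
class_rank <- factor(c("2nd", "1st", "3rd"),
                     levels = c("1st", "2nd", "3rd"),
                     ordered = TRUE)
second_after_first <- class_rank[1] > class_rank[2]

# Regression, by contrast, has a continuous numeric outcome
data(mtcars)
mpg_is_numeric <- is.numeric(mtcars$mpg)
```

Storing a categorical outcome as a factor is a common R convention, because many modeling functions use the type of the outcome to decide between classification and regression.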

Images of some atypical...

Common terminologies in machine learning

In machine learning, you'll often hear the terms features, predictors, and independent variables. They are all one and the same: they refer to the variables that are used to predict an outcome. The outcome, in turn, is also called the dependent variable. In our previous example of cars, the variables cyl (Cylinder), hp (Horsepower), wt (Weight), and gear (Gear) are the predictors and mpg (Miles Per Gallon) is the outcome.
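The cars example can be sketched in R as follows. This is only an illustration under stated assumptions: it uses the built-in mtcars dataset and an ordinary linear model, which may not be the model used elsewhere in the chapter, and the variable names predictors, outcome, and fit are chosen here for clarity.

```r
data(mtcars)

# Predictors (features) and the outcome, named as in the text
predictors <- c("cyl", "hp", "wt", "gear")
outcome <- "mpg"

# Build the formula mpg ~ cyl + hp + wt + gear from the names
model_formula <- reformulate(predictors, response = outcome)

# Fit a linear model: the outcome is modeled as a function of the predictors
fit <- lm(model_formula, data = mtcars)

# One coefficient per predictor, plus the intercept
coefficients_fit <- coef(fit)

# Predict mpg for the first three cars from their predictor values alone
predicted_mpg <- predict(fit, newdata = mtcars[1:3, predictors])
```

Note that the data passed to predict() contains only the predictor columns; the outcome is what the model produces.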

In simpler terms, taking the example of a spreadsheet, the columns used to make the prediction are known as features, predictors, or independent variables. As an example, if we were given a dataset of toll booth charges and were tasked with predicting the amount charged based on the time of day and other factors, a hypothetical example could be as follows:

In this spreadsheet, the columns date, time, agency, type, prepaid, and rate are the features or predictors, whereas the column amount is our outcome or dependent variable (what we are predicting).

The value of amount depends on the value of...

The core concepts in machine learning

There are many important concepts in machine learning; we'll go over some of the more common topics. Machine learning involves a multi-step process that starts with data acquisition and data mining, and eventually leads to building predictive models.

The key aspects of the model-building process involve:

Data pre-processing: Pre-processing and feature selection (for example, centering and scaling, class imbalances, and variable importance)
Train/test splits and cross-validation:
- Creating the training set (say, 80 percent of the data)
- Creating the test set (the remaining ~20 percent of the data)
- Performing cross-validation
Create model, get predictions:
- Which algorithms should you try?
- What accuracy measures are you trying to optimize?
- What tuning parameters should you use?
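The steps above can be sketched in base R as follows. This is a simplified illustration rather than the chapter's own workflow: it assumes the built-in mtcars dataset, a linear model as the algorithm, RMSE as the accuracy measure, and 5 folds, all chosen here for brevity.

```r
set.seed(100)
data(mtcars)
predictors <- c("cyl", "hp", "wt", "gear")

# Train/test split: 80 percent for training, the rest for testing
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train_raw <- mtcars[train_idx, ]
test_raw <- mtcars[-train_idx, ]

# Pre-processing: center and scale using statistics from the training set only,
# then apply the same transformation to the test set
train_x <- scale(train_raw[, predictors])
test_x <- scale(test_raw[, predictors],
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))
train <- data.frame(train_x, mpg = train_raw$mpg)
test <- data.frame(test_x, mpg = test_raw$mpg)

# Cross-validation: assign each training row to one of k folds,
# fit on k - 1 folds and measure RMSE on the held-out fold
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(train)))
cv_rmse <- sapply(seq_len(k), function(i) {
  fit <- lm(mpg ~ ., data = train[folds != i, ])
  pred <- predict(fit, newdata = train[folds == i, ])
  sqrt(mean((train$mpg[folds == i] - pred)^2))
})

# Create the final model on all training data and get test-set predictions
final_fit <- lm(mpg ~ ., data = train)
test_pred <- predict(final_fit, newdata = test)
test_rmse <- sqrt(mean((test$mpg - test_pred)^2))
```

Computing the centering and scaling values on the training set alone keeps information about the test set from leaking into the model.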

Data management steps in machine learning

Pre-processing, or more generally processing the data, is an integral part of most machine learning exercises. A dataset that you start out with is seldom going...

Leveraging multicore processing in the model

The exercise in the previous section is repeated here using the PimaIndiansDiabetes2 dataset instead. This dataset contains several missing values. As a result, we will first impute the missing values and then run the machine learning example.
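The imputation method used in the chapter is not visible in this excerpt. As a hedged illustration of the idea only, the sketch below applies simple median imputation to the built-in airquality dataset, which also contains missing values; both the dataset and the helper impute_median are assumptions made here and not part of the original exercise.

```r
data(airquality)

# Count the missing values in each column before imputation
missing_before <- colSums(is.na(airquality))

# Hypothetical helper: replace each NA with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the helper column by column and rebuild the data frame
imputed <- as.data.frame(lapply(airquality, impute_median))

missing_after <- colSums(is.na(imputed))
```

Median imputation is one of the simplest options; model-based approaches generally preserve relationships between variables better.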

The exercise has been repeated with some additional nuances, such as using multicore/parallel processing in order to make the cross-validations run faster.

To leverage multicore processing, install the package doMC using the following code:

install.packages("doMC")  # Install package for multicore processing
install.packages("nnet")  # Install package for neural networks in R

Now we will run the program as shown in the code here:

# Load the library doMC
library(doMC)

# Register 8 cores for parallel processing (adjust to match your machine)
registerDoMC(cores = 8)

# Set seed to create a reproducible example
set.seed(100)

# Load the PimaIndiansDiabetes2 dataset
data("PimaIndiansDiabetes2", package = "mlbench")
diab <- PimaIndiansDiabetes2
 
# This...
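Since the remainder of the listing is not shown here, the following is a separate, minimal sketch of why multiple cores help with cross-validation: each fold can be fitted independently. It deliberately swaps in base R's parallel package (mclapply) in place of doMC, and uses the built-in mtcars dataset with a linear model; these choices, and the names fold_rmse and n_cores, are assumptions for illustration.

```r
library(parallel)

# mclapply forks processes, which is not supported on Windows,
# so fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") {
  1L
} else {
  max(1L, min(2L, detectCores(), na.rm = TRUE))
}

set.seed(100)
data(mtcars)

# Assign every row to one of k folds
k <- 4
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# Fit on all folds but one and return the RMSE on the held-out fold
fold_rmse <- function(i) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])
  pred <- predict(fit, newdata = mtcars[folds == i, ])
  sqrt(mean((mtcars$mpg[folds == i] - pred)^2))
}

# The folds are independent, so they can be evaluated on separate cores
rmse_by_fold <- unlist(mclapply(seq_len(k), fold_rmse, mc.cores = n_cores))
mean_rmse <- mean(rmse_by_fold)
```

With a model this small the parallel overhead outweighs the gain; the benefit appears when each fold takes seconds or minutes to fit.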

Summary

In this chapter, we learned about the fundamentals of machine learning, its different types, such as supervised and unsupervised learning, and major concepts such as data pre-processing, data imputation, and managing imbalanced classes.

We also learned about the key distinctions between terms that are used interchangeably today, in particular AI and machine learning. Artificial intelligence deals with a vast array of topics, such as game theory, sociology, constrained optimization, and machine learning; AI is much broader in scope than machine learning.

Machine learning facilitates AI; namely, machine learning algorithms are used to create systems that are artificially intelligent, but the two differ in scope. Solving a regression problem (finding the line of best fit given a set of points) can be considered machine learning, but it is much less likely to be seen as AI (conceptually, although it technically could be).

In the...


Author (1)

Nataraj Dasgupta

Nataraj Dasgupta is the vice president of advanced analytics at RxDataScience Inc. Nataraj has been in the IT industry for more than 19 years, and has worked in the technical and analytics divisions of Philip Morris, IBM, UBS Investment Bank, and Purdue Pharma. At Purdue Pharma, Nataraj led the data science division, where he developed the company's award-winning big data and machine learning platform. Prior to Purdue, at UBS, he held the role of Associate Director, working with high-frequency and algorithmic trading technologies in the foreign exchange trading division of the bank.
