Reader small image

You're reading from  The Machine Learning Workshop - Second Edition

Product typeBook
Published inJul 2020
Reading LevelIntermediate
PublisherPackt
ISBN-139781839219061
Edition2nd Edition
Languages
Tools
Right arrow
Author (1)
Hyatt Saleh
Hyatt Saleh
author image
Hyatt Saleh

Hyatt Saleh discovered the importance of data analysis for understanding and solving real-life problems after graduating from college as a business administrator. Since then, as a self-taught person, she not only works as a machine learning freelancer for many companies globally, but has also founded an artificial intelligence company that aims to optimize everyday processes. She has also authored Machine Learning Fundamentals, by Packt Publishing.
Read more about Hyatt Saleh

Right arrow

1. Introduction to Scikit-Learn

Overview

This chapter introduces the two main topics of this book: machine learning and scikit-learn. By reading this book, you will learn about the concept and application of machine learning. You will also learn about the importance of data in machine learning, as well as the key aspects of data preprocessing to solve a variety of data problems. This chapter will also cover the basic syntax of scikit-learn. By the end of this chapter, you will have a firm understanding of scikit-learn's syntax so that you can solve simple data problems, which will be the starting point for developing machine learning solutions.

Introduction

Machine learning (ML), without a doubt, is one of the most relevant technologies nowadays as it aims to convert information (data) into knowledge that can be used to make informed decisions. In this chapter, you will learn about the different applications of ML in today's world, as well as the role that data plays. This will be the starting point for introducing different data problems throughout this book that you will be able to solve using scikit-learn.

Scikit-learn is a well-documented and easy-to-use library that facilitates the application of ML algorithms by using simple methods, which ultimately enables beginners to model data without the need for deep knowledge of the math behind the algorithms. Additionally, thanks to the ease of use of this library, it allows the user to implement different approximations (that is, create different models) for a data problem. Moreover, by removing the task of coding the algorithm, scikit-learn allows teams to focus their...

Introduction to Machine Learning

Machine learning (ML) is a subset of Artificial Intelligence (AI) that consists of a wide variety of algorithms capable of learning from the data that is being fed to them, without being specifically programmed for a task. This ability to learn from data allows the algorithms to create models that are capable of solving complex data problems by finding patterns in historical data and improving them as new data is fed to the models.

These different ML algorithms use different approximations to solve a task (such as probability functions), but the key element is that they are able to consider a countless number of variables for a particular data problem, making the final model better at solving the task than humans are. The models that are created using ML algorithms are created to find patterns in the input data so that those patterns can be used to make informed predictions in the future.

Applications of ML

Some of the popular tasks that can...

Scikit-Learn

Created in 2007 by David Cournapeau as part of a Google Summer of Code project, scikit-learn is an open source Python library made to facilitate the process of building models based on built-in ML and statistical algorithms, without the need for hardcoding. The main reasons for its popular use are its complete documentation, its easy-to-use API, and the many collaborators who work every day to improve the library.

Note

You can find the documentation for scikit-learn at http://scikit-learn.org.

Scikit-learn is mainly used to model data, and not as much to manipulate or summarize data. It offers its users an easy-to-use, uniform API to apply different models with little learning effort, and no real knowledge of the math behind it is required.

Note

Some of the math topics that you need to know about to understand the models are linear algebra, probability theory, and multivariate calculus. For more information on these models, visit https://towardsdatascience...

Data Representation

The main objective of ML is to build models by interpreting data. To do so, it is highly important to feed the data in a way that is readable by the computer. To feed data into a scikit-learn model, it must be represented as a table or matrix of the required dimensions, which we will discuss in the following section.

Tables of Data

Most tables that are fed into ML problems are two-dimensional, meaning that they contain rows and columns. Conventionally, each row represents an observation (an instance), whereas each column represents a characteristic (feature) of each observation.

The following table is a fragment of a sample dataset of scikit-learn. The purpose of the dataset is to differentiate from among three types of iris plants based on their characteristics. Hence, in the following table, each row embodies a plant and each column denotes the value of that feature for every plant:

Figure 1.2: A table showing the first 10 instances...

Data Preprocessing

Data preprocessing is a very critical step for developing ML solutions as it helps make sure that the model is not trained on biased data. It has the capability to improve a model's performance, and it is often the reason why the same algorithm for the same data problem works better for a programmer that has done an outstanding job preprocessing the dataset.

For the computer to be able to understand the data proficiently, it is necessary to not only feed the data in a standardized way but also make sure that the data does not contain outliers or noisy data, or even missing entries. This is important because failing to do so might result in the algorithm making assumptions that are not true to the data. This will cause the model to train at a slower pace and to be less accurate due to misleading interpretations of data.

Moreover, data preprocessing does not end there. Models do not work the same way, and each one makes different assumptions. This means...

Scikit-Learn API

The objective of the scikit-learn API is to provide an efficient and unified syntax to make ML accessible to non-ML experts, as well as to facilitate and popularize its use among several industries.

How Does It Work?

Although it has many collaborators, the scikit-learn API was built and has been updated by considering a set of principles that prevent framework code proliferation, where different code performs similar functionalities. On the contrary, it promotes simple conventions and consistency. Due to this, the scikit-learn API is consistent among all models, and once the main functionalities have been learned, it can be used widely.

The scikit-learn API is divided into three complementary interfaces that share a common syntax and logic: the estimator, the predictor, and the transformer. The estimator interface is used for creating models and fitting the data into them; the predictor, as its name suggests, is used to make predictions based on the models...

Supervised and Unsupervised Learning

ML is divided into two main categories: supervised and unsupervised learning.

Supervised Learning

Supervised learning consists of understanding the relationship between a given set of features and a target value, also known as a label or class. For instance, it can be used for modeling the relationship between a person's demographic information and their ability to pay loans, as shown in the following table:

Figure 1.18: The relationship between a person's demographic information and the ability to pay loans

Models trained to foresee these relationships can then be applied to predict labels for new data. As we can see from the preceding example, a bank that builds such a model can then input data from loan applicants to determine if they are likely to pay back the loan.

These models can be further divided into classification and regression tasks, which are explained as follows.

Classification tasks...

Summary

ML consists of constructing models that are able to convert data into knowledge that can be used to make decisions, some of which are based on complicated mathematical concepts to understand data. Scikit-learn is an open source Python library that is meant to facilitate the process of applying these models to data problems, without much complex math knowledge required.

This chapter explained the key steps of preprocessing your input data, from separating the features from the target, to dealing with messy data and rescaling the values of the data. All these steps should be performed before diving into training a model as they help to improve the training times, as well as the performance of the models.

Next, the different components of the scikit-learn API were explained: the estimator, the predictor, and the transformer. Finally, this chapter covered the difference between supervised and unsupervised learning, and the most popular algorithms of each type of learning...

lock icon
The rest of the chapter is locked
You have been reading a chapter from
The Machine Learning Workshop - Second Edition
Published in: Jul 2020Publisher: PacktISBN-13: 9781839219061
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Hyatt Saleh

Hyatt Saleh discovered the importance of data analysis for understanding and solving real-life problems after graduating from college as a business administrator. Since then, as a self-taught person, she not only works as a machine learning freelancer for many companies globally, but has also founded an artificial intelligence company that aims to optimize everyday processes. She has also authored Machine Learning Fundamentals, by Packt Publishing.
Read more about Hyatt Saleh