
You're reading from The Python Workshop, Second Edition

Product type: Book
Published in: Nov 2022
Publisher: Packt
ISBN-13: 9781804610619
Edition: 2nd Edition
Authors (5):

Corey Wade

Corey Wade, M.S. Mathematics, M.F.A. Writing & Consciousness, is the founder and director of Berkeley Coding Academy where he teaches Machine Learning and AI to teens from all over the world. Additionally, Corey chairs the Math Department at Berkeley Independent Study where he has received multiple grants to run after-school coding programs to help bridge the tech skills gap. Additional experiences include teaching Natural Language Processing with Hello World, developing Data Science curricula with Pathstream, and publishing statistics and machine learning models with Towards Data Science, Springboard, and Medium.
Read more about Corey Wade

Mario Corchero Jiménez

Mario Corchero Jiménez is a senior software developer at Bloomberg. He leads the Python infrastructure team in London, enabling the company to work effectively in Python and building company-wide libraries and tools. His professional experience is mainly in C++ and Python, and he has contributed some patches to multiple Python open source projects. He is a PSF fellow, having received the Q3 2018 PSF Community Award, is vice president of Python España (the Python Spain association), and has served as Chair of PyLondinium, PyConES17, and PyCon Charlas at PyCon 2018. Mario is passionate about the Python community, open source, and inner source.
Read more about Mario Corchero Jiménez

Andrew Bird

Andrew Bird is the data and analytics manager of Vesparum Capital. He leads the software and data science teams at Vesparum, overseeing full-stack web development in Django/React. He is an Australian actuary (FIAA, CERA) who has previously worked with Deloitte Consulting in financial services. Andrew also currently works as a full-stack developer for Draftable Pvt. Ltd. He manages the ongoing development of the donation portal for the Effective Altruism Australia website on a voluntary basis. Andrew has also co-written one of our bestselling titles, "The Python Workshop".
Read more about Andrew Bird

Dr. Lau Cher Han

Dr. Lau Cher Han is a chief data scientist and currently the CEO of LEAD, an institution that provides programs on data science, full-stack web development, and digital marketing. Well versed in programming languages such as JavaScript, Python, and C#, he is experienced with web frameworks including the MEAN stack, ASP.NET, and Django. He is multilingual, speaking English, Chinese, and Bahasa fluently, and his knowledge of Chinese extends to its dialects: Hokkien, Teochew, and Cantonese.
Read more about Dr. Lau Cher Han

Graham Lee

Graham Lee is an experienced programmer and writer. He has written books including Professional Cocoa Application Security, Test-Driven iOS Development, APPropriate Behaviour, and APPosite Concerns. He is a developer who has been programming for long enough to want to start telling other people about the mistakes he has made, in the hope that they'll avoid repeating them. In his case, this means having worked for about 12 years as a professional. His first programming experience can hardly be called professional at all, as it was in BASIC on a Dragon 32 microcomputer.
Read more about Graham Lee


Overview

By the end of this chapter, you will be able to:

  • Apply machine learning (ML) algorithms to solve different problems
  • Compare, contrast, and apply different types of ML algorithms, including linear regression, logistic regression, decision trees, random forests, Naive Bayes, Adaptive Boosting (AdaBoost), and Extreme Gradient Boosting (XGBoost)
  • Analyze overfitting and implement regularization
  • Work with GridSearchCV and RandomizedSearchCV to adjust hyperparameters
  • Evaluate algorithms using a confusion matrix and cross-validation
  • Solve real-world problems using the ML algorithms outlined here

Introduction

Computer algorithms enable machines to learn from data. The more data an algorithm receives, the more capable the algorithm is of detecting underlying patterns within the data. In Chapter 10, Data Analytics with pandas and NumPy, you learned how to view and analyze big data with pandas and NumPy. In this chapter, we will now extend these concepts to algorithms that learn from data.

Consider how a child learns to identify a cat. Generally speaking, a child learns by having someone point out “That’s a cat”, “No, that’s a dog”, and so on. After enough cats and non-cats have been pointed out, the child knows how to identify a cat.

ML implements the same general approach. A convolutional neural network (CNN) is an ML algorithm that distinguishes between images. Upon receiving images labeled cats and non-cats, the algorithm looks for underlying patterns within the pixels by adjusting the parameters of an equation until it finds...

Technical requirements

You can find the code files for this chapter on GitHub at https://github.com/PacktPublishing/The-Python-Workshop-Second-Edition/tree/main/Chapter11.

Introduction to linear regression

ML is the ability of computers to learn from data. The power of ML comes from making future predictions based on the data received. Today, ML is used all over the world to predict the weather, stock prices, profits, errors, clicks, purchases, and the words that complete a sentence; to recommend movies; to recognize faces; and much more.

The unparalleled success of ML has led to a paradigm shift in the way businesses make decisions. In the past, businesses made decisions based on who had the most influence, but now, the new idea is to make decisions based on data. Decisions are constantly being made about the future, and ML is the best tool at our disposal to convert raw data into actionable decisions.

The first step in building an ML algorithm is deciding what you want to predict. When looking at a DataFrame, the idea is to choose one column as the target column. The target column, by definition, is what the algorithm will be trained to predict.
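To make the target-column idea concrete, here is a minimal sketch with a small hypothetical DataFrame (the column names and values are invented for illustration, not taken from the chapter's dataset):

```python
# Choosing a target column and fitting a first model.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "rooms": [3, 4, 2, 5, 3, 4],
    "age":   [20, 5, 35, 10, 15, 8],
    "price": [250, 400, 180, 500, 300, 420],  # the target column
})

X = df.drop(columns="price")  # predictor columns
y = df["price"]               # target column: what the algorithm learns to predict

model = LinearRegression()
model.fit(X, y)
print(model.predict(X.head(1)))
```

Everything except the target column becomes the input, and the fitted model maps those inputs to predictions of the target.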

Recall...

Testing data with cross-validation

In cross-validation, also known as CV, the training data is split into five folds (any number will do, but five is standard). The ML algorithm is fit on all but one fold at a time and tested on the fold that was held out, cycling through so that each fold serves once as the test set. The result is five different training and test sets that are all representative of the same data. The mean of the five scores is usually taken as the accuracy of the model.

Note

For cross-validation, 5 folds is only one suggestion. Any natural number may be used, with 3 and 10 also being fairly common.

Cross-validation is a core tool for ML. Mean test scores on different folds are more reliable than one mean test score on the entire set, which we performed in the first exercise. When examining one test score, there is no way of knowing whether it is low or high. Five test scores give a better picture of the true accuracy of the model.

Cross-validation can be implemented in a variety of ways. A standard approach is to use cross_val_score,...
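As a minimal sketch of that approach, the following uses cross_val_score on a synthetic regression dataset (generated here purely for illustration, in place of the chapter's data):

```python
# Five-fold cross-validation with cross_val_score.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=4, noise=10, random_state=0)

# cv=5 splits the data into five folds; each fold serves once as the test set
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores)         # one R^2 score per fold
print(scores.mean())  # the mean is usually reported as the model's accuracy
```

Each entry in `scores` comes from a different train/test split, so the spread of the five numbers also hints at how stable the model is.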

Regularization – Ridge and Lasso

Regularization is an important concept in ML; it’s used to counteract overfitting. In the world of big data, it’s easy to overfit data to the training set. When this happens, the model will often perform badly on the test set, as indicated by mean_squared_error or some other error.

You may wonder why a test set is kept aside at all. Wouldn’t the most accurate ML model come from fitting the algorithm on all the data?

The answer, generally accepted by the ML community after research and experimentation, is no.

There are two main problems with fitting an ML model on all the data:

  • There is no way to test the model on unseen data. ML models are powerful when they make good predictions on new data. Models are trained on known results, but they perform in the real world on data that has never been seen before. It’s not vital to see how well a model fits known results (the training set), but it’s absolutely...
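A hedged sketch of how Ridge and Lasso slot in alongside plain linear regression, again using synthetic data in place of the chapter's dataset (the alpha values here are arbitrary starting points, not tuned choices):

```python
# Comparing LinearRegression, Ridge, and Lasso on held-out data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(type(model).__name__, round(mse, 2))
```

The regularized models shrink coefficients (Lasso can zero them out entirely), which is what counteracts overfitting when the training data is noisy or has many features.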

K-nearest neighbors, decision trees, and random forests

Are there other ML algorithms, besides LinearRegression(), that are suitable for the Boston Housing dataset? Absolutely. There are many regressors in the scikit-learn library that may be used. Regressors are a class of ML algorithms that are suitable for continuous target values. In addition to linear regression, Ridge, and Lasso, we can try k-nearest neighbors, decision trees, and random forests. These models perform well on a wide range of datasets. Let’s try them out and analyze them individually.

K-nearest neighbors

The idea behind k-nearest neighbors (KNN) is straightforward. When choosing the output of a row with an unknown label, the prediction is the same as the output of its k-nearest neighbors, where k may be any whole number.

For instance, let’s say that k=3. Given an unknown label, we take n columns for this row and place them in n-dimensional space. Then, we look for the three closest points...
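The k=3 case can be sketched with a handful of toy points (the values are invented to make the neighbor averaging easy to check by hand):

```python
# KNN regression with k=3 on one-dimensional toy data.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X, y)

# The prediction for x=2.5 averages the targets of its three nearest
# neighbors (x = 1, 2, 3): (1 + 2 + 3) / 3 = 2.0
print(knn.predict([[2.5]]))  # → [2.]
```

With real data the points live in n-dimensional space, one dimension per predictor column, but the averaging over the k closest points works the same way.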

Classification models

The Boston Housing dataset was great for regression because the target column took on continuous values without limit. There are many cases when the target column takes on one of two values, such as TRUE or FALSE, or falls into a grouping of three or more values, such as RED, BLUE, or GREEN. When the target column may be split into distinct categories, the group of ML models that you should try is referred to as classification.

To make things interesting, let’s load a new dataset used to detect pulsar stars in outer space. Go to https://packt.live/33SD0IM and click on Data Folder. Then, click on HTRU2.zip, as shown:

Figure 11.8 – Dataset directory on the UCI website

The dataset consists of 17,898 potential pulsar stars in space. But what are these pulsars? Pulsar stars rotate very quickly, so they have periodic light patterns. Radio frequency interference and noise, however, are attributes that make pulsars very hard to...
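Detecting pulsars is a binary classification task: each row is either a pulsar or it is not. As a hedged first sketch, the following uses a synthetic dataset in place of the HTRU2 download, with logistic regression as a baseline classifier:

```python
# A first classification model on synthetic binary data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```

The same fit/predict/score pattern carries over once the real pulsar data is loaded; only the DataFrame changes.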

Boosting algorithms

Random forests are a type of bagging algorithm. Bagging combines bootstrapping (selecting individual samples with replacement) and aggregation (combining all models into one ensemble). In practice, a random forest builds individual trees by randomly selecting rows of data, called samples, before combining (aggregating) all trees into one ensemble. Bagging algorithms are only as good as the trees that make them up.

A comparable ML algorithm is boosting. The idea behind boosting is to transform a weak learner into a strong learner by modifying the weights for the rows that the learner got wrong. A weak learner may have an error of 49%, hardly better than a coin flip. A strong learner, by contrast, may have an error rate of 1 or 2%. With enough iterations, weak learners can be transformed into very strong learners.

Unlike bagging algorithms, boosting algorithms can improve over time. After the initial model in a booster, called the base learner, all subsequent models...
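A minimal boosting sketch using AdaBoost, again with synthetic data standing in for the chapter's dataset. By default, scikit-learn's AdaBoostClassifier uses a depth-1 decision tree (a "stump") as the weak base learner:

```python
# Boosting: AdaBoost turns weak depth-1 stumps into a strong ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each successive stump upweights the rows the previous stumps got wrong
booster = AdaBoostClassifier(n_estimators=100, random_state=0)
mean_acc = cross_val_score(booster, X, y, cv=5).mean()
print(mean_acc)
```

Increasing `n_estimators` gives the booster more rounds in which to correct earlier mistakes, which is exactly the weak-to-strong transformation described above.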

Summary

In this chapter, you have learned how to build a variety of ML models to solve regression and classification problems. You have implemented linear regression, Ridge, Lasso, logistic regression, decision trees, random forests, Naive Bayes, AdaBoost, and XGBoost. You have learned about the importance of using cross-validation to split up your training set and test set. You have learned about the dangers of overfitting and how to correct it with regularization. You have learned how to fine-tune hyperparameters using GridSearchCV and RandomizedSearchCV. You have learned how to interpret imbalanced datasets with a confusion matrix and a classification report. You have also learned how to distinguish between bagging and boosting, and precision and recall.

The value of learning these skills is that you can make meaningful and accurate predictions from big data using some of the best ML models in the world today.

In the next chapter, you will improve your ML skills by learning...

