
You're reading from The Python Workshop, Second Edition

Product type: Book
Published in: Nov 2022
Publisher: Packt
ISBN-13: 9781804610619
Edition: 2nd Edition
Authors (5):

Corey Wade

Corey Wade, M.S. Mathematics, M.F.A. Writing & Consciousness, is the founder and director of Berkeley Coding Academy where he teaches Machine Learning and AI to teens from all over the world. Additionally, Corey chairs the Math Department at Berkeley Independent Study where he has received multiple grants to run after-school coding programs to help bridge the tech skills gap. Additional experiences include teaching Natural Language Processing with Hello World, developing Data Science curricula with Pathstream, and publishing statistics and machine learning models with Towards Data Science, Springboard, and Medium.
Read more about Corey Wade

Mario Corchero Jiménez

Mario Corchero Jiménez is a senior software developer at Bloomberg. He leads the Python infrastructure team in London, enabling the company to work effectively in Python and building company-wide libraries and tools. His professional experience is mainly in C++ and Python, and he has contributed some patches to multiple Python open source projects. He is a PSF fellow, having received the Q3 2018 PSF Community Award, is vice president of Python España (the Python Spain association), and has served as Chair of PyLondinium, PyConES17, and PyCon Charlas at PyCon 2018. Mario is passionate about the Python community, open source, and inner source.
Read more about Mario Corchero Jiménez

Andrew Bird

Andrew Bird is the data and analytics manager of Vesparum Capital. He leads the software and data science teams at Vesparum, overseeing full-stack web development in Django/React. He is an Australian actuary (FIAA, CERA) who has previously worked with Deloitte Consulting in financial services. Andrew also currently works as a full-stack developer for Draftable Pvt. Ltd. He manages the ongoing development of the donation portal for the Effective Altruism Australia website on a voluntary basis. Andrew has also co-written one of our bestselling titles, "The Python Workshop".
Read more about Andrew Bird

Dr. Lau Cher Han

Dr. Lau Cher Han is a chief data scientist and currently the CEO of LEAD, an institution that provides programs on data science, full-stack web development, and digital marketing. Well versed in programming languages such as JavaScript, Python, and C#, he is experienced with web frameworks including the MEAN stack, ASP.NET, and Django. He is multilingual, speaking English, Chinese, and Bahasa fluently, and his knowledge of Chinese extends to its dialects: Hokkien, Teochew, and Cantonese.
Read more about Dr. Lau Cher Han

Graham Lee

Graham Lee is an experienced programmer and writer. He has written books including Professional Cocoa Application Security, Test-Driven iOS Development, APPropriate Behaviour, and APPosite Concerns. He is a developer who has been programming for long enough to want to start telling other people about the mistakes he has made, in the hope that they'll avoid repeating them. In his case, this means having worked for about 12 years as a professional. His first programming experience can hardly be called professional at all, as it was in BASIC on a Dragon 32 microcomputer.
Read more about Graham Lee


Overview

By the end of this chapter, you will be able to:

  • Apply machine learning (ML) algorithms to solve different problems
  • Compare, contrast, and apply different types of ML algorithms, including linear regression, logistic regression, decision trees, random forests, Naive Bayes, Adaptive Boosting (AdaBoost), and Extreme Gradient Boosting (XGBoost)
  • Analyze overfitting and implement regularization
  • Work with GridSearchCV and RandomizedSearchCV to adjust hyperparameters
  • Evaluate algorithms using a confusion matrix and cross-validation
  • Solve real-world problems using the ML algorithms outlined here

Introduction

Computer algorithms enable machines to learn from data. The more data an algorithm receives, the more capable the algorithm is of detecting underlying patterns within the data. In Chapter 10, Data Analytics with pandas and NumPy, you learned how to view and analyze big data with pandas and NumPy. In this chapter, we will now extend these concepts to algorithms that learn from data.

Consider how a child learns to identify a cat. Generally speaking, a child learns by having someone point out “That’s a cat”, “No, that’s a dog”, and so on. After enough cats and non-cats have been pointed out, the child knows how to identify a cat.

ML implements the same general approach. A convolutional neural network (CNN) is an ML algorithm that distinguishes between images. Upon receiving images labeled cats and non-cats, the algorithm looks for underlying patterns within the pixels by adjusting the parameters of an equation until it finds...

Technical requirements

You can find the code files for this chapter on GitHub at https://github.com/PacktPublishing/The-Python-Workshop-Second-Edition/tree/main/Chapter11.

Introduction to linear regression

ML is the ability of computers to learn from data. The power of ML comes from making future predictions based on the data received. Today, ML is used all over the world to predict the weather, stock prices, profits, errors, clicks, purchases, and the words that complete a sentence; to recommend movies; to recognize faces; and much more.

The unparalleled success of ML has led to a paradigm shift in the way businesses make decisions. In the past, businesses made decisions based on who had the most influence, but now, the new idea is to make decisions based on data. Decisions are constantly being made about the future, and ML is the best tool at our disposal to convert raw data into actionable decisions.

The first step in building an ML algorithm is deciding what you want to predict. When looking at a DataFrame, the idea is to choose one column as the target column. The target column, by definition, is what the algorithm will be trained to predict.
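To make the target-column idea concrete, here is a minimal sketch with a small hypothetical DataFrame (the column names and values are invented for illustration, not taken from the chapter's dataset):

```python
# Choosing a target column and fitting a first model.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "rooms": [3, 4, 2, 5, 3, 4],
    "age":   [20, 5, 35, 10, 15, 8],
    "price": [250, 400, 180, 500, 300, 420],  # the target column
})

X = df.drop(columns="price")  # predictor columns
y = df["price"]               # target column: what the algorithm learns to predict

model = LinearRegression()
model.fit(X, y)
print(model.predict(X.head(1)))
```

Everything except the target column becomes the input, and the fitted model maps those inputs to predictions of the target.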

Recall...

Testing data with cross-validation

In cross-validation, also known as CV, the training data is split into five folds (any number will do, but five is standard). The ML algorithm is fit on all but one fold at a time and tested on the fold that was held out, cycling through so that each fold serves once as the test set. The result is five different training and test sets that are all representative of the same data. The mean of the five scores is usually taken as the accuracy of the model.

Note

For cross-validation, 5 folds is only one suggestion. Any natural number may be used, with 3 and 10 also being fairly common.

Cross-validation is a core tool for ML. Mean test scores on different folds are more reliable than one mean test score on the entire set, which we performed in the first exercise. When examining one test score, there is no way of knowing whether it is low or high. Five test scores give a better picture of the true accuracy of the model.

Cross-validation can be implemented in a variety of ways. A standard approach is to use cross_val_score,...
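As a minimal sketch of that approach, the following uses cross_val_score on a synthetic regression dataset (generated here purely for illustration, in place of the chapter's data):

```python
# Five-fold cross-validation with cross_val_score.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=4, noise=10, random_state=0)

# cv=5 splits the data into five folds; each fold serves once as the test set
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores)         # one R^2 score per fold
print(scores.mean())  # the mean is usually reported as the model's accuracy
```

Each entry in `scores` comes from a different train/test split, so the spread of the five numbers also hints at how stable the model is.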

Regularization – Ridge and Lasso

Regularization is an important concept in ML; it’s used to counteract overfitting. In the world of big data, it’s easy to overfit data to the training set. When this happens, the model will often perform badly on the test set, as indicated by mean_squared_error or some other error.

You may wonder why a test set is kept aside at all. Wouldn’t the most accurate ML model come from fitting the algorithm on all the data?

The answer, generally accepted by the ML community after research and experimentation, is no.

There are two main problems with fitting an ML model on all the data:

  • There is no way to test the model on unseen data. ML models are powerful when they make good predictions on new data. Models are trained on known results, but they perform in the real world on data that has never been seen before. It’s not vital to see how well a model fits known results (the training set), but it’s absolutely...
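A hedged sketch of how Ridge and Lasso slot in alongside plain linear regression, again using synthetic data in place of the chapter's dataset (the alpha values here are arbitrary starting points, not tuned choices):

```python
# Comparing LinearRegression, Ridge, and Lasso on held-out data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(type(model).__name__, round(mse, 2))
```

The regularized models shrink coefficients (Lasso can zero them out entirely), which is what counteracts overfitting when the training data is noisy or has many features.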

K-nearest neighbors, decision trees, and random forests

Are there other ML algorithms, besides LinearRegression(), that are suitable for the Boston Housing dataset? Absolutely. There are many regressors in the scikit-learn library that may be used. Regressors are a class of ML algorithms that are suitable for continuous target values. In addition to linear regression, Ridge, and Lasso, we can try k-nearest neighbors, decision trees, and random forests. These models perform well on a wide range of datasets. Let’s try them out and analyze them individually.

K-nearest neighbors

The idea behind k-nearest neighbors (KNN) is straightforward. When choosing the output of a row with an unknown label, the prediction is the same as the output of its k-nearest neighbors, where k may be any whole number.

For instance, let’s say that k=3. Given an unknown label, we take n columns for this row and place them in n-dimensional space. Then, we look for the three closest points...
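The k=3 case can be sketched with a handful of toy points (the values are invented to make the neighbor averaging easy to check by hand):

```python
# KNN regression with k=3 on one-dimensional toy data.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X, y)

# The prediction for x=2.5 averages the targets of its three nearest
# neighbors (x = 1, 2, 3): (1 + 2 + 3) / 3 = 2.0
print(knn.predict([[2.5]]))  # → [2.]
```

With real data the points live in n-dimensional space, one dimension per predictor column, but the averaging over the k closest points works the same way.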

Classification models

The Boston Housing dataset was great for regression because the target column took on continuous values without limit. There are many cases when the target column takes on one of two values, such as TRUE or FALSE, or falls into a grouping of three or more values, such as RED, BLUE, or GREEN. When the target column may be split into distinct categories, the group of ML models that you should try is referred to as classification.

To make things interesting, let’s load a new dataset used to detect pulsar stars in outer space. Go to https://packt.live/33SD0IM and click on Data Folder. Then, click on HTRU2.zip, as shown:

Figure 11.8 – Dataset directory on the UCI website

The dataset consists of 17,898 potential pulsar stars in space. But what are these pulsars? Pulsar stars rotate very quickly, so they have periodic light patterns. Radio frequency interference and noise, however, are attributes that make pulsars very hard to...
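Detecting pulsars is a binary classification task: each row is either a pulsar or it is not. As a hedged first sketch, the following uses a synthetic dataset in place of the HTRU2 download, with logistic regression as a baseline classifier:

```python
# A first classification model on synthetic binary data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```

The same fit/predict/score pattern carries over once the real pulsar data is loaded; only the DataFrame changes.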

Boosting algorithms

Random forests are a type of bagging algorithm. Bagging combines bootstrapping (selecting individual samples with replacement) and aggregation (combining all models into one ensemble). In practice, a random forest builds individual trees by randomly selecting rows of data, called samples, before combining (aggregating) all trees into one ensemble. Bagging algorithms are only as good as the trees that make them up.

A comparable ML algorithm is boosting. The idea behind boosting is to transform a weak learner into a strong learner by modifying the weights for the rows that the learner got wrong. A weak learner may have an error of 49%, hardly better than a coin flip. A strong learner, by contrast, may have an error rate of 1 or 2%. With enough iterations, weak learners can be transformed into very strong learners.

Unlike bagging algorithms, boosting algorithms can improve over time. After the initial model in a booster, called the base learner, all subsequent models...
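A minimal boosting sketch using AdaBoost, again with synthetic data standing in for the chapter's dataset. By default, scikit-learn's AdaBoostClassifier uses a depth-1 decision tree (a "stump") as the weak base learner:

```python
# Boosting: AdaBoost turns weak depth-1 stumps into a strong ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each successive stump upweights the rows the previous stumps got wrong
booster = AdaBoostClassifier(n_estimators=100, random_state=0)
mean_acc = cross_val_score(booster, X, y, cv=5).mean()
print(mean_acc)
```

Increasing `n_estimators` gives the booster more rounds in which to correct earlier mistakes, which is exactly the weak-to-strong transformation described above.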

Summary

In this chapter, you have learned how to build a variety of ML models to solve regression and classification problems. You have implemented linear regression, Ridge, Lasso, logistic regression, decision trees, random forests, Naive Bayes, AdaBoost, and XGBoost. You have learned about the importance of using cross-validation to split up your training set and test set. You have learned about the dangers of overfitting and how to correct it with regularization. You have learned how to fine-tune hyperparameters using GridSearchCV and RandomizedSearchCV. You have learned how to interpret imbalanced datasets with a confusion matrix and a classification report. You have also learned how to distinguish between bagging and boosting, and precision and recall.

The value of learning these skills is that you can make meaningful and accurate predictions from big data using some of the best ML models in the world today.

In the next chapter, you will improve your ML skills by learning...

