You're reading from Hands-On Web Scraping with Python - Second Edition

Product type: Book
Published in: Oct 2023
Publisher: Packt
ISBN-13: 9781837636211
Edition: 2nd Edition
Author (1)
Anish Chapagain

Anish Chapagain is a software engineer with a passion for data science, its processes, and Python programming, which began around 2007. He has been working with web scraping and analysis-related tasks for more than 5 years, and is currently pursuing freelance projects in the web scraping domain. Anish previously worked as a trainer, web/software developer, and as a banker, where he was exposed to data and gained further insights into topics including data analysis, visualization, data mining, information processing, and knowledge discovery. He has an MSc in computer systems from Bangor University (University of Wales), United Kingdom, and an Executive MBA from Himalayan Whitehouse International College, Kathmandu, Nepal.

Machine Learning and Web Scraping

So far, we have learned about data extraction, data storage, and acquiring and analyzing information from data by using a number of Python libraries. This chapter will provide you with introductory information on Machine Learning (ML) with a few examples.

Web scraping involves studying a website, identifying collectible data elements, and planning and writing a script to extract the data and collect it in datasets or files. The collected data is then cleaned and processed further to generate information or valuable insights. ML is a branch of Artificial Intelligence (AI) that generally relies on statistical and mathematical processes. It is used to develop, train, and evaluate algorithms that can be automated, keep learning from their outputs, and minimize human intervention.

ML uses data to learn, predict, classify, and test situations, among many other functions. Data is collected using web scraping techniques, so there is a correlation between...

Technical requirements

A web browser (Google Chrome or Mozilla Firefox) is required, and we will be using JupyterLab to run the Python code.

Please refer to the Setting things up and Creating a virtual environment sections of Chapter 2 to continue with setting up and using the environment created.

The Python libraries that are required for this chapter are as follows:

The code files for this chapter are available online in this book’s GitHub repository: https://github.com/PacktPublishing/Hands-On-Web-Scraping-with-Python-Second-Edition/tree/main/Chapter11.

Introduction to ML

Collecting, analyzing, and mining data to extract information are major goals of many data-related systems. These tasks require processing time, evaluation, and interpretation before they reach the desired state. Using ML, a system can be trained on relevant or sample data and then used to evaluate and interpret other data or datasets to produce the final output.
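
The idea of training on sample data and then evaluating on other data can be sketched with scikit-learn as follows. This is only a minimal illustration, not the chapter's own example: the dataset is synthetic (generated with make_classification as a stand-in for scraped, cleaned data), and the choice of LogisticRegression is an assumption made for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a scraped and cleaned dataset (assumption for illustration).
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=42
)

# Hold back 25% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Train on the sample data.
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out data.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy on held-out data: {accuracy:.2f}")
```

Keeping a held-out portion of the data is what lets the accuracy figure say something about data the model has not seen before.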

ML-based processing is implemented in a similar way to, and can be compared with, data mining and predictive modeling. One example is classifying the emails in an inbox as spam or not spam. Spam detection is a kind of decision-making that classifies emails according to their content: a spam-detecting system or algorithm is trained on input datasets and can then distinguish spam emails from legitimate ones.
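
As a rough sketch of the spam example, the following trains a small text classifier with scikit-learn. The six emails and their labels are invented for illustration, and the combination of CountVectorizer (word counts) with MultinomialNB (naive Bayes) is one common choice, assumed here rather than taken from the chapter.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hand-made training set (hypothetical data, for illustration only).
emails = [
    "Win a free prize now",
    "Claim your free cash reward today",
    "Cheap loans, click now to win",
    "Meeting agenda for Monday",
    "Please review the attached project report",
    "Lunch with the team on Friday",
]
labels = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]

# Turn each email into word counts, then fit a naive Bayes classifier on them.
spam_detector = make_pipeline(CountVectorizer(), MultinomialNB())
spam_detector.fit(emails, labels)

# Classify emails the model has not seen before.
new_emails = ["Free cash prize, click now", "Project meeting on Monday"]
predicted = spam_detector.predict(new_emails)
for text, label in zip(new_emails, predicted):
    print(f"{label!r}: {text}")
```

A realistic detector would need far more training data than this; the point is only that the decision is learned from the content of labeled examples.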

ML predictions and decision-making models are dependent on data. ML models can be built on top of, and also use, several algorithms, which allows the system to provide...

ML using scikit-learn

To develop a model, we need datasets. Web scraping is, again, a well-suited technique for collecting the desired data and storing it in the relevant format. There are plenty of ML-related libraries and frameworks available in Python, and their number is growing. scikit-learn is a Python library that covers the majority of supervised ML tasks.

scikit-learn is also known as sklearn, which is the name used when importing it. It is built upon numpy, scipy, and matplotlib. The library provides a large number of features for ML tasks such as classification, clustering, regression, and preprocessing. We will explore beginner and intermediate concepts of supervised learning, focusing on regression, using scikit-learn. You can also explore the sklearn user guide available at https://scikit-learn.org/stable/user_guide.html.
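
A supervised regression workflow in scikit-learn might look like the sketch below. The data is generated, not scraped: the area and price columns are hypothetical, with price built as a known linear function of area plus random noise so that the fitted coefficients can be checked against it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: price = 1000 * area + 5000, plus random noise.
rng = np.random.default_rng(0)
area = rng.uniform(30, 150, size=100)
price = 1000 * area + 5000 + rng.normal(0, 2000, size=100)

# scikit-learn expects a 2D feature matrix of shape (n_samples, n_features).
X = area.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

# Fit a straight line to the training rows.
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Score the fit on the held-out rows.
r2 = r2_score(y_test, regressor.predict(X_test))
print(f"slope: {regressor.coef_[0]:.1f}")
print(f"intercept: {regressor.intercept_:.1f}")
print(f"R^2 on test data: {r2:.3f}")
```

The fitted slope should land close to the 1000 used to generate the data, which is a quick sanity check that the model has recovered the underlying relationship.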

We have covered a lot of information about regression in previous sections of this chapter. Regression is a supervised learning technique that is...

Summary

Python plays a major role in AI- and ML-related domains, and in this chapter we have had only a glimpse of that. Quality data plays a very important role in ML. Whether data is collected via web scraping and stored, or scraped data is provided on the fly to an ML model, prepared data is in demand. The better and more precise the data we provide to ML algorithms and plotting tools, the more accurate the results, visualizations, and descriptive plots we can expect.

We have now explored ML concepts and several of their practical aspects. We have also learned how to implement ML models and collect the results, if required, from various processes. To summarize, we now have an overview of how to use scikit-learn and conduct sentiment analysis. ML is data-driven, and quality data is a basic requirement for ML models to be accurate.

In the next chapter, we will learn about a few further steps...
