Hands-On Web Scraping with Python - Second Edition

Product type: Book
Published: Oct 2023
Publisher: Packt
ISBN-13: 9781837636211
Pages: 324
Edition: 2nd
Author: Anish Chapagain

Table of Contents (20 chapters)

Preface
1. Part 1: Python and Web Scraping
2. Chapter 1: Web Scraping Fundamentals
3. Chapter 2: Python Programming for Data and Web
4. Part 2: Beginning Web Scraping
5. Chapter 3: Searching and Processing Web Documents
6. Chapter 4: Scraping Using PyQuery, a jQuery-Like Library for Python
7. Chapter 5: Scraping the Web with Scrapy and Beautiful Soup
8. Part 3: Advanced Scraping Concepts
9. Chapter 6: Working with the Secure Web
10. Chapter 7: Data Extraction Using Web APIs
11. Chapter 8: Using Selenium to Scrape the Web
12. Chapter 9: Using Regular Expressions and PDFs
13. Part 4: Advanced Data-Related Concepts
14. Chapter 10: Data Mining, Analysis, and Visualization
15. Chapter 11: Machine Learning and Web Scraping
16. Part 5: Conclusion
17. Chapter 12: After Scraping – Next Steps and Data Analysis
18. Index
19. Other Books You May Enjoy

Machine Learning and Web Scraping

So far, we have learned about data extraction, data storage, and acquiring and analyzing information from data by using a number of Python libraries. This chapter will provide you with introductory information on Machine Learning (ML) with a few examples.

Web scraping involves studying a website, identifying collectible data elements, and planning and writing a script that extracts the data and stores it in datasets or files. The collected data is then cleaned and processed further to generate information or valuable insights. ML is a branch of Artificial Intelligence (AI) and generally deals with statistical and mathematical processes. ML is used to develop, train, and evaluate algorithms that can be automated, keep learning from their outputs, and minimize human intervention.

ML uses data to learn, predict, classify, and test situations, and for many other functions. Data is collected using web scraping techniques, so there is a correlation between...

Technical requirements

A web browser (Google Chrome or Mozilla Firefox) will be required and we will be using JupyterLab for Python code.

Please refer to the Setting things up and Creating a virtual environment sections of Chapter 2 to continue with setting up and using the environment created.

The Python libraries that are required for this chapter are as follows:

The code files for this chapter are available online in this book’s GitHub repository: https://github.com/PacktPublishing/Hands-On-Web-Scraping-with-Python-Second-Edition/tree/main/Chapter11.

Introduction to ML

Data collection, analysis, and the mining of data to extract information are central tasks of many data-related systems. Processing, analyzing, and mining data all require processing time, evaluation, and interpretation to reach the desired state. Using ML, a system can be trained on relevant or sample data, and the trained system can then be used to evaluate and interpret other data or datasets to produce the final output.
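The idea of training on sample data and then evaluating on other data can be sketched as follows. This is only an illustration, not the chapter's own code: the built-in Iris dataset, the k-nearest neighbors classifier, and the 75/25 split are all assumptions chosen to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled Iris dataset stands in for scraped, cleaned data.
X, y = load_iris(return_X_y=True)

# Hold back 25% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train on the sample data only.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Evaluate on the held-back data to estimate how the model handles new inputs.
accuracy = clf.score(X_test, y_test)
print(f"Accuracy on unseen data: {accuracy:.2f}")
```

Keeping the evaluation rows separate from the training rows is what makes the reported accuracy a fair estimate for data the model has not seen.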

ML-based processing is comparable to data mining and predictive modeling. A common example is classifying the emails in an inbox as spam or not spam. Spam detection is a decision-making task that classifies emails according to their content: a spam-detecting algorithm is trained on input datasets and can then label new emails as spam or not.
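A minimal sketch of such a spam classifier, using scikit-learn, might look like the following. The six training emails and their labels are hypothetical toy data, and the choice of a bag-of-words representation with a naive Bayes classifier is an assumption made for illustration; a real system would need far more data and evaluation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: email text paired with a label.
train_texts = [
    "win a free prize now",
    "free money claim your prize",
    "cheap loans click now",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch with the project team tomorrow",
]
train_labels = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]

# CountVectorizer turns text into word counts; MultinomialNB learns
# which words are more typical of each label.
spam_model = make_pipeline(CountVectorizer(), MultinomialNB())
spam_model.fit(train_texts, train_labels)

# Classify emails the model has not seen before.
new_emails = ["claim your free prize", "agenda for the team meeting"]
predictions = spam_model.predict(new_emails)

for email, label in zip(new_emails, predictions):
    print(f"{email!r} -> {label}")
```

The pipeline keeps vectorization and classification together, so the same word-count transformation is applied to both the training emails and the new ones.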

ML predictions and decision-making models are dependent on data. ML models can be built on top of, and also use, several algorithms, which allows the system to provide...

ML using scikit-learn

To develop a model, we need datasets. Web scraping is again the perfect technique to collect the desired data and store it in the relevant format. There are plenty of ML-related libraries and frameworks available in Python, and their number is growing. scikit-learn is a Python library that covers the majority of supervised ML features.

scikit-learn is imported in code as sklearn and is often referred to by that name. It is built upon NumPy, SciPy, and Matplotlib. The library provides a large number of ML features, such as classification, clustering, regression, and preprocessing. We will explore beginner and intermediate concepts of supervised learning, focusing on regression, using scikit-learn. You can also explore the sklearn user guide available at https://scikit-learn.org/stable/user_guide.html.

We have covered a lot of information about regression in previous sections of this chapter. Regression is a supervised learning technique that is...
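As a rough illustration of regression with scikit-learn, the sketch below fits a linear model and checks it on held-out data. It is not the book's example: the data is synthetic (generated from an assumed line y = 3x + 5 with random noise), and the variable names are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic data drawn around the line y = 3x + 5 with noise,
# standing in for a numeric dataset collected by scraping.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1.0, size=100)

# Keep 20% of the rows aside for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the model: it learns a slope (coef_) and an intercept (intercept_).
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict on the held-out rows and score the fit with R squared.
y_pred = reg.predict(X_test)
r2 = r2_score(y_test, y_pred)

print(f"slope={reg.coef_[0]:.2f}, intercept={reg.intercept_:.2f}, R^2={r2:.3f}")
```

Because the data was generated from a known line, the learned slope and intercept should land close to 3 and 5, which makes it easy to sanity-check the fit.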

Summary

Python plays a major role in AI- and ML-related domains, and this chapter has offered only a glimpse of that. Quality data is essential to ML. Whether data is collected via web scraping and stored, or scraped data is provided on the fly to an ML model, prepared data is in demand. The higher the quality and precision of the data we provide to ML algorithms and plotting tools, the more accurate the results, visualizations, and descriptive plots we can expect.

We have now explored ML concepts and several aspects of ML. We have also learned how to implement ML models and collect their results where required. To summarize, we now have an overview of how to use scikit-learn and conduct sentiment analysis. ML is data-driven, and quality data is a basic requirement for ML models to be accurate.

In the next chapter, we will learn about a few further steps...
