You're reading from Hands-On Web Scraping with Python - Second Edition

Product type: Book

Published in: Oct 2023

Publisher: Packt

ISBN-13: 9781837636211

Edition: 2nd Edition

Concepts

Web Programming

Author (1)

Anish Chapagain

Machine Learning and Web Scraping

So far, we have learned about data extraction, data storage, and acquiring and analyzing information from data by using a number of Python libraries. This chapter introduces Machine Learning (ML) with a few examples.

Web scraping involves studying a website, identifying collectible data elements, and planning and writing a script to extract the data and collect it in datasets or files. The collected data is then cleaned and processed further to generate information or valuable insights. ML is a branch of Artificial Intelligence (AI) that generally deals with statistical and mathematical processes. It is used to develop, train, and evaluate algorithms that can be automated, keep learning from their outputs, and minimize human intervention.

ML uses data to learn, predict, classify, and test situations, and for many other functions. Data is collected using web scraping techniques, so there is a correlation between...

Technical requirements

A web browser (Google Chrome or Mozilla Firefox) is required, and we will use JupyterLab to run the Python code.

Refer to the "Setting things up" and "Creating a virtual environment" sections of Chapter 2 to set up the environment and continue using it in this chapter.

The Python libraries that are required for this chapter are as follows:

scikit-learn (visit https://scikit-learn.org/stable/install.html for installation)
textblob (visit https://textblob.readthedocs.io/en/dev/install.html for installation)
vaderSentiment
plotly
numpy
pandas
matplotlib
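
One quick way to confirm that the environment contains these libraries is to look each one up by its import name. The following is a minimal sketch and is not taken from the book's code files: the helper name `find_missing` is hypothetical, and the import names (for example, `sklearn` for scikit-learn) are assumed to match the installed package releases.

```python
from importlib.util import find_spec

# Import names for the libraries listed above
# (scikit-learn is imported as sklearn).
REQUIRED = ["sklearn", "textblob", "vaderSentiment", "plotly",
            "numpy", "pandas", "matplotlib"]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are available.")
```

Any name reported as missing can then be installed into the virtual environment before continuing.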

The code files for this chapter are available online in this book’s GitHub repository: https://github.com/PacktPublishing/Hands-On-Web-Scraping-with-Python-Second-Edition/tree/main/Chapter11.

Introduction to ML

Collecting, analyzing, and mining data to extract information are central goals of many data-related systems. Processing, analysis, and mining each require time, evaluation, and interpretation to reach the desired state. With ML, a system can be trained on relevant or sample data and then used to evaluate and interpret other data or datasets to produce the final output.

ML-based processing is implemented similarly to, and can be compared with, data mining and predictive modeling. A common example is classifying the emails in an inbox as spam or not spam. Spam detection is a kind of decision-making that classifies emails according to their content: a spam-detecting algorithm is trained on input datasets and can then distinguish spam from legitimate email.
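
The spam example can be sketched in a few lines. This is an illustration under stated assumptions rather than the book's own code: the six training emails and their labels are made up, and the choice of a naive Bayes classifier over word counts (scikit-learn's `CountVectorizer` and `MultinomialNB`) is just one simple option among many.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (illustrative only, not real email data).
emails = [
    "win a free prize now",
    "free money click now",
    "claim your free prize today",
    "meeting agenda for monday",
    "lunch tomorrow with the team",
    "project report attached for review",
]
labels = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]

# Turn each email into word counts, then fit a naive Bayes classifier on them.
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)

# Classify emails that the model has not seen before.
new_emails = ["free prize waiting, click now", "agenda for the team meeting"]
predictions = spam_filter.predict(new_emails)
for text, label in zip(new_emails, predictions):
    print(f"{label!r}: {text}")
```

A real spam filter would need far more training data than this; the point here is only the train-then-classify pattern described above.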

ML predictions and decision-making models are dependent on data. ML models can be built on top of, and also use, several algorithms, which allows the system to provide...

ML using scikit-learn

To develop a model, we need datasets. Web scraping is again a well-suited technique for collecting the desired data and storing it in a relevant format. Python has plenty of ML-related libraries and frameworks, and their number is growing. scikit-learn is a Python library that covers the majority of supervised ML features.

scikit-learn is also known as sklearn, which is the name used to import it. It is built upon numpy, scipy, and matplotlib. The library provides a large number of ML features, such as classification, clustering, regression, and preprocessing. We will explore beginner and intermediate concepts of supervised learning, focusing on regression, using scikit-learn. You can also explore the sklearn user guide available at https://scikit-learn.org/stable/user_guide.html.
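
As a first taste of regression with scikit-learn, the following minimal sketch fits a straight line to a handful of points. The data is made up for illustration (it follows y = 2x + 1 exactly) and does not come from the book's examples; `LinearRegression` is used here only as the simplest regression estimator the library offers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x + 1 exactly (illustrative only).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # one feature per row
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

model = LinearRegression()
model.fit(X, y)

# The fitted slope and intercept should recover the rule used to build the data.
print("slope:", round(float(model.coef_[0]), 3))
print("intercept:", round(float(model.intercept_), 3))

# Predict the target for an input that was not in the training data.
prediction = model.predict(np.array([[6.0]]))
print("prediction for x=6:", round(float(prediction[0]), 3))
```

With scraped data, `X` and `y` would typically come from columns of a pandas DataFrame instead of hand-written arrays.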

We have covered a lot of information about regression in previous sections of this chapter. Regression is a supervised learning technique that is...

Summary

Python plays a major role in AI- and ML-related domains, and this chapter has offered only a glimpse of that. Quality data is essential to ML. Whether data is collected via web scraping and stored, or scraped and passed to an ML model on the fly, it needs to be prepared first. The higher the quality and precision of the data we provide to ML algorithms and to charts, the more accurate the results, visualizations, and descriptive plots we can expect.

We have now explored ML concepts and several aspects of ML. We have also learned how to implement ML models and, where required, collect results from the various processes involved. To summarize, we now have an overview of how to use scikit-learn and how to conduct sentiment analysis. ML is data-driven, and quality data is a basic requirement for accurate ML models.

In the next chapter, we will learn about a few further steps...

Anish Chapagain is a software engineer with a passion for data science, its processes, and Python programming, which began around 2007. He has been working on web scraping and analysis-related tasks for more than 5 years, and is currently pursuing freelance projects in the web scraping domain. Anish previously worked as a trainer, a web/software developer, and a banker, roles in which he was exposed to data and gained further insights into topics including data analysis, visualization, data mining, information processing, and knowledge discovery. He has an MSc in computer systems from Bangor University (University of Wales), United Kingdom, and an Executive MBA from Himalayan Whitehouse International College, Kathmandu, Nepal.

