Hands-On Web Scraping with Python - Second Edition

By Anish Chapagain
About this book
Web scraping is a powerful tool for extracting data from the web, but it can be daunting for those without a technical background. Designed for novices, this book will help you grasp the fundamentals of web scraping and Python programming, even if you have no prior experience. Adopting a practical, hands-on approach, this updated edition of Hands-On Web Scraping with Python uses real-world examples and exercises to explain key concepts. Starting with an introduction to web scraping fundamentals and Python programming, you’ll cover a range of scraping techniques, including requests, lxml, pyquery, Scrapy, and Beautiful Soup. You’ll also get to grips with advanced topics such as secure web handling, web APIs, Selenium for web scraping, PDF extraction, regex, data analysis, EDA reports, visualization, and machine learning. This book emphasizes the importance of learning by doing. Each chapter integrates examples that demonstrate practical techniques and related skills. By the end of this book, you’ll be equipped with the skills to extract data from websites, a solid understanding of web scraping and Python programming, and the confidence to use these skills in your projects for analysis, visualization, and information discovery.
Publication date:
October 2023
Publisher
Packt
Pages
324
ISBN
9781837636211

 

Web Scraping Fundamentals

This book about web scraping covers practical concepts with detailed explanations and example code. We will introduce you to the essential topics in extracting or scraping data (that is, high-quality data) from websites, using effective techniques from the web and the Python programming language.

In this chapter, we are going to understand basic concepts related to web scraping. Whether or not you have any prior experience in this domain, you will easily be able to proceed with this chapter.

In our context, the web or a website refers to pages or documents that include text, images, style sheets, scripts, and video content, built using a markup language such as HTML. A web page is essentially a container for various kinds of content.

The following are a couple of common queries in this context:

  • Why web scraping?
  • What is it used for?

Most of us will have come across the concept of data and the benefits or usage of data in deriving information, decision-making, gaining insights from facts, or even knowledge discovery. There has been growing demand for data, or high-quality data, in most industries globally (such as governance, medical sciences, artificial intelligence, agriculture, business, sport, and R&D).

We will learn what exactly web scraping is, explore the techniques and technologies it is associated with, and find and extract data from the web, with the help of the Python programming language, in the chapters ahead.

In this chapter, we are going to cover the following main topics:

  • What is web scraping?
  • Understanding the latest web technologies
  • Data-finding techniques
 

Technical requirements

You can use any Operating System (OS) (such as Windows, Linux, or macOS) along with an up-to-date web browser (such as Google Chrome or Mozilla Firefox) installed on your PC or laptop.

 

What is web scraping?

Scraping is a process of extracting, copying, screening, or collecting data. Scraping or extracting data from the web (a buildup of websites, web pages, and internet-related resources) for certain requirements is normally called web scraping. Data collection and analysis are crucial in information gathering, decision-making, and research-related activities. However, as data can be easily manipulated, web scraping should be carried out with caution.

The popularity of the internet and web-based resources is causing information domains to evolve every day, which is also leading to growing demand for raw data. Data is a basic requirement in the fields of science, technology, and management. Collected or organized data is processed, analyzed, compared with historical data, and used to train Machine Learning (ML) models with various algorithms and logic to obtain estimations and information and gain further knowledge.

Web scraping provides the tools and techniques to collect data from websites, fit for either personal or business-related needs, but with legal considerations.

As seen in Figure 1.1, we obtain data from various websites based on our needs, write/execute crawlers, collect necessary content, and store it. On top of this collected data, we do certain analyses and come up with some information related to decision-making.

Figure 1.1: Web scraping – storing web content as data

We will explore more about scraping and the analysis of data in later chapters.

There are some legal factors that are also to be considered before performing scraping tasks. Most websites contain pages such as Privacy Policy, About Us, and Terms and Conditions, where information on legal action and prohibited content, as well as general information, is available. It is a developer’s ethical duty to comply with these terms and conditions before planning any scraping activities on a website.

Important note

Scraping, web scraping, and crawling are terms that are generally used interchangeably in both the industry and this book. However, they have slightly different meanings. Crawling, also known as spidering, is a process used to browse through the links on websites and is often used by search engines for indexing purposes, whereas scraping is mostly related to content extraction from websites.

You now have a basic understanding of web scraping. We will try to explore and understand the latest web-based technologies that are extremely helpful in web scraping in the upcoming section.

 

Understanding the latest web technologies

A web page is not only a document or container of content. The rapid development in computing and web-related technologies today has transformed the web, with more security features being implemented and the web becoming a dynamic, real-time source of information. Many scraping communities gather historic data; some analyze hourly data or the latest obtained data.

At our end, we (users) use web browsers (such as Google Chrome, Mozilla Firefox, and Safari) as an application to access information from the web. Web browsers provide various document-based functionalities to users and contain application-level features that are often useful to web developers.

Web pages that users view or explore through their browsers are not just single documents. Various technologies exist that can be used to develop websites or web pages. A web page is a document that contains blocks of HTML tags. Most of the time, it is built with various sub-blocks linked as dependent or independent components from various interlinked technologies, including JavaScript and Cascading Style Sheets (CSS).

An understanding of the general concepts of web pages and the techniques of web development, along with the technologies found inside web pages, will provide more flexibility and control in the scraping process. A lot of the time, a developer can also employ reverse-engineering techniques.

Reverse engineering is an activity that involves breaking down and examining the concepts that were required to build certain products. For more information on reverse engineering, please refer to the GlobalSpec article How Does Reverse Engineering Work?, available at https://insights.globalspec.com/article/7367/how-does-reverse-engineering-work.

Here, we will introduce and explore a few of the available web technologies that can help and guide us in the process of data extraction.

HTTP

Hypertext Transfer Protocol (HTTP) is an application protocol that transfers resources (web-based), such as HTML documents, between a client and a web server. HTTP is a stateless protocol that follows the client-server model. Clients (web browsers) and web servers communicate or exchange information using HTTP requests and HTTP responses, as seen in Figure 1.2:

Figure 1.2: HTTP (client and server or request-response communication)

Requests and responses are cyclic in nature – they are like questions and answers from clients to the server, and vice versa.

Hypertext Transfer Protocol Secure (HTTPS) is the encrypted, more secure version of HTTP. It uses Transport Layer Security (TLS) (learn more about TLS at https://developer.mozilla.org/en-US/docs/Glossary/TLS), the successor to the now-deprecated Secure Sockets Layer (SSL) (learn more about SSL at https://developer.mozilla.org/en-US/docs/Glossary/SSL), to communicate encrypted content between a client and a server. This type of security allows clients to exchange sensitive data with a server in a safe manner. Activities such as banking, online shopping, and e-payment gateways use HTTPS to keep sensitive data safe and prevent it from being exposed.

Important note

An HTTP request URL begins with http://, for example, http://www.packtpub.com, and an HTTPS request URL begins with https://, such as https://www.packtpub.com.

You have now learned a bit about HTTP. In the next section, you will learn about HTTP requests (or HTTP request methods).

HTTP requests (or HTTP request methods)

Web browsers or clients submit their requests to the server. Requests are forwarded to the server using various methods (commonly known as HTTP request methods), such as GET and POST:

  • GET: This is the most common method for requesting information. It is considered a safe method as the resource state is not altered here. Also, it is used to provide query strings, such as https://www.google.com/search?q=world%20cup%20football&source=hp, which is requesting information from Google based on the q (world cup football) and source (hp) parameters sent with the request. Information or queries (q and source in this example) with values are displayed in the URL.
  • POST: This is used to submit data to the server. The requested resource state can be altered. Data posted or sent to the requested URL is not visible in the URL but is instead transferred in the request body. It is commonly used for logins and user registrations. Note that POST alone does not encrypt the data; the body is only protected in transit when the request is sent over HTTPS.
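
The difference between where GET and POST carry their data can be sketched with Python's standard library. This is a minimal illustration, not code from a later chapter: the Google URL comes from the list above, whereas the login URL and the form field names are hypothetical. No request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# GET: the parameters travel in the query string of the URL itself.
url = "https://www.google.com/search?q=world%20cup%20football&source=hp"
params = parse_qs(urlsplit(url).query)
print(params)  # {'q': ['world cup football'], 'source': ['hp']}

# POST: the parameters travel in the request body, not in the URL.
# The URL and field names below are hypothetical placeholders.
body = urlencode({"username": "demo", "password": "secret"}).encode("utf-8")
post_request = Request("https://example.com/login", data=body)

print(post_request.get_method())  # POST (chosen because data was supplied)
print(post_request.full_url)      # the URL contains no form values
print(post_request.data)          # the encoded form values sit in the body
```

Because the body is plain text unless the connection itself is encrypted, form data sent with POST should still go over HTTPS.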

We will explore more about HTTP methods in the Implementing HTTP methods section of Chapter 2.

There are two main parts to HTTP communication, as seen in Figure 1.2. With a basic idea about HTTP requests, let’s explore HTTP responses in the next section.

HTTP responses

The server processes the requests, and sometimes also the specified HTTP headers. When requests are received and processed, the server returns its response to the browser. Most of the time, responses are in HTML format, but they can also be JavaScript or other document types, or data in JavaScript Object Notation (JSON) or other formats.
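
Since a response body may be HTML, JSON, or something else, a scraper usually checks the Content-Type response header before deciding how to handle the body. The following is a small sketch under that assumption; decode_body is a hypothetical helper, and the sample bodies are made up for illustration.

```python
import json


def decode_body(content_type: str, body: str):
    """Return parsed JSON for JSON responses, otherwise the raw text."""
    # Drop parameters such as "; charset=utf-8" and normalize the case.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return json.loads(body)
    return body


# Made-up response bodies paired with typical Content-Type header values.
html_result = decode_body("text/html; charset=utf-8", "<p>Hello</p>")
json_result = decode_body("application/json", '{"team": "Finland", "rank": 1}')

print(type(html_result).__name__, html_result)
print(type(json_result).__name__, json_result)
```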

Each response carries a status code, which can be inspected using Developer Tools (DevTools). The following list contains a few status codes along with some brief information about what they mean:

  • 200: OK, request succeeded
  • 404: Not found, requested resource cannot be found
  • 500: Internal server error
  • 204: No content to be sent
  • 401: Unauthorized request was made to the server

There are also some groups of responses that can be identified from a range of HTTP response statuses:

  • 100–199: Informational responses
  • 200–299: Successful responses
  • 300–399: Redirection responses
  • 400–499: Client error
  • 500–599: Server error
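
The grouping above can be expressed in a few lines of Python. This is only a sketch: describe_status is a hypothetical helper, while HTTPStatus is the enumeration of registered status codes from Python's standard http module.

```python
from http import HTTPStatus

# Response groups keyed by the first digit of the status code.
GROUPS = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}


def describe_status(code: int) -> str:
    """Return the response group and the standard phrase for a status code."""
    group = GROUPS.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered HTTP status codes.
        phrase = "Unrecognized status code"
    return f"{code}: {group} ({phrase})"


for code in (200, 204, 401, 404, 500):
    print(describe_status(code))
```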

Important note

For more information on cookies, HTTP, HTTP responses, and status codes, please consult the official documentation at https://www.w3.org/Protocols/ and https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

Now that we have a basic idea about HTTP responses and requests, let us explore HTTP cookies (one of the most important factors in web scraping).

HTTP cookies

HTTP cookies are data sent by the server to the browser. This data is generated and stored by websites on your system or computer. It helps to identify HTTP requests from the user to the website. Cookies contain information regarding session management, user preferences, and user behavior.

The server identifies and communicates with the browser based on the information stored in the cookies. Data stored in cookies helps a website to access and transfer certain saved values, such as the session ID and expiration date and time, providing a quick interaction between the web request and response.
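
To make the saved values more concrete, the following sketch parses a cookie with the standard http.cookies module. The header value is made up for illustration; a real server would send something similar in a Set-Cookie response header.

```python
from http.cookies import SimpleCookie

# A made-up Set-Cookie header value carrying a session ID and an expiry date.
header = "sessionid=abc123; Expires=Wed, 01 Jan 2025 00:00:00 GMT; Path=/; HttpOnly"

cookie = SimpleCookie()
cookie.load(header)

morsel = cookie["sessionid"]
print(morsel.value)       # the session ID stored by the cookie
print(morsel["expires"])  # the expiration date and time
print(morsel["path"])     # the URL path the cookie applies to
```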

Figure 1.3 displays the list of request cookies from https://www.fifa.com/fifaplus/en, collected using Chrome DevTools:

Figure 1.3: Request cookies

We will explore and collect more information about and from browser-based DevTools in the upcoming sections and Chapter 3.

Important note

For more information about cookies, please visit About Cookies at http://www.aboutcookies.org/ and All About Cookies at http://www.allaboutcookies.org/.

Similar to the role of cookies, HTTP proxies are also quite important in scraping. We will explore more about proxies in the next section, and also in some later chapters.

HTTP proxies

A proxy server acts as an intermediate server between a client and the main web server. The web browser sends requests to the server that are actually passed through the proxy, and the proxy returns the response from the server to the client.

Proxies are often used for monitoring/filtering, performance improvement, translation, and security for internet-related resources. Proxies can also be bought as a service, which may also be used to deal with cross-domain resources. There are also various forms of proxy implementation, such as web proxies (which can be used to bypass IP blocking), CGI proxies, and DNS proxies.

You can buy proxies from, or sign a contract with, a proxy seller or a similar organization. They will provide you with various types of proxies according to the country in which you are operating. Proxy switching during crawling is done frequently, and a proxy allows us to bypass restricted content too. Normally, if a request is routed through a proxy, our IP address is not revealed to the target website, as the receiver will just see the third-party proxy in their details or server logs. You can even access sites that aren’t available in your location (that is, you see an "access denied in your country" message) by switching to a different proxy.
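
Routing requests through a proxy can be sketched with Python's standard urllib.request module. The proxy address below is a hypothetical placeholder; in practice, it would be supplied by your proxy provider. The code only prepares an opener and does not send any request.

```python
from urllib.request import ProxyHandler, build_opener

# Hypothetical proxy address; replace it with one from your proxy provider.
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

proxy_handler = ProxyHandler(proxies)
opener = build_opener(proxy_handler)

# Calling opener.open(url) would now route the request through the proxy,
# so the target server would log the proxy's address instead of ours.
print(proxy_handler.proxies)
```

Switching proxies during a crawl then amounts to building a new opener with a different address.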

Cookie-based parameters that are passed in using HTTP GET requests, HTML form-related HTTP POST requests, and modifying or adapting headers will be crucial in managing code (that is, scripts) and accessing content during the web scraping process.

Important note

Details on HTTP, headers, cookies, and so on will be explored more in an upcoming section, Data-finding techniques used in web pages. Please visit the HTTP page in the MDN web docs (https://developer.mozilla.org/en-US/docs/Web/HTTP) for more detailed information on HTTP and related concepts. Please visit https://www.softwaretestinghelp.com/best-proxy-server/ for information on the best proxy server.

You now understand general concepts regarding HTTP (including requests, responses, cookies, and proxies). Next, we will understand the technology that is used to create web content or make content available in some predefined formats.

HTML

Websites are made up of pages or documents containing text, images, style sheets, and scripts, among other things. They are often built with markup languages such as Hypertext Markup Language (HTML) and Extensible Hypertext Markup Language (XHTML).

HTML is often referred to as the standard markup language used for building a web page. Since the early 1990s, HTML has been used independently as well as in conjunction with server-based scripting languages, such as PHP, ASP, and JSP. XHTML is a stricter reformulation of HTML, the primary markup language for web documents, written as an application of Extensible Markup Language (XML).

HTML defines and contains the content of a web page. Data that can be extracted, and any information-revealing data sources, can be found inside HTML pages within a predefined instruction set or markup elements called tags. HTML tags are normally a named placeholder carrying certain predefined attributes, for example, <a>, <b>, <table>, <img>, and <script>.

HTML is a container or type of markup language. Various factors are involved in building HTML; the next section defines these factors with some examples.

HTML elements and attributes

HTML elements (also referred to as document nodes) are the building blocks of web documents. HTML elements are built with a start tag, <..>, and an end tag, </..>, with certain content inside them. An HTML element can also contain attributes, usually defined as attribute-name = attribute-value, which provide additional information to the element:

<p>normal paragraph tags</p>
<h1>heading tags there are also h2, h3, h4, h5, h6</h1>
<a href="https://www.google.com">Click here for Google.com</a>
<img src="myphoto1.jpg" width="300" height="300" alt="Picture" />
<br />

The preceding code can be broken down as follows:

  • <p> and <h1> are HTML elements containing general text information (element content).
  • <a> is defined with an href attribute that contains the actual link that will be processed when the text Click here for Google.com is clicked. The link refers to https://www.google.com/.
  • The <img> image tag also contains a few attributes, such as src and alt, along with their respective values. src holds the resource, which means the image address or image URL, as a value, whereas alt holds the value for alternative text (mostly displayed when there is a slow connection or the image is not able to load) for <img>.
  • <br/> represents a line break in HTML and has no attributes or text content. It is used to insert a new line in the layout of the document.
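
As a first taste of how these elements and attributes look to a program, the following sketch reads them with Python's built-in html.parser module. The AttributeCollector class is a hypothetical helper written for this illustration; it is not one of the scraping libraries covered later in the book.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.found.append((tag, dict(attrs)))


html = """<p>normal paragraph tags</p>
<a href="https://www.google.com">Click here for Google.com</a>
<img src="myphoto1.jpg" width="300" height="300" alt="Picture" />
<br />"""

collector = AttributeCollector()
collector.feed(html)

for tag, attrs in collector.found:
    print(tag, attrs)
```

Elements without attributes, such as <p> and <br />, are reported with an empty dictionary.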

HTML elements can also be nested in a tree-like structure with a parent-child hierarchy, as follows:

<div class="article">
  <p id="mainContent" class="content">
    <b>Paragraph Content</b>
      <img src="mylogo.png" id="pageLogo" alt="Logo"
        class="logo"/>
  </p>
  <p>
    <h3> Paragraph Title: Web Scraping</h3>
  </p>
</div>

As seen in the preceding code, two <p> child elements are found inside an HTML <div> block. The first <p> carries id and class attributes, and both contain child elements as their content. Normally, HTML documents are built with the aforementioned structure.

As seen in the preceding code block in the last example, there are a few extra key-value pairs. The next section explores this.

Global attributes

HTML elements can contain some additional information, such as key-value pairs. These are also known as HTML element attributes. Attributes hold values and provide identification, or contain additional information that can be helpful in many aspects during scraping activities, such as identifying exact web elements and extracting values or text from them and traversing (moving along) elements.

There are certain attributes that are common to HTML elements or can be applied to all HTML elements. The following list mentions some of the attributes that are identified as global attributes (https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes):

  • id: This attribute’s values should be unique to the element they are applied to
  • class: This attribute’s values are mostly used with CSS to apply the same formatting to a group of elements, and can be shared by multiple elements
  • style: This specifies inline CSS styles for an element
  • lang: This helps to identify the language of the text

Important note

The id and class attributes are mostly used to identify or format individual elements or groups of them. These attributes can also be managed by CSS and scripting languages. They can be targeted by placing # and ., respectively, in front of the attribute value (for example, #mainContent or .content) when used with CSS, or while traversing and applying parsing techniques.
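
The # and . prefixes can be illustrated with a very small matcher built on Python's html.parser module. This is a simplified sketch, not a full CSS selector engine: SelectorMatcher and find are hypothetical helpers that only understand #id, .class, and plain tag names.

```python
from html.parser import HTMLParser


class SelectorMatcher(HTMLParser):
    """Collect the tags that match a #id, .class, or tag-name selector."""

    def __init__(self, selector):
        super().__init__()
        self.selector = selector
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.selector.startswith("#"):
            hit = attrs.get("id") == self.selector[1:]
        elif self.selector.startswith("."):
            # A class attribute may hold several space-separated names.
            hit = self.selector[1:] in (attrs.get("class") or "").split()
        else:
            hit = tag == self.selector
        if hit:
            self.matches.append(tag)


def find(selector, html):
    matcher = SelectorMatcher(selector)
    matcher.feed(html)
    return matcher.matches


html = """<div class="article">
  <p id="mainContent" class="content">
    <b>Paragraph Content</b>
    <img src="mylogo.png" id="pageLogo" alt="Logo" class="logo" />
  </p>
</div>"""

print(find("#mainContent", html))  # ['p']
print(find(".logo", html))         # ['img']
print(find(".article", html))      # ['div']
```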

HTML element attributes can also be overwritten or implemented dynamically using scripting languages. As displayed in the following example, itemprop attributes are used to add properties to an element, whereas data-* is used to store data that is native to the element itself:

<div itemscope itemtype="http://schema.org/Place">
   <h1 itemprop="university">University of Helsinki</h1>
   <span>Subject: <span itemprop="subject1">Artificial
      Intelligence</span>
   </span><span itemprop="subject2">Data Science</span>
</div>
<img class="dept" src="logo.png" data-course-id="324" data-title="Predictive Analysis" data-x="12345" data-y="54321" data-z="56743" onclick="schedule.load()"/>

HTML tags and attributes are very helpful when extracting data.
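
To see why such attributes are helpful, the following sketch pulls itemprop values and data-* attributes out of a shortened version of the preceding markup. PropertyCollector is a hypothetical helper built on Python's html.parser module, and it assumes that each itemprop element contains plain text only.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collect itemprop text values and data-* attributes."""

    def __init__(self):
        super().__init__()
        self.itemprops = {}
        self.data_attrs = {}
        self._open_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Remember the itemprop name (if any) until its text arrives.
        self._open_prop = attrs.get("itemprop")
        for name, value in attrs.items():
            if name.startswith("data-"):
                self.data_attrs[name] = value

    def handle_endtag(self, tag):
        self._open_prop = None

    def handle_data(self, data):
        if self._open_prop and data.strip():
            self.itemprops[self._open_prop] = data.strip()


markup = """<div itemscope itemtype="http://schema.org/Place">
  <h1 itemprop="university">University of Helsinki</h1>
  <span itemprop="subject2">Data Science</span>
</div>
<img class="dept" src="logo.png" data-course-id="324" data-x="12345" />"""

collector = PropertyCollector()
collector.feed(markup)
print(collector.itemprops)
print(collector.data_attrs)
```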

Important note

Please visit https://www.w3.org or https://www.w3schools.com/html for more detailed information on HTML.

In Chapter 3, we will explore these attributes using different tools. We will also perform various logical operations and use them for extracting or scraping purposes.

We now have some idea about HTML and a few important attributes related to HTML. In the next section, we will learn the basics of XML, the markup language on which formats such as XHTML are built.

XML

XML is a markup language used for distributing data over the internet, with a set of rules for encoding documents so that they are readable by both humans and machines and easily exchangeable between systems. XML files are recognized by the .xml extension.

XML emphasizes the usability of textual data across various formats and systems. XML is designed to carry portable data or data stored in tags that is not predefined with HTML tags. In XML documents, tags are created by the document developer or an automated program to describe the content.

The following code displays some example XML content:

<employees>
  <employee>
    <fullName>Shiba Chapagain</fullName>
    <gender>Female</gender>
  </employee>
  <employee>
    <fullName>Aasira Chapagain</fullName>
    <gender>Female</gender>
  </employee>
</employees>

In the preceding code, the <employees> parent node has two <employee> child nodes, which in turn contain the other child nodes of <fullName> and <gender>.
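
The parent-child structure described above can be walked with Python's built-in xml.etree.ElementTree module. This is a minimal sketch using the same sample document; the variable names are our own.

```python
import xml.etree.ElementTree as ET

xml_text = """<employees>
  <employee>
    <fullName>Shiba Chapagain</fullName>
    <gender>Female</gender>
  </employee>
  <employee>
    <fullName>Aasira Chapagain</fullName>
    <gender>Female</gender>
  </employee>
</employees>"""

root = ET.fromstring(xml_text)

# Turn each <employee> child node into a dictionary of its own child nodes.
employees = [
    {"fullName": node.findtext("fullName"), "gender": node.findtext("gender")}
    for node in root.findall("employee")
]

print(root.tag)
for employee in employees:
    print(employee)
```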

XML is an open standard, using the Unicode character set. XML is used to share data across various platforms and has been adopted by various web applications. Many websites use XML data, implementing its contents with the use of scripting languages and presenting it in HTML or other document formats for the end user to view.

Extraction tasks can also be performed on XML documents to obtain the contents in the desired format, or to filter the contents down to a specific data need. In addition, certain websites make behind-the-scenes data available in XML form.

Important note

Please visit https://www.w3.org/XML/ and https://www.w3schools.com/xml/ for more information on XML.

So far, we have explored technologies for placing and holding content, based on markup languages such as HTML and XML. These technologies are somewhat static in nature. The next section is about JavaScript, which provides dynamism to the web with the help of scripts.

JavaScript

JavaScript (commonly abbreviated as JS) is a programming language used to add behavior to HTML pages and web applications that run in the browser. JavaScript is mostly used for adding dynamic features and user interaction to web pages. JavaScript, HTML, and CSS are among the most-used web technologies, and now they are also used with headless browsers (you can read more about headless browsers at https://oxylabs.io/blog/what-is-headless-browser). The client-side availability of the JavaScript engine has also strengthened its usage in application testing and debugging.

<script> contains programming logic with JavaScript variables, operators, functions, arrays, loops, conditions, and events, targeting the HTML Document Object Model (DOM). JavaScript code can be added to HTML inline using <script>, as seen in the following code, or loaded from an external file:

<!DOCTYPE html>
<html>
<head>
   <script>
      function placeTitle() {
         document.getElementById("innerDiv").innerHTML =
            "Welcome to WebScraping";
      }
   </script>
</head>
<body>
   <div>Press the button: <p id="innerDiv"></p></div>
   <button id="btnTitle" name="btnTitle" type="submit"
      onclick="placeTitle()">
      Load Page Title!
   </button>
</body>
</html>

As seen in the preceding code, the HTML <head> tag contains <script> with the placeTitle() JavaScript function. The function runs when <button> is clicked and changes the content of the <p> element with id="innerDiv" (which is initially empty) to display the text Welcome to WebScraping.
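From a scraping point of view, it helps to see what a program that only reads the raw HTML would find in this page. The following is a minimal sketch using Python's built-in html.parser module. The TextById class is a helper written for this illustration, and it does not handle nested elements with the same tag name.

```python
from html.parser import HTMLParser

# Raw page source, before any JavaScript has run.
PAGE_SOURCE = """<!DOCTYPE html>
<html>
<head>
   <script>
      function placeTitle() {
         document.getElementById("innerDiv").innerHTML =
            "Welcome to WebScraping";
      }
   </script>
</head>
<body>
   <div>Press the button: <p id="innerDiv"></p></div>
   <button id="btnTitle" onclick="placeTitle()">Load Page Title!</button>
</body>
</html>"""


class TextById(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.open_tag = None  # tag name while we are inside the target
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.open_tag = tag

    def handle_endtag(self, tag):
        if tag == self.open_tag:
            self.open_tag = None

    def handle_data(self, data):
        if self.open_tag is not None:
            self.chunks.append(data)


parser = TextById("innerDiv")
parser.feed(PAGE_SOURCE)
inner_text = "".join(parser.chunks).strip()

print(repr(inner_text))                         # text inside <p id="innerDiv">
print("Welcome to WebScraping" in PAGE_SOURCE)  # the text exists only in the script
```

The welcome text appears in the source only as a string inside the script. The <p> element itself stays empty until a browser executes placeTitle(), which is why content added by JavaScript may be missing from a plain HTML download.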

Important note

The HTML DOM is a standard for how to get, change, add, or delete HTML elements. Please visit the page on JavaScript HTML DOM on W3Schools (https://www.w3schools.com/js/js_htmldom.asp) for more detailed information.

The dynamic manipulation of HTML content, elements, attribute values, CSS, and HTML events with accessible internal functions and programming features makes JavaScript very popular in web development. There are many web-based technologies related to JavaScript, including JSON, jQuery, AngularJS, and Asynchronous JavaScript and XML (AJAX), among many more. Some of these will be discussed in the following subsections.

jQuery

jQuery is a JavaScript library for querying and manipulating the DOM. It addresses incompatibilities across browsers and provides API features to handle the HTML DOM, events, and animations. jQuery has been acclaimed globally for bringing interactivity to the web and for changing the way JavaScript code is written. It is lightweight in comparison to full JavaScript frameworks, easy to implement, and encourages short, readable code.

jQuery is a huge topic and requires adequate knowledge of JavaScript before embarking on it. We will use a jQuery-like Python library in Chapter 4.

Important note

For more information on jQuery, please visit https://www.w3schools.com/jquery/ and http://jquery.com/.

jQuery is mostly used for DOM-based activities, as discussed in this section, whereas AJAX is a collection of technologies, which we are going to learn about in the next section.

AJAX

AJAX is a web development technique that uses a group of web technologies on the client side to create asynchronous web applications.

JavaScript XMLHttpRequest (XHR) objects are used to execute AJAX on web pages and load page content without refreshing or reloading the page. Please visit the AJAX page on W3Schools (https://www.w3schools.com/js/js_ajax_intro.asp) for more information on AJAX. From a scraping point of view, a basic overview of JavaScript functionality will be valuable to understand how a page is built or manipulated, as well as to identify the dynamic components used.
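When a page loads its data through XHR, a scraper can often request the same resource directly. The sketch below only builds such a request with Python's urllib and decodes a sample JSON body; nothing is sent over the network. The URL, the response body, and both helper names are hypothetical. The X-Requested-With header is a convention used by some JavaScript libraries, not a requirement.

```python
import json
import urllib.request


def build_xhr_style_request(url):
    """Build (but do not send) a request resembling a browser XHR call."""
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            # Conventional marker added by some libraries; optional.
            "X-Requested-With": "XMLHttpRequest",
        },
        method="GET",
    )


def decode_json_body(body_bytes):
    """Turn the raw bytes of a JSON response into Python objects."""
    return json.loads(body_bytes.decode("utf-8"))


# Hypothetical endpoint, used only to show the shape of the request.
request = build_xhr_style_request("https://example.com/api/items?page=1")
print(request.get_method(), request.full_url)

# Made-up response body, standing in for what such an endpoint might return.
sample_body = b'{"items": [{"id": 1, "name": "first"}]}'
items = decode_json_body(sample_body)["items"]
print(items[0]["name"])
```

Passing the request object to urllib.request.urlopen() would actually send it. The real endpoint and headers for a given site have to be discovered first, for example, with the browser tools covered later in this chapter.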

Important note

Please visit https://developer.mozilla.org/en-US/docs/Web/JavaScript, https://www.javascript.com/, https://www.w3schools.com/js/js_intro.asp, and https://www.w3schools.com/js/js_ajax_intro.asp for more information on JavaScript and AJAX.

We have learned about a few JavaScript-based techniques and technologies that are commonly deployed in web development today. In the next section, we will learn about data-storing objects.

JSON

JSON is a format used for storing and transporting data from a server to a web page. It is language-independent and preferred in web-based data interchange actions due to its size and readability. JSON files are files that have the .json extension.

JSON data is normally formatted as a name:value pair, which is evaluated as a JavaScript object and follows JavaScript operations. JSON and XML are often compared, as they both carry and exchange data between various web resources. JSON is usually ranked higher than XML for its structure, which is simple, readable, self-descriptive, understandable, and easy to process.

For web applications using JavaScript, AJAX, or RESTful services, JSON is preferred over XML due to its fast and easy operation. JSON text converts readily to and from JavaScript objects. JSON is not a markup language, and it doesn’t contain any tags or attributes. Instead, it is a text-only format that can be served by a server and handled by any programming language.

JSON data can combine objects (similar to dictionaries) and arrays (similar to lists):

{"mymembers":[
{ "firstName":"Aasira", "lastName":"Chapagain","cityName":"Kathmandu"},
{ "firstName":"Rakshya", "lastName":"Dhungel","cityName":"New Delhi"},
{ "firstName":"Shiba", "lastName":"Paudel","cityName":"Biratnagar"}
]}
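In Python, the built-in json module converts this kind of text into dictionaries and lists. The following is a minimal sketch; the JSON_TEXT constant holds the mymembers data as a string, and the variable names are chosen for this illustration.

```python
import json

# Strict JSON: double-quoted strings and no comma after the last item.
JSON_TEXT = """
{"mymembers": [
  {"firstName": "Aasira", "lastName": "Chapagain", "cityName": "Kathmandu"},
  {"firstName": "Rakshya", "lastName": "Dhungel", "cityName": "New Delhi"},
  {"firstName": "Shiba", "lastName": "Paudel", "cityName": "Biratnagar"}
]}
"""

data = json.loads(JSON_TEXT)   # JSON object -> Python dict
members = data["mymembers"]    # JSON array  -> Python list
cities = [member["cityName"] for member in members]

print(type(data).__name__, type(members).__name__)
print(cities)

# json.dumps() goes the other way, from Python objects back to a string.
first_member_text = json.dumps(members[0], sort_keys=True)
print(first_member_text)
```

Note that json.loads() rejects a trailing comma after the last array element, so JSON copied by hand is worth validating before it is parsed.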

You have learned about JSON, which is a content holder. In the following section, we will discuss HTML styling using CSS and providing HTML tags with extra identification.

Important note

In Python, parsed JSON becomes a mixture of dictionary and list objects. JSON is written as a string, and plenty of websites can validate and format JSON strings, for example, https://jsonformatter.org/, https://jsonlint.com/, and https://www.freeformatter.com/json-formatter.html.

Please visit http://www.json.org/, https://jsonlines.org/, and https://www.w3schools.com/js/js_json_intro.asp for more information regarding JSON and JSON Lines.

CSS

The web-based technologies we have introduced so far deal with content, including binding, development, and processing. CSS describes the display properties of HTML elements and the appearance of web pages. CSS is used for styling and providing the desired appearance and presentation of HTML elements.

By using CSS, developers/designers can control the layout and presentation of a web document. CSS can be applied to a distinct element in a page, or it can be embedded through a separate document. Styling details can be described using the <style> tag.

The <style> tag can contain rules targeting repeated and various elements in a block. As seen in the following code, multiple <a> elements exist, and some of them also carry the class and id global attributes:

<html>
<head>
<style>
a{color:blue;}
h1{color:black; text-decoration:underline;}
#idOne{color:red;}
.classOne{color:orange;}
</style>
</head>
<body>
<h1> Welcome to Web Scraping </h1>Links:<a href="https://www.google.com"> Google </a> &nbsp;
<a class='classOne' href="https://www.yahoo.com"> Yahoo </a>
<a id='idOne' href="https://www.wikipedia.org"> Wikipedia </a>
</body>
</html>

Elements styled by the rules inside the <style> tag in the preceding code block produce the output shown in Figure 1.4:

Figure 1.4: Output of the HTML code using CSS


Although CSS is used to manage the appearance of HTML elements, CSS selectors (patterns used to select elements or the position of elements) often play a major role in the scraping process. We will be exploring CSS selectors in detail in Chapter 3.
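To give a feel for how the class and id attributes help locate elements, here is a rough sketch using only Python's built-in html.parser module. It imitates what the selectors a.classOne and a#idOne would match; it is not a CSS selector engine, and the LinkCollector and select_links names are helpers invented for this illustration.

```python
from html.parser import HTMLParser

# The <body> content from the preceding example.
HTML_TEXT = """
<h1> Welcome to Web Scraping </h1>Links:
<a href="https://www.google.com"> Google </a>
<a class='classOne' href="https://www.yahoo.com"> Yahoo </a>
<a id='idOne' href="https://www.wikipedia.org"> Wikipedia </a>
"""


class LinkCollector(HTMLParser):
    """Collect the attributes of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs))


def select_links(links, class_name=None, element_id=None):
    """Return href values of links matching the given class and/or id."""
    matched = []
    for link in links:
        classes = (link.get("class") or "").split()
        if class_name is not None and class_name not in classes:
            continue
        if element_id is not None and link.get("id") != element_id:
            continue
        matched.append(link.get("href"))
    return matched


collector = LinkCollector()
collector.feed(HTML_TEXT)

all_links = select_links(collector.links)                           # like: a
class_links = select_links(collector.links, class_name="classOne")  # like: a.classOne
id_links = select_links(collector.links, element_id="idOne")        # like: a#idOne
print(all_links, class_links, id_links, sep="\n")
```

Dedicated libraries support real selector syntax, so a hand-written filter like this is mainly useful for understanding the idea.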

Important note

Please visit https://www.w3.org/Style/CSS/ and https://www.w3schools.com/css/ for more detailed information on CSS.

In this section, you were introduced to some of the technologies that can be used for web scraping. In the upcoming section, you will learn about data-finding techniques. Most of them are built with web technologies you have already been introduced to.

 

Data-finding techniques used in web pages

To extract data from websites or web pages, we must identify where exactly the data is located. This is the most important step in the case of automating data collection from the web.

When we browse or request any URL in a web browser, the contents we see arrive as responses. These contents may be static, or they may be added or rendered into HTML templates dynamically by an API or JavaScript code. Knowing the URL that returns the content, or finding the file that contains it, is the first step toward scraping. Content can also be retrieved from third-party sources, or sometimes be embedded in the view presented to end users.

In this section, we will explore a few key techniques that will help us identify, search for, and locate contents we have received via a web browser.

HTML source page

Web browsers are used for client-server-based GUI interaction to explore web content. When a web address or URL is entered in the browser address bar, the request is sent to the server (host), and the response that comes back is loaded by the browser. This response, or page source, can be further explored and searched for the desired content in raw format.

Important note

You are free to choose which web browser you wish to use. Most web browsers will display the same or similar content. We will be using Google Chrome for most of the book’s examples, installed on the Windows OS.

To access the HTML source page, follow these steps:

  1. Open https://www.google.com in your web browser (you can try the same scenario with any other URL).
  2. After the page is loaded completely, right-click on any section of the page. The menu shown in Figure 1.5 should be visible, with the View page source option:
Figure 1.5: View page source (right-click on any page and find this option)


  3. If you click the View page source option, it will load a new tab in the browser, as seen in Figure 1.6:
Figure 1.6: Page source (new tab loaded in the web browser, with raw HTML)


You can see that a new tab will be added to the browser with the text view-source: prepended to the original URL, https://www.google.com. Also, if we add the text view-source: to our URL, once the URL loads, it displays the page source or raw HTML.
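The same raw HTML can be requested from a script. The following is a minimal sketch using Python's built-in urllib; the fetch_page_source name, the User-Agent string, and the timeout value are choices made for this illustration rather than requirements.

```python
import urllib.request


def fetch_page_source(url, timeout=10):
    """Return the body of the response for url as text (the raw page source)."""
    request = urllib.request.Request(
        url,
        # Illustrative User-Agent value; some servers reject a missing one.
        headers={"User-Agent": "Mozilla/5.0 (page-source sketch)"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_page_source("https://www.google.com") should return text comparable to what the view-source: tab displays, although a server may respond differently to a script than to a browser.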

Important note

You can try to find any text or DOM element in the web browser by searching inside the page source. Load the URL https://www.google.com and search web scraping. Find some of the content displayed by Google using the page source.

We now have a basic idea of data-finding techniques. Viewing the page source is the primary, or base, technique. More sophisticated techniques come with a larger set of functionality and tools to guide us in finding data, and we will cover them in the next section.

Developer tools

DevTools are found embedded within most browsers on the market today. Developers and end users alike can identify and locate resources and search for web content that is used during client-server communication, or while engaged in an HTTP request and response.

DevTools allow a user to examine, create, edit, and debug HTML, CSS, and JavaScript. They also allow us to handle and figure out performance problems. They facilitate the extraction of data that is dynamically or securely presented by the browser.

DevTools will be used for most data extraction cases. For more detailed information on DevTools, here are some links:

Similar to the View page source option, as discussed in the HTML source page section, we can find the Inspect menu option, which is another option for viewing the page source, when we right-click on a web page.

Alternatively, you can access DevTools via the main menu in the browser. Click More tools | Developer tools, or press Ctrl + Shift + I, as seen in Figure 1.7:

Figure 1.7: Accessing DevTools (web browser menu bar)


Let’s try loading the URL https://en.wikipedia.org/wiki/FIFA in the web browser. After the page gets loaded, follow these steps:

  1. Right-click the page and click the Inspect menu option.

We’ll notice a new menu section with tabs (Elements, Console, Sources, Network, Memory, and more) appearing in the browser, as seen in Figure 1.8:

Figure 1.8: Inspecting the DevTools panels


  2. Press Ctrl + Shift + I to access the DevTools or click the Network tab from the Inspect menu option, as shown in Figure 1.9:
Figure 1.9: DevTools Network panel


Important note

The Search and Filter fields, as seen in Figure 1.9, are often used to find content in the HTML page source or in other resources available in the Network panel. The Search box accepts a regex pattern and can be made case-sensitive, which helps to locate both static and dynamically loaded content.
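The same kind of search can be run in Python over text that has already been downloaded. The following is a minimal sketch with the built-in re module; find_matches and the sample string are made up for this illustration.

```python
import re


def find_matches(text, pattern, ignore_case=True):
    """Return (position, matched text) pairs for a regex pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, text, flags)]


# Made-up fragment standing in for a downloaded page source.
sample = "<title>Web Scraping - Wikipedia</title><p>web scraping tools</p>"

any_case = find_matches(sample, r"web\s+scraping")
exact_case = find_matches(sample, r"web\s+scraping", ignore_case=False)

print([text for _, text in any_case])
print([text for _, text in exact_case])
```

Switching ignore_case off mirrors a case-sensitive search: only the lowercase occurrence is reported.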

All panels and tools found inside DevTools have a designated role. Let’s get a basic overview of a few important ones next.

Exploring DevTools

Here is a list of all the panels and tools found in DevTools:

  • Elements: Displays the HTML content of the page viewed. This is used for viewing and editing the DOM and CSS, and for finding CSS selectors and XPath expressions. Figure 1.10 shows the selector icon, as found after choosing the Inspect menu option. Once the icon is clicked, the pointer can be moved over content in the page, or over code inside the Elements panel, to locate HTML tags, XPath, and DOM element positions:
Figure 1.10: Element inspector or selector


This icon acts similarly to the mouse cursor moving across the screen. We will explore CSS selectors and XPath further in Chapter 3.

Important note

HTML elements displayed or located in the Elements or Network | Doc panel may not be available in the page source.

  • Console: Used to run and interact with JavaScript code, and to view log messages.
  • Sources: Used to navigate pages and view available scripts and document sources. Script-based tools are available for tasks such as script execution (that is, resuming and pausing), stepping over function calls, activating and deactivating breakpoints, and handling exceptions.
  • Network: Provides us with HTTP request and response-related resources. Resources found here feature options such as recording data to network logs, capturing screenshots, filtering web resources (JavaScript, images, documents, and CSS), searching web resources, and grouping web resources, and can also be used for debugging tasks. Figure 1.11 displays the HTTP request URL, request method, status code, and more, by accessing the Headers tab from the Doc option available inside the Network panel.
Figure 1.11: DevTools – Network | Doc | Headers option (HTTP method and status code)


Network-based requests can also be filtered by the following types:

  • All: Lists all requests related to the network, including document requests, images, fonts, and CSS. Resources are placed in the order of them being loaded.
  • Fetch/XHR: Lists XHR objects. This option lists dynamically loaded resources, such as API and AJAX content.
  • JS: Lists JavaScript files involved in the request and response cycle.
  • CSS: Lists all style files.
  • Img: Lists image files and their details.
  • Doc: Lists requested HTML or web-related documents.
  • WS: Lists WebSocket-related entries and their details.
  • Other: Lists any unfiltered type of request-related resources.
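A similar grouping can be approximated in Python once request details have been saved, for example, as a HAR file exported from the Network panel. The sketch below works on made-up records shaped like HAR entries and guesses a filter name from each response MIME type. The guess_filter mapping is an assumption made for illustration; DevTools itself classifies requests by how they were initiated, so JSON or XML responses are only likely, not certain, to be Fetch/XHR.

```python
from collections import defaultdict

# Made-up records in the shape of HAR entries (log.entries[...]).
SAMPLE_ENTRIES = [
    {"request": {"method": "GET", "url": "https://example.com/"},
     "response": {"status": 200,
                  "content": {"mimeType": "text/html; charset=utf-8"}}},
    {"request": {"method": "GET", "url": "https://example.com/static/site.css"},
     "response": {"status": 200, "content": {"mimeType": "text/css"}}},
    {"request": {"method": "GET", "url": "https://example.com/static/app.js"},
     "response": {"status": 200,
                  "content": {"mimeType": "application/javascript"}}},
    {"request": {"method": "GET", "url": "https://example.com/logo.png"},
     "response": {"status": 200, "content": {"mimeType": "image/png"}}},
    {"request": {"method": "GET", "url": "https://example.com/api/items?page=1"},
     "response": {"status": 200,
                  "content": {"mimeType": "application/json"}}},
]


def guess_filter(mime_type):
    """Map a MIME type to a Network-panel-like filter name (a rough guess)."""
    mime = mime_type.split(";")[0].strip().lower()
    if mime == "text/html":
        return "Doc"
    if mime == "text/css":
        return "CSS"
    if mime in ("application/javascript", "text/javascript"):
        return "JS"
    if mime.startswith("image/"):
        return "Img"
    if mime in ("application/json", "application/xml", "text/xml"):
        return "Fetch/XHR"  # likely, but not guaranteed, to be XHR data
    return "Other"


def group_by_filter(entries):
    """Group request URLs under the guessed filter name."""
    groups = defaultdict(list)
    for entry in entries:
        mime_type = entry["response"]["content"]["mimeType"]
        groups[guess_filter(mime_type)].append(entry["request"]["url"])
    return dict(groups)


groups = group_by_filter(SAMPLE_ENTRIES)
for name, urls in groups.items():
    print(name, urls)
```

For scraping, the Doc and Fetch/XHR groups are usually the first places to look, because they tend to carry the page content and its dynamically loaded data.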

For each of the filter options just listed, there are some child tabs for selected resources in the Name panel, which are as follows:

  • Headers: Loads HTTP/HTTPS header data for a particular request. A few important and automation-based types of data are also found, for example, request URL, method, status code, request/response headers, query string, payload, or POST information.
  • Preview: Provides a preview of the response found, similar to the entities viewed in the web browser.
  • Response: Loads the response from particular entities. This tab shows the HTML source for HTML pages, JavaScript code for JavaScript files, and JSON or CSV data for similar documents. It actually shows the raw source of the content.
  • Initiator: Provides the initiator links or chains of initiator URLs. It is similar to the referer in the request headers.
  • Timing: Shows a breakdown of the time between resource scheduling, when the connection starts, and the request/response.
  • Cookies: Provides cookie-related information, its keys and values, and expiration dates.
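Values copied from the Headers and Cookies tabs can be broken down with Python's standard library. The following is a minimal sketch; the request URL and the Set-Cookie value are made-up examples, not output from a real site.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qs, urlsplit

# Hypothetical request URL, as it might appear in the Headers tab.
request_url = "https://example.com/search?q=web+scraping&page=2"

parts = urlsplit(request_url)
query = parse_qs(parts.query)  # query string -> dict of lists

print(parts.netloc, parts.path)
print(query)

# Made-up Set-Cookie response header value.
set_cookie_value = "sessionid=abc123; Path=/; Max-Age=3600"

cookie = SimpleCookie()
cookie.load(set_cookie_value)
session = cookie["sessionid"]

print(session.value, session["path"], session["max-age"])
```

parse_qs() returns a list for every key because a query string may repeat a parameter, and each cookie is exposed as an object whose attributes, such as path and max-age, are read like dictionary items.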

Important note

The Network panel is one of the most important resource hubs. We can find/trace plenty of information and supporting details for each request/response cycle in this panel. For more detailed information on the Network panel, please visit https://developer.chrome.com/docs/devtools/network/ and https://firefox-source-docs.mozilla.org/devtools-user/network_monitor/.

  • Performance: Screenshots and a memory timeline can be recorded. The visual information obtained is used to optimize the website speed, improve load times, and analyze the runtime or overall performance.
  • Memory: Information obtained from this panel is used to fix memory issues and track down memory leaks. Overall, the details from the Performance and Memory panels allow developers to analyze website performance and embark on further planning related to optimization.
  • Application: The end user can inspect and manage storage for all loaded resources during page loading. Information related to cookies, sessions, application cache, images, databases on the fly, and more can be viewed and even deleted to create a fresh session.
  • Security: This panel might not be available in all web browsers. It normally shows security-related information, such as resources, certificates, and connections. We can even browse more about certificate details, from a few detail links or buttons available in this panel, as shown here in Figure 1.12:
 Figure 1.12: Security panel (details about certificate, connection, and resources)


After exploring the HTML page source and DevTools, we now have an idea about where data and request/response-related information is stored, and how we can access it. Overall, the scraping process involves extracting data from web pages, and we need to identify or locate the resources with data or those that can carry data. Before proceeding with data exploration and content identification, it is beneficial to identify the page URL, DevTools resources, XHR, JavaScript, and a general overview of browser-based activities.

Finally, there are more topics to cover, such as links and child pages. We will work with resources such as sitemaps.xml and robots.txt in depth in Chapter 3.

Important note

For basic concepts related to sitemaps.xml and robots.txt, please visit the Sitemaps site (https://www.sitemaps.org) and the Robots Exclusion Protocol site (http://www.robotstxt.org).
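As a small preview, Python's standard library includes urllib.robotparser for reading robots.txt rules. The sketch below parses a made-up robots.txt from a string, so no request is made; the rules, the example.com URLs, and the my-scraper user agent name are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Rules are checked in order, so /private/ is refused before Allow: / applies.
allowed_home = parser.can_fetch("my-scraper", "https://example.com/index.html")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")

print(allowed_home, allowed_private)
```

For a live site, set_url() followed by read() downloads the file instead of parsing a string.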

In this chapter, you have learned about web scraping, selected web technologies that are involved, and how data-finding techniques are used.

 

Summary

Websites are dynamic in nature, so the fundamental activities introduced in this chapter will be applicable in most cases. We also explained and explored some of the core technologies related to the World Wide Web (WWW) and web scraping. Identifying or finding content with the use of DevTools and page sources for targeted content was the focus of this chapter. This information will guide you through various aspects of taking primary and professional steps in web scraping.

In the next chapter, we will be using the Python programming language to interact with the web or data sources and explore a few main libraries that we have chosen for data extraction.

 
About the Author
  • Anish Chapagain

    Anish Chapagain is a software engineer with a passion for data science, its processes, and Python programming, which began around 2007. He has been working with web scraping and analysis-related tasks for more than 5 years, and is currently pursuing freelance projects in the web scraping domain. Anish previously worked as a trainer, web/software developer, and as a banker, where he was exposed to data and gained further insights into topics including data analysis, visualization, data mining, information processing, and knowledge discovery. He has an MSc in computer systems from Bangor University (University of Wales), United Kingdom, and an Executive MBA from Himalayan Whitehouse International College, Kathmandu, Nepal.
