You're reading from The Natural Language Processing Workshop

Product type Book

Published in Aug 2020

Publisher Packt

ISBN-13 9781800208421

Pages 452 pages

Edition 1st Edition

Languages

Python

Concepts

Mobile Application Development

Authors (6):

Rohan Chopra

Aniruddha M. Godbole

Nipun Sadvilkar

Muzaffar Bashir Shah

Sohom Ghosh

Dwight Gunning


Table of Contents (10) Chapters

Preface

1. Introduction to Natural Language Processing

2. Feature Extraction Methods

3. Developing a Text Classifier

4. Collecting Text Data with Web Scraping and APIs

5. Topic Modeling

6. Vector Representation

7. Text Generation and Summarization

8. Sentiment Analysis

Appendix

4. Collecting Text Data with Web Scraping and APIs

Overview

This chapter introduces you to the concept of web scraping. You will first learn how to extract data (such as text, images, lists, and tables) from pages that are written using HTML. You will then learn about the various types of semi-structured data used to create web pages (such as JSON and XML) and extract data from them. Finally, you will use APIs for data extraction from Twitter, using the tweepy package.

Introduction

In the last chapter, we developed a simple classifier using feature extraction methods. We also covered different algorithms that fall under supervised and unsupervised learning. In this chapter, you will learn how to collect text data by scraping web pages and how to process that data. Web scraping helps you extract useful data from online content, such as product prices and customer reviews, which can then be used for market research, price comparison, or data analysis. You will also learn how to handle various kinds of semi-structured data, such as JSON and XML. We will cover different methods for extracting data using Application Programming Interfaces (APIs). Finally, we will explore ways to extract data from different types of files.

Collecting Data by Scraping Web Pages

The basic building block of any web page is HTML (Hypertext Markup Language), a markup language that specifies the structure of your content. HTML is written using a series of tags, combined with optional content. The tags determine how the content they enclose is structured and displayed: they can make words bold or italic, add hyperlinks to the text, and even embed images. Additional information can be attached to an element using attributes within its tags. A web page can therefore be considered a document written in HTML, so we need to know the basics of HTML to scrape web pages effectively.
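To make the idea of tags and attributes concrete, here is a minimal sketch that walks through a small HTML document and collects its hyperlink targets and visible text. It uses `html.parser` from the Python standard library, which may not be the parsing library the chapter itself goes on to use. The `sample_html` string and the `LinkAndTextParser` class name are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class LinkAndTextParser(HTMLParser):
    """Collects hyperlink targets and visible text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []  # values of href attributes found on <a> tags
        self.text = []   # non-empty text fragments found between tags

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

    def handle_data(self, data):
        # Skip fragments that contain only whitespace (indentation, newlines)
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


# Hypothetical page content, used only for illustration
sample_html = """
<html>
  <body>
    <h1>Product reviews</h1>
    <p>This is <b>bold</b> and this is a <a href="https://example.com/reviews">link</a>.</p>
  </body>
</html>
"""

parser = LinkAndTextParser()
parser.feed(sample_html)

print(parser.links)  # hyperlink targets taken from href attributes
print(parser.text)   # text fragments in document order
```

The tag name (`a`, `b`, `h1`) tells the parser what kind of element it has reached, while the attribute list carries the extra information, such as the `href` of a link.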

The following figure depicts the contents that are included within an HTML tag:

Figure 4.1: Tags and attributes of HTML

As you can see in the preceding figure, we can easily identify different elements within an HTML tag. The basic HTML structure and commonly used tags are shown and explained as...

Dealing with Semi-Structured Data

We learned about various types of data in Chapter 2, Feature Extraction Methods. Let's quickly recap what semi-structured data refers to. A dataset is said to be semi-structured if it is not in a row-column format but can, if required, be converted into a structured format that has a definite number of rows and columns. Often, we come across data that is stored as key-value pairs or embedded between tags, as is the case with JSON (JavaScript Object Notation) and XML (Extensible Markup Language) files. These are the most widely used examples of semi-structured data.
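As a rough illustration of converting tag-embedded data into a row-column format, the sketch below parses a small XML document with `xml.etree.ElementTree` from the standard library and turns each record into one row. The `sample_xml` content and its element names (`reviews`, `review`, `product`, `rating`) are made up for this example and do not come from the book.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML data, used only for illustration
sample_xml = """
<reviews>
  <review id="1"><product>Laptop</product><rating>4</rating></review>
  <review id="2"><product>Phone</product><rating>5</rating></review>
</reviews>
""".strip()

root = ET.fromstring(sample_xml)

# Each <review> element becomes one row; attributes and child
# elements become the columns of that row.
rows = [
    {
        "id": review.get("id"),
        "product": review.findtext("product"),
        "rating": int(review.findtext("rating")),
    }
    for review in root.findall("review")
]

for row in rows:
    print(row)
```

Every row ends up with the same three keys, which is what gives the result a definite number of columns.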

JSON

JSON files are used for storing and exchanging data. JSON is human-readable and easy to interpret. Just like text files and CSV files, JSON files are language-independent. This means that different programming languages, such as Python, Java, and so on, can work with JSON files effectively. In Python, a built-in data structure called a dictionary is capable of...
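The relationship between JSON text and a Python dictionary can be sketched with the built-in `json` module. The `sample_json` string and its keys below are hypothetical and serve only as an illustration.

```python
import json

# Hypothetical JSON text, used only for illustration
sample_json = (
    '{"product": "Laptop", "price": 799.99, '
    '"tags": ["electronics", "computers"], "in_stock": true}'
)

# json.loads parses JSON text: a JSON object becomes a dict, an array
# becomes a list, and true/false become True/False.
record = json.loads(sample_json)

print(type(record).__name__)
print(record["tags"][0])

# json.dumps serializes the dictionary back into JSON text.
round_trip = json.dumps(record, indent=2)
print(round_trip)
```

Because the format is language-independent, the text held in `round_trip` could be read just as easily by a program written in Java or any other language with a JSON parser.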

Summary

In this chapter, we have learned various ways to collect data by scraping web pages. We also successfully scraped data from semi-structured formats such as JSON and XML and explored different methods of retrieving data in real time from a website without authentication. In the next chapter, you will learn about topic modeling—an unsupervised natural language processing technique that helps group documents according to the topics that it detects in them.
