Text Wrangling and Analysis

In this chapter, we will cover:

  • Installing NLTK
  • Performing sentence splitting
  • Performing tokenization
  • Performing stemming
  • Performing lemmatization
  • Identifying and removing stop words
  • Calculating the frequency distribution of words
  • Identifying and removing rare words
  • Identifying and removing short words
  • Removing punctuation marks
  • Piecing together n-grams
  • Scraping a job listing from StackOverflow
  • Reading and cleaning the description in the job listing
  • Creating a word cloud from a StackOverflow job listing

Introduction

Mining the data is often the most interesting part of the job, and text is one of the most common data sources. We will be using the NLTK toolkit to introduce common natural language processing concepts and statistical models. Not only do we want to find quantitative data, such as numbers within the data we have scraped, we also want to be able to analyze various characteristics of textual information. This analysis of textual information is often lumped into a category known as natural language processing (NLP). Python's NLTK library provides rich capabilities for this, and we will investigate several of them.

Installing NLTK

In this recipe we learn to install NLTK, the natural language toolkit for Python.

How to do it

We proceed with the recipe as follows:

  1. The core of NLTK can be installed using pip:
pip install nltk
  2. Some processes, such as those we will use, require an additional download of the data sets they use to perform their analyses. These can be downloaded by executing the following:
import nltk
nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
  3. On a Mac, this pops up the following window:
The NLTK GUI

Select the all collection and press the Download button. The tools will begin to download a number of data sets. This can take a while, so grab a coffee...
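
If you prefer not to use the interactive downloader, the specific data sets used by the recipes in this chapter can be fetched programmatically. A minimal sketch (punkt, stopwords, and wordnet are the standard NLTK package identifiers the later recipes rely on):

import nltk

# download only the data sets needed by the recipes in this chapter
nltk.download('punkt')      # sentence and word tokenizer models
nltk.download('stopwords')  # stop word lists for many languages
nltk.download('wordnet')    # WordNet data used by the lemmatizer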

Performing sentence splitting

Many NLP processes require splitting a large amount of text into sentences. This may seem to be a simple task, but for computers it can be problematic. A simple sentence splitter can look just for periods (.), or use other algorithms such as predictive classifiers. We will examine two means of sentence splitting with NLTK.

How to do it

We will use a sentence stored in the 07/sentence1.txt file. It has the following content, which was pulled from a random job listing on StackOverflow:

We are seeking developers with demonstrable experience in: ASP.NET, C#, SQL Server, and AngularJS. We are a fast-paced, highly iterative team that has to adapt quickly as our factory grows. We need people who...
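
As a quick illustration of the simpler approach, here is a minimal sketch that splits this text with NLTK's sent_tokenize (it assumes the punkt model has been downloaded and that 07/sentence1.txt contains the text above):

from nltk.tokenize import sent_tokenize

# read the raw job-listing text
with open('07/sentence1.txt', 'r') as f:
    text = f.read()

# sent_tokenize uses the pre-trained punkt model to find sentence boundaries
sentences = sent_tokenize(text)
for s in sentences:
    print(s)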

Performing tokenization

Tokenization is the process of converting text into tokens. These tokens can be paragraphs, sentences, or individual words, but tokenization is most commonly performed at the word level. NLTK comes with a number of tokenizers that will be demonstrated in this recipe.

How to do it

The code for this example is in the 07/02_tokenize.py file. It extends the sentence splitter to demonstrate five different tokenization techniques. Only the first sentence in the file will be tokenized, so that the output stays at a reasonable length:

  1. The first step is to simply use the built-in Python string .split() method. This results in the following:
print(first_sentence.split())
['We', 'are...
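
Beyond .split(), NLTK ships several tokenizers of its own. A minimal sketch of two of them, word_tokenize and regexp_tokenize (the punkt model must be downloaded for word_tokenize):

from nltk.tokenize import word_tokenize, regexp_tokenize

sentence = "We are seeking developers with demonstrable experience in: ASP.NET, C#, SQL Server, and AngularJS."

# word_tokenize keeps punctuation marks as separate tokens
print(word_tokenize(sentence))

# regexp_tokenize with r'\w+' keeps only runs of word characters, dropping punctuation
print(regexp_tokenize(sentence, r'\w+'))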

Performing stemming

Stemming is the process of cutting down a token to its stem. Technically, it is the process of reducing inflected (and sometimes derived) words to their word stem - the base root form of the word. As an example, the words fishing, fished, and fisher stem from the root word fish. This helps to reduce the set of words being processed to a smaller base set that is more easily processed.

The most common algorithm for stemming was created by Martin Porter, and NLTK provides an implementation of this algorithm in the PorterStemmer. NLTK also provides an implementation of the Snowball stemmer, which was also created by Porter and was designed to handle languages other than English. There is one more implementation provided by NLTK, referred to as the Lancaster stemmer, which is considered the most aggressive of the three.

...
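
To illustrate the difference between the three stemmers mentioned above, here is a minimal sketch, independent of the book's 07 scripts:

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

words = ['fishing', 'fished', 'fisher', 'iteratively', 'factories']

porter = PorterStemmer()
snowball = SnowballStemmer('english')   # Snowball also supports other languages
lancaster = LancasterStemmer()

# compare how aggressively each algorithm trims the same words
for word in words:
    print(word, porter.stem(word), snowball.stem(word), lancaster.stem(word))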

Performing lemmatization

Lemmatization is a more methodical process of converting words to their base form. Where stemming generally just chops off the ends of words, lemmatization takes into account the morphological analysis of each word, evaluating context and part of speech to determine the inflected form, and then applies different rules to arrive at the root.

How to do it

Lemmatization can be performed in NLTK using the WordNetLemmatizer. This class uses WordNet, a large lexical database of English, to make its decisions. The code in the 07/04_lemmatization.py file extends the previous stemming example to also calculate the lemmatization of each word. The code of importance is the following:

from nltk.stem...
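
Independently of the book's script, a minimal sketch of the lemmatizer on its own (it assumes the wordnet data set has been downloaded; the pos argument tells WordNet which part of speech to use):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# nouns are the default part of speech
print(lemmatizer.lemmatize('geese'))             # goose
# pass pos='v' to lemmatize verbs
print(lemmatizer.lemmatize('running', pos='v'))  # run
print(lemmatizer.lemmatize('are', pos='v'))      # be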

Identifying and removing stop words

Stop words are common words that, in a natural language processing situation, do not provide much contextual meaning. They are often the most common words in a language and, at least in English, tend to be articles and pronouns, such as I, me, the, is, which, who, and at, among others. Removing these words before processing documents often makes the analysis of their meaning easier, and hence many tools support this ability. NLTK is one of them, and comes with support for stop word removal for roughly 22 languages.

How to do it

Proceed with the recipe as follows (code is available in 07/06_freq_dist.py):

  1. The following demonstrates stop word removal using NLTK. First,...
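
A minimal sketch of the general pattern, assuming the stopwords corpus has been downloaded and starting from a small list of word tokens rather than the book's scraped text:

from nltk.corpus import stopwords

tokens = ['we', 'are', 'seeking', 'developers', 'with', 'demonstrable', 'experience']

# build a set of English stop words for fast membership tests
english_stops = set(stopwords.words('english'))

# keep only the tokens that are not stop words
cleaned = [t for t in tokens if t not in english_stops]
print(cleaned)   # ['seeking', 'developers', 'demonstrable', 'experience']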

Calculating the frequency distribution of words

A frequency distribution counts the number of occurrences of distinct data values. These are of value as we can use them to determine which words or phrases within a document are most common, and from that infer those that have greater or lesser value.

Frequency distributions can be calculated using several different techniques. We will examine them using the facilities built into NLTK.

How to do it

NLTK provides a class, nltk.probability.FreqDist, that allows us to very easily calculate the frequency distribution of values in a list. Let's examine using this class (code is in 07/freq_dist.py):

  1. To create a frequency distribution using NLTK, start by importing the...
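
As a minimal, self-contained sketch of the class (using a throwaway word list rather than the book's scraped text):

from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

text = "the quick brown fox jumps over the lazy dog and the quick cat"
tokens = word_tokenize(text)

# FreqDist counts the occurrences of each distinct token
fdist = FreqDist(tokens)
print(fdist['the'])           # 3
print(fdist.most_common(3))   # the three most frequent tokens, led by ('the', 3)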

Identifying and removing rare words

We can remove words with low occurrences by finding words with low frequency counts, words that fall outside a certain deviation from the norm, or simply words from a list considered rare within the given domain. The technique we will use works the same in each case.

How to do it

Rare words can be removed by building a list of those rare words and then removing them from the set of tokens being processed. The list of rare words can be determined by using the frequency distribution provided by NLTK. You then decide what frequency count should be used as the rare word threshold:

  1. The script in the 07/07_rare_words.py file extends that of the frequency distribution recipe...
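
A minimal sketch of the threshold approach (the cut-off of one occurrence is an arbitrary choice for illustration):

from nltk.probability import FreqDist

tokens = ['python', 'scraping', 'python', 'nltk', 'python', 'nltk', 'xslt']

fdist = FreqDist(tokens)

# treat any word seen only once as rare (an arbitrary threshold for this sketch)
rare = set(w for w in fdist if fdist[w] <= 1)

# drop the rare words from the token stream
filtered = [t for t in tokens if t not in rare]
print(filtered)   # ['python', 'python', 'nltk', 'python', 'nltk']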

Identifying and removing short words

Removal of short words can also be useful in removing noise words from the content. The following examines removing words of a certain length or shorter. It also demonstrates the opposite by selecting the words not considered short (having a length of more than the specified short word length).

How to do it

We can leverage the frequency distribution from NLTK to efficiently calculate the short words. We could just scan all of the words in the source, but it is more efficient to scan the lengths of the keys in the resulting distribution, as they form a significantly smaller set of data:

  1. The script in the 07/08_short_words.py file exemplifies this process. It starts by loading...
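
A minimal sketch of filtering on length via the distribution's keys (the cut-off of three characters is an arbitrary choice):

from nltk.probability import FreqDist

tokens = ['we', 'are', 'seeking', 'developers', 'with', 'experience', 'in', 'sql']

fdist = FreqDist(tokens)

# the distribution's keys are the distinct words, a smaller set than the raw tokens
short = set(w for w in fdist if len(w) <= 3)
long_enough = [t for t in tokens if t not in short]

print(sorted(short))   # ['are', 'in', 'sql', 'we']
print(long_enough)     # ['seeking', 'developers', 'with', 'experience']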

Removing punctuation marks

Depending upon the tokenizer used and the input given to that tokenizer, it may be desirable to remove punctuation from the resulting list of tokens. The regexp_tokenize function with '\w+' as the expression removes punctuation well, but word_tokenize does not do this very well and will return many punctuation marks as their own tokens.

How to do it

Removing punctuation marks from our tokens is done similarly to the removal of other words within our tokens: we use a list comprehension and select only those items that are not punctuation marks. The script in the 07/09_remove_punctuation.py file demonstrates this. Let's walk through the process:

  1. We'll start with the following, which will...
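
A minimal sketch of the list-comprehension approach, using Python's string.punctuation as the set of marks to discard:

import string
from nltk.tokenize import word_tokenize

text = "We are seeking developers with experience in: ASP.NET, C#, and SQL Server."
tokens = word_tokenize(text)

# string.punctuation contains the common ASCII punctuation marks
punctuation = set(string.punctuation)

# keep only the tokens that are not a lone punctuation mark
cleaned = [t for t in tokens if t not in punctuation]
print(cleaned)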

Piecing together n-grams

Much has been written about using NLTK to identify n-grams within text. An n-gram is a set of words, n words in length, that is common within a document/corpus (occurring two or more times). A 2-gram is any two words commonly repeated, a 3-gram is a three-word phrase, and so on. We will not look into determining the n-grams in a document. We will focus on reconstructing known n-grams from our token streams, as we consider those n-grams more important to a search result than the two or three independent words found in any order.

In the domain of parsing job listings, important 2-grams can be things such as Computer Science, SQL Server, Data Science, and Big Data. Additionally, we could consider C# a 2-gram of 'C' and '#', which is why we might not want to use the regex parser or treat '#' as punctuation when processing...
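
One way to rebuild known 2-grams from a token stream (not necessarily how the book's tech2grams helper does it) is NLTK's MWETokenizer, which re-joins specified multi-word expressions; a minimal sketch:

from nltk.tokenize import MWETokenizer

# the 2-grams we want to treat as single tokens
known_2grams = [('SQL', 'Server'), ('Data', 'Science'), ('Big', 'Data')]

tokenizer = MWETokenizer(known_2grams, separator=' ')

tokens = ['experience', 'with', 'SQL', 'Server', 'Data', 'Science', 'and', 'Big', 'Data']
print(tokenizer.tokenize(tokens))
# ['experience', 'with', 'SQL Server', 'Data Science', 'and', 'Big Data']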

Scraping a job listing from StackOverflow

Now let's pull a bit of this together to scrape information from a StackOverflow job listing. We are going to look at just one listing at this time so that we can learn the structure of these pages and pull information from them. In later chapters, we will look at aggregating results from multiple listings. Let's now just learn how to do this.
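
As a rough sketch of the kind of fetch-and-parse step involved (the URL here is a placeholder and the selectors are not the book's actual code; the recipe itself walks through the real page structure):

import requests
from bs4 import BeautifulSoup

# placeholder URL for a single job listing page
url = 'https://stackoverflow.com/jobs/example-listing'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# which tags and attributes to extract depends on the page structure;
# printing the page title is just a smoke test that the fetch worked
print(soup.title.get_text())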

Getting ready

Reading and cleaning the description in the job listing

The description of the job listing is still in HTML. We will want to extract the valuable content from this data, so we will need to parse the HTML and then perform tokenization, stop word removal, common word removal, some tech 2-gram processing, and in general all of those different processes. Let's look at doing these.
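
Pulled together from the earlier recipes, a minimal sketch of what such a cleaning pipeline can look like (the clean_description function and its sample input are placeholders for what the recipe builds up step by step):

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def clean_description(html_description):
    # strip the HTML tags, keeping only the visible text
    text = BeautifulSoup(html_description, 'html.parser').get_text()

    # tokenize and lower-case the words
    tokens = [t.lower() for t in word_tokenize(text)]

    # drop stop words and single-character tokens such as lone punctuation
    stops = set(stopwords.words('english'))
    return [t for t in tokens if t not in stops and len(t) > 1]

print(clean_description('<p>We are seeking developers with SQL Server experience.</p>'))
# ['seeking', 'developers', 'sql', 'server', 'experience']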

Getting ready

I have collapsed the code for determining tech-based 2-grams into the 07/tech2grams.py file. We will use the tech_2grams function within the file.

How to do it...

The code...
