Text Wrangling and Analysis

In this chapter, we will cover:

  • Installing NLTK
  • Performing sentence splitting
  • Performing tokenization
  • Performing stemming
  • Performing lemmatization
  • Identifying and removing stop words
  • Calculating the frequency distribution of words
  • Identifying and removing rare words
  • Identifying and removing short words
  • Removing punctuation marks
  • Piecing together n-grams
  • Scraping a job listing from StackOverflow
  • Reading and cleaning the description in the job listing
  • Creating a word cloud from a StackOverflow job listing

Introduction

Mining the data is often the most interesting part of the job, and text is one of the most common data sources. We will be using the NLTK toolkit to introduce common natural language processing concepts and statistical models. Not only do we want to find quantitative data, such as numbers within the data we have scraped, we also want to be able to analyze various characteristics of textual information. This analysis of textual information is often lumped into a category known as natural language processing (NLP). Python's NLTK library provides rich capabilities for this, and we will investigate several of them.

Installing NLTK

In this recipe we learn to install NLTK, the natural language toolkit for Python.

How to do it

We proceed with the recipe as follows:

  1. The core of NLTK can be installed using pip:
pip install nltk
  2. Some processes, such as those we will use, require an additional download of the data sets they use to perform their analyses. These can be downloaded by executing the following:
import nltk
nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
  3. On a Mac, this pops up the following window:
The NLTK GUI

Select the all collection and press the Download button. The tools will begin to download a number of data sets. This can take a while, so grab a coffee...
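
If you prefer not to use the interactive downloader, the specific data sets used by the recipes in this chapter can be fetched programmatically. A minimal sketch (punkt, stopwords, and wordnet are the standard NLTK package identifiers the later recipes rely on):

import nltk

# download only the data sets needed by the recipes in this chapter
nltk.download('punkt')      # sentence and word tokenizer models
nltk.download('stopwords')  # stop word lists for many languages
nltk.download('wordnet')    # WordNet data used by the lemmatizer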

Performing sentence splitting

Many NLP processes require splitting a large amount of text into sentences. This may seem to be a simple task, but for computers it can be problematic. A simple sentence splitter can look just for periods (.), or use other algorithms such as predictive classifiers. We will examine two means of sentence splitting with NLTK.

How to do it

We will use a sentence stored in the 07/sentence1.txt file. It has the following content, which was pulled from a random job listing on StackOverflow:

We are seeking developers with demonstrable experience in: ASP.NET, C#, SQL Server, and AngularJS. We are a fast-paced, highly iterative team that has to adapt quickly as our factory grows. We need people who...
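
As a quick illustration of the simpler approach, here is a minimal sketch that splits this text with NLTK's sent_tokenize (it assumes the punkt model has been downloaded and that 07/sentence1.txt contains the text above):

from nltk.tokenize import sent_tokenize

# read the raw job-listing text
with open('07/sentence1.txt', 'r') as f:
    text = f.read()

# sent_tokenize uses the pre-trained punkt model to find sentence boundaries
sentences = sent_tokenize(text)
for s in sentences:
    print(s)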

Performing tokenization

Tokenization is the process of converting text into tokens. These tokens can be paragraphs, sentences, or individual words, but tokenization is most commonly performed at the word level. NLTK comes with a number of tokenizers that will be demonstrated in this recipe.

How to do it

The code for this example is in the 07/02_tokenize.py file. It extends the sentence splitter to demonstrate five different tokenization techniques. Only the first sentence in the file will be tokenized, so that the output stays at a reasonable length:

  1. The first step is to simply use the built-in Python string .split() method. This results in the following:
print(first_sentence.split())
['We', 'are...
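
Beyond .split(), NLTK ships several tokenizers of its own. A minimal sketch of two of them, word_tokenize and regexp_tokenize (the punkt model must be downloaded for word_tokenize):

from nltk.tokenize import word_tokenize, regexp_tokenize

sentence = "We are seeking developers with demonstrable experience in: ASP.NET, C#, SQL Server, and AngularJS."

# word_tokenize keeps punctuation marks as separate tokens
print(word_tokenize(sentence))

# regexp_tokenize with r'\w+' keeps only runs of word characters, dropping punctuation
print(regexp_tokenize(sentence, r'\w+'))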

Performing stemming

Stemming is the process of cutting down a token to its stem. Technically, it is the process of reducing inflected (and sometimes derived) words to their word stem - the base root form of the word. As an example, the words fishing, fished, and fisher stem from the root word fish. This helps to reduce the set of words being processed to a smaller base set that is more easily processed.

The most common algorithm for stemming was created by Martin Porter, and NLTK provides an implementation of this algorithm in the PorterStemmer. NLTK also provides an implementation of the Snowball stemmer, which was also created by Porter and was designed to handle languages other than English. There is one more implementation provided by NLTK, referred to as the Lancaster stemmer, which is considered the most aggressive of the three.

...
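
To illustrate the difference between the three stemmers mentioned above, here is a minimal sketch, independent of the book's 07 scripts:

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

words = ['fishing', 'fished', 'fisher', 'iteratively', 'factories']

porter = PorterStemmer()
snowball = SnowballStemmer('english')   # Snowball also supports other languages
lancaster = LancasterStemmer()

# compare how aggressively each algorithm trims the same words
for word in words:
    print(word, porter.stem(word), snowball.stem(word), lancaster.stem(word))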

Performing lemmatization

Lemmatization is a more methodical process of converting words to their base form. Where stemming generally just chops off the ends of words, lemmatization takes into account the morphological analysis of each word, evaluating context and part of speech to determine the inflected form, and then applies different rules to arrive at the root.

How to do it

Lemmatization can be performed in NLTK using the WordNetLemmatizer. This class uses WordNet, a large lexical database of English, to make its decisions. The code in the 07/04_lemmatization.py file extends the previous stemming example to also calculate the lemmatization of each word. The code of importance is the following:

from nltk.stem...
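
Independently of the book's script, a minimal sketch of the lemmatizer on its own (it assumes the wordnet data set has been downloaded; the pos argument tells WordNet which part of speech to use):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# nouns are the default part of speech
print(lemmatizer.lemmatize('geese'))             # goose
# pass pos='v' to lemmatize verbs
print(lemmatizer.lemmatize('running', pos='v'))  # run
print(lemmatizer.lemmatize('are', pos='v'))      # be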

Identifying and removing stop words

Stop words are common words that, in a natural language processing situation, do not provide much contextual meaning. They are often the most common words in a language and, at least in English, tend to be articles and pronouns, such as I, me, the, is, which, who, and at, among others. Removing these words before processing documents often makes the analysis of their meaning easier, and hence many tools support this ability. NLTK is one of them, and comes with support for stop word removal for roughly 22 languages.

How to do it

Proceed with the recipe as follows (code is available in 07/06_freq_dist.py):

  1. The following demonstrates stop word removal using NLTK. First,...
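
A minimal sketch of the general pattern, assuming the stopwords corpus has been downloaded and starting from a small list of word tokens rather than the book's scraped text:

from nltk.corpus import stopwords

tokens = ['we', 'are', 'seeking', 'developers', 'with', 'demonstrable', 'experience']

# build a set of English stop words for fast membership tests
english_stops = set(stopwords.words('english'))

# keep only the tokens that are not stop words
cleaned = [t for t in tokens if t not in english_stops]
print(cleaned)   # ['seeking', 'developers', 'demonstrable', 'experience']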

Calculating the frequency distribution of words

A frequency distribution counts the number of occurrences of distinct data values. These are of value as we can use them to determine which words or phrases within a document are most common, and from that infer those that have greater or lesser value.

Frequency distributions can be calculated using several different techniques. We will examine them using the facilities built into NLTK.

How to do it

NLTK provides a class, nltk.probability.FreqDist, that allows us to very easily calculate the frequency distribution of values in a list. Let's examine using this class (code is in 07/freq_dist.py):

  1. To create a frequency distribution using NLTK, start by importing the...
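
As a minimal, self-contained sketch of the class (using a throwaway word list rather than the book's scraped text):

from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

text = "the quick brown fox jumps over the lazy dog and the quick cat"
tokens = word_tokenize(text)

# FreqDist counts the occurrences of each distinct token
fdist = FreqDist(tokens)
print(fdist['the'])           # 3
print(fdist.most_common(3))   # the three most frequent tokens, led by ('the', 3)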

Identifying and removing rare words

We can remove words with low occurrences by finding words with low frequency counts, words that fall outside a certain deviation from the norm, or simply words from a list considered rare within the given domain. The technique we will use works the same in each case.

How to do it

Rare words can be removed by building a list of those rare words and then removing them from the set of tokens being processed. The list of rare words can be determined by using the frequency distribution provided by NLTK. You then decide what frequency count should be used as the rare word threshold:

  1. The script in the 07/07_rare_words.py file extends that of the frequency distribution recipe...
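
A minimal sketch of the threshold approach (the cut-off of one occurrence is an arbitrary choice for illustration):

from nltk.probability import FreqDist

tokens = ['python', 'scraping', 'python', 'nltk', 'python', 'nltk', 'xslt']

fdist = FreqDist(tokens)

# treat any word seen only once as rare (an arbitrary threshold for this sketch)
rare = set(w for w in fdist if fdist[w] <= 1)

# drop the rare words from the token stream
filtered = [t for t in tokens if t not in rare]
print(filtered)   # ['python', 'python', 'nltk', 'python', 'nltk']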

Identifying and removing short words

Removal of short words can also be useful in removing noise words from the content. The following examines removing words of a certain length or shorter. It also demonstrates the opposite by selecting the words not considered short (having a length of more than the specified short word length).

How to do it

We can leverage the frequency distribution from NLTK to efficiently calculate the short words. We could just scan all of the words in the source, but it is more efficient to scan the lengths of the keys in the resulting distribution, as they form a significantly smaller set of data:

  1. The script in the 07/08_short_words.py file exemplifies this process. It starts by loading...
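
A minimal sketch of filtering on length via the distribution's keys (the cut-off of three characters is an arbitrary choice):

from nltk.probability import FreqDist

tokens = ['we', 'are', 'seeking', 'developers', 'with', 'experience', 'in', 'sql']

fdist = FreqDist(tokens)

# the distribution's keys are the distinct words, a smaller set than the raw tokens
short = set(w for w in fdist if len(w) <= 3)
long_enough = [t for t in tokens if t not in short]

print(sorted(short))   # ['are', 'in', 'sql', 'we']
print(long_enough)     # ['seeking', 'developers', 'with', 'experience']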

Removing punctuation marks

Depending upon the tokenizer used and the input given to that tokenizer, it may be desirable to remove punctuation from the resulting list of tokens. The regexp_tokenize function with '\w+' as the expression removes punctuation well, but word_tokenize does not do this very well and will return many punctuation marks as their own tokens.

How to do it

Removing punctuation marks from our tokens is done similarly to the removal of other words within our tokens: we use a list comprehension and select only those items that are not punctuation marks. The script in the 07/09_remove_punctuation.py file demonstrates this. Let's walk through the process:

  1. We'll start with the following, which will...
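
A minimal sketch of the list-comprehension approach, using Python's string.punctuation as the set of marks to discard:

import string
from nltk.tokenize import word_tokenize

text = "We are seeking developers with experience in: ASP.NET, C#, and SQL Server."
tokens = word_tokenize(text)

# string.punctuation contains the common ASCII punctuation marks
punctuation = set(string.punctuation)

# keep only the tokens that are not a lone punctuation mark
cleaned = [t for t in tokens if t not in punctuation]
print(cleaned)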

Piecing together n-grams

Much has been written about using NLTK to identify n-grams within text. An n-gram is a set of words, n words in length, that is common within a document/corpus (occurring two or more times). A 2-gram is any two words commonly repeated, a 3-gram is a three-word phrase, and so on. We will not look into determining the n-grams in a document. We will focus on reconstructing known n-grams from our token streams, as we consider those n-grams more important to a search result than the two or three independent words found in any order.

In the domain of parsing job listings, important 2-grams can be things such as Computer Science, SQL Server, Data Science, and Big Data. Additionally, we could consider C# a 2-gram of 'C' and '#', which is why we might not want to use the regex parser or treat '#' as punctuation when processing...
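
One way to rebuild known 2-grams from a token stream (not necessarily how the book's tech2grams helper does it) is NLTK's MWETokenizer, which re-joins specified multi-word expressions; a minimal sketch:

from nltk.tokenize import MWETokenizer

# the 2-grams we want to treat as single tokens
known_2grams = [('SQL', 'Server'), ('Data', 'Science'), ('Big', 'Data')]

tokenizer = MWETokenizer(known_2grams, separator=' ')

tokens = ['experience', 'with', 'SQL', 'Server', 'Data', 'Science', 'and', 'Big', 'Data']
print(tokenizer.tokenize(tokens))
# ['experience', 'with', 'SQL Server', 'Data Science', 'and', 'Big Data']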

Scraping a job listing from StackOverflow

Now let's pull a bit of this together to scrape information from a StackOverflow job listing. We are going to look at just one listing at this time so that we can learn the structure of these pages and pull information from them. In later chapters, we will look at aggregating results from multiple listings. Let's now just learn how to do this.
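
As a rough sketch of the kind of fetch-and-parse step involved (the URL here is a placeholder and the selectors are not the book's actual code; the recipe itself walks through the real page structure):

import requests
from bs4 import BeautifulSoup

# placeholder URL for a single job listing page
url = 'https://stackoverflow.com/jobs/example-listing'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# which tags and attributes to extract depends on the page structure;
# printing the page title is just a smoke test that the fetch worked
print(soup.title.get_text())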

Getting ready

Reading and cleaning the description in the job listing

The description of the job listing is still in HTML. We will want to extract the valuable content from this data, so we will need to parse the HTML and then perform tokenization, stop word removal, common word removal, some tech 2-gram processing, and in general all of those different processes. Let's look at doing these.
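
Pulled together from the earlier recipes, a minimal sketch of what such a cleaning pipeline can look like (the clean_description function and its sample input are placeholders for what the recipe builds up step by step):

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def clean_description(html_description):
    # strip the HTML tags, keeping only the visible text
    text = BeautifulSoup(html_description, 'html.parser').get_text()

    # tokenize and lower-case the words
    tokens = [t.lower() for t in word_tokenize(text)]

    # drop stop words and single-character tokens such as lone punctuation
    stops = set(stopwords.words('english'))
    return [t for t in tokens if t not in stops and len(t) > 1]

print(clean_description('<p>We are seeking developers with SQL Server experience.</p>'))
# ['seeking', 'developers', 'sql', 'server', 'experience']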

Getting ready

I have collapsed the code for determining tech-based 2-grams into the 07/tech2grams.py file. We will use the tech_2grams function within the file.

How to do it...

The code...
