The Natural Language Processing Workshop

Product type: Book
Published: Aug 2020
Publisher: Packt
ISBN-13: 9781800208421
Pages: 452
Edition: 1st
Authors (6):
Rohan Chopra
Aniruddha M. Godbole
Nipun Sadvilkar
Muzaffar Bashir Shah
Sohom Ghosh
Dwight Gunning

Table of Contents (10 Chapters)

Preface
1. Introduction to Natural Language Processing
2. Feature Extraction Methods
3. Developing a Text Classifier
4. Collecting Text Data with Web Scraping and APIs
5. Topic Modeling
6. Vector Representation
7. Text Generation and Summarization
8. Sentiment Analysis
Appendix

7. Text Generation and Summarization

Overview

This chapter begins with the concept of text generation using Markov chains, before moving on to two types of text summarization—namely, abstractive and extractive summarization. You will then explore the TextRank algorithm and use it with different datasets. By the end of this chapter, you will understand the applications and challenges of text generation and summarization using Natural Language Processing (NLP) approaches.

Introduction

The ability to express thoughts in words (sentence generation), the ability to replace a piece of text with different but equivalent text (paraphrasing), and the ability to find the most important parts of a piece of text (summarization) are all key elements of using language. Although sentence generation, paraphrasing, and summarization are challenging tasks in NLP, there have been great strides recently that have made them considerably more accessible. In this chapter, we explore them in detail and see how we can implement them in Python.

Generating Text with Markov Chains

An idea is expressed using the words of a language. Since ideas themselves are not tangible, text generation is a useful lens through which to gauge whether a machine can think on its own. The practical utility of text generation is currently limited largely to auto-complete functionality, aside from a few negative use cases that we will discuss later in this section. Text can be generated in many different ways, which we will explore using Markov chains. Whether the generated text corresponds to a coherent line of thought is a question we will also address later in this section.

Markov Chains

A state space defines all the possible states that a system can be in. A Markov chain consists of a state space and a specific type of successor function. For example, in a simplified state space describing the weather, the states could be Sunny, Cloudy, or Rainy. The successor function describes how a system in its current state can move to a different state or even continue...
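
To make this concrete, here is a minimal sketch of a first-order Markov chain text generator in Python. The toy corpus and the build_chain and generate helpers are illustrative assumptions for this sketch, not code from this chapter.

    import random
    from collections import defaultdict

    def build_chain(text):
        """Map each word (state) to the list of words observed to follow it."""
        words = text.split()
        chain = defaultdict(list)
        for current_word, next_word in zip(words, words[1:]):
            chain[current_word].append(next_word)
        return chain

    def generate(chain, start, length=10):
        """Walk the chain: from each state, pick an observed successor at random."""
        word = start
        output = [word]
        for _ in range(length - 1):
            successors = chain.get(word)
            if not successors:  # dead end: this state has no observed successor
                break
            word = random.choice(successors)
            output.append(word)
        return " ".join(output)

    # Toy corpus (an assumption for illustration).
    corpus = "the weather is sunny the weather is cloudy the weather is rainy"
    print(generate(build_chain(corpus), start="the"))

Because each successor is chosen in proportion to how often it was observed, the generated text is locally plausible but has no global plan, which previews the coherence question raised above.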

Text Summarization

Automated text summarization is the process of using NLP tools to produce concise versions of a text that preserve the key information in the original content. A good summary communicates the content in less text by retaining the key information and filtering out the rest, including any noise (useless text). A shorter text usually takes less time to read, so summarization makes for a more efficient use of the reader's time.

The type of summarization that we are typically taught in school is abstractive summarization. One way to think of abstractive summarization is as a combination of understanding the meaning of a text and expressing that meaning in fewer sentences. It is usually treated as a supervised learning problem, since both the original text and a reference summary are required. However, a piece of text can be summarized in more than one way, which makes it hard to teach a machine to summarize in a general way. While abstractive summarization is an active area...

Key Input Parameters for TextRank

We'll be using the gensim library to implement TextRank. The following are the parameters required for this:

  • text: This is the input text.
  • ratio: This is the required ratio of the number of sentences in the summary to the number of sentences in the input text.

The gensim implementation of the TextRank algorithm uses BM25—a probabilistic variation of TF-IDF—for similarity computation in place of the similarity measure described in step 3 of the algorithm. This will be clearer in the following exercise, in which you will summarize text using TextRank.
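
As a minimal sketch of how these two parameters are passed, assuming gensim 3.x (the gensim.summarization module was removed in gensim 4.0) and an illustrative sample text:

    # Requires gensim < 4.0; the summarization module was removed in gensim 4.0.
    from gensim.summarization.summarizer import summarize

    text = (
        "TextRank is a graph-based algorithm for extractive text summarization. "
        "The algorithm builds a graph in which each sentence of the text is a node. "
        "Edges between sentence nodes are weighted by how similar the sentences are. "
        "The gensim implementation weights sentence similarity using BM25. "
        "Sentences that are similar to many other sentences receive high scores. "
        "The highest-scoring sentences are extracted to form the summary of the text."
    )

    # ratio=0.25 asks for roughly a quarter of the input sentences.
    # gensim logs a warning for inputs shorter than about ten sentences;
    # in practice you would pass a full document.
    print(summarize(text, ratio=0.25))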

Exercise 7.02: Performing Summarization Using TextRank

In this exercise, we will use the classic short story After Twenty Years by O. Henry, which is available on Project Gutenberg, and the first section of the Wikipedia article on Oscar Wilde. We will summarize each text separately so that the summary retains 20% of the sentences in the original text, and then 25% of...

Recent Developments in Text Generation and Summarization

Alan Turing (after whom the Turing Award, the equivalent of the Nobel Prize in computer science, is named) proposed a test for artificial intelligence in 1950. This test, known as the Turing Test, holds that if human questioners cannot distinguish the text responses of a machine from those of a human, then the machine can be deemed intelligent.

Text generation using very large models, such as GPT-2 (with around 1.5 billion parameters) and BERT (Bidirectional Encoder Representations from Transformers, with around 340 million parameters), can aid in auto-completion tasks. Auto-completion presents unique ethical challenges. While it can offer convenience, it can also reinforce biases present in the training data. This is accentuated by the fact that most user interfaces can show only a limited number of options. Furthermore, auto-completion can controversially suggest responses that differ from what the sender originally wants...
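
For context, the following sketch shows auto-completion with a pre-trained GPT-2 checkpoint via the Hugging Face transformers library. The library, the model name, and the prompt are assumptions for illustration; they are not part of this chapter's exercises.

    # Assumes the Hugging Face transformers library (pip install transformers torch);
    # this chapter itself does not use it.
    from transformers import pipeline

    # "gpt2" is the small 124M-parameter checkpoint, not the full 1.5B-parameter model.
    generator = pipeline("text-generation", model="gpt2")

    prompt = "Thank you for your email. I will"
    # Sampling is required to draw several distinct completions.
    completions = generator(prompt, max_length=25, num_return_sequences=3, do_sample=True)
    for completion in completions:
        print(completion["generated_text"])

Note how only a handful of candidate completions are surfaced, which is exactly the limited-options constraint discussed above.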

Practical Challenges in Extractive Summarization

Given the rapid pace of development in NLP, it is all the more important to use compatible versions of the libraries we depend on. A document's suitability for extractive summarization can be evaluated manually. Often, we want to summarize multiple pieces of text, each of which may be short; the TextRank algorithm does not work well in such cases.

Claims reported in this field ought to be taken with a grain of salt until they have been verified. Practitioners ought to subject such claims to naïve tests, such as the Little Red Riding Hood test. We should only use a model if it works and if its limitations related to scope and any biases are considered.

Summary

In this chapter, we learned about text generation using Markov chains and extractive summarization using the TextRank algorithm. We also explored both the power and limitations of various advanced approaches. In the next chapter, we will learn about sentiment analysis.
