You're reading from  Mastering Data Mining with Python - Find patterns hidden in your data

Product type: Book
Published in: Aug 2016
Reading level: Intermediate
ISBN-13: 9781785889950
Edition: 1st Edition
Author (1)
Megan Squire

Megan Squire is a professor of computing sciences at Elon University. Her primary research interest is in collecting, cleaning, and analyzing data about how free and open source software is made. She is one of the leaders of the FLOSSmole.org, FLOSSdata.org, and FLOSSpapers.org projects.

Chapter 7. Automatic Text Summarization

In an era of information overload, the objective of text summarization is to write a program that can reduce the size of a text while preserving the main points of its meaning. The task is somewhat similar to the way an architect might create a scale model of a building. The scale model gives the viewer a sense of the important parts of the structure, but does so with a smaller footprint, fewer details, and without the same expense in time or materials.

Consider Reddit, a news-oriented social media site, with its thousands of news articles posted daily by users. Is it possible to generate a short summary of a news article that preserves the key facts and general meaning of the original story? A few Reddit users created summary bots to do exactly this. These so-called TLDR bots (too long; didn't read) post summaries of user-submitted news stories, usually including a link to the original story and statistics to show by what percentage they reduced...

What is automatic text summarization?


In the academic literature, text summarization is often proposed as a solution to information overload, and we in the 21st century like to think that we are uniquely positioned in history in having to deal with this problem. However, even in the 1950s, when automatic text summarization techniques were in their infancy, the stated goal was similar. H.P. Luhn's 1958 paper, "The automatic creation of literature abstracts", available in a number of places online, including at http://altaplana.com/ibm-luhn58-LiteratureAbstracts.pdf, describes a text summarization method intended to save a prospective reader time and effort in finding useful information in a given article or report. Luhn also observes that the problem of finding information is being aggravated by the ever-increasing output of technical literature.

Luhn proposed a text summarization method where the computer would read each sentence in a paper, extract the frequently occurring words, which he calls significant words...
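
To make the idea of frequently occurring "significant words" concrete, here is a minimal sketch in plain Python. It is not Luhn's actual procedure: the stop-word list, the regular-expression tokenizer, and the scoring rule (count the significant words in each sentence) are all simplifying assumptions made for illustration, and the function names are hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption, not taken from Luhn's paper).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is",
              "it", "that", "on", "for", "was", "by"}


def split_sentences(text):
    """Naively split text on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def significant_words(text, top_n=5):
    """Return the top_n most frequent words that are not stop words."""
    words = [w for w in tokenize(text) if w not in STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(top_n)]


def rank_sentences(text, top_n=5):
    """Score each sentence by how many significant words it contains.

    Returns (score, sentence) pairs, highest score first. Ties keep
    their original document order because sorted() is stable.
    """
    significant = set(significant_words(text, top_n))
    scored = []
    for sentence in split_sentences(text):
        score = sum(1 for w in tokenize(sentence) if w in significant)
        scored.append((score, sentence))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    sample = ("Cats chase mice. Mice eat cheese. "
              "The weather is nice. Cats and mice are enemies.")
    print(significant_words(sample, top_n=2))
    for score, sentence in rank_sentences(sample, top_n=2):
        print(score, sentence)
```

Taking the first few sentences from `rank_sentences` gives a crude extractive summary. A real summarizer would need a proper sentence tokenizer and a fuller stop-word list, such as those NLTK provides.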

Tools for text summarization


Since this book is about data mining with Python, we will concentrate on understanding some of the tools, libraries, and applications designed for text summarization in a Python environment. However, if you ever find yourself in a non-Python environment, or if you have a special case where you want to use an off-the-shelf or non-Python solution, you will be glad to know that there are dozens of text summarization tools for other programming environments, many of which require no programming at all. In fact, the autotldr bot we discussed at the beginning of this chapter uses a package called SMMRY, which has an API that is accessible via REST and returns JSON. You can read more about SMMRY at http://smmry.com/api.

Here we will discuss three Python solutions: a simple NLTK-based method, a Gensim-based method, and a Python summarization package called Sumy.

Naive text summarization using NLTK

So far in this book, we have used NLTK for a variety of tasks including...

Summary


Automatic text summarization is a field that is growing in importance as the volume of data in the world increases. There are numerous approaches to text summarization, but all of them rely on constructing mathematical representations of the words and sentences in a document and then, through extractive or abstractive methods, reducing the document to its most important parts. We reviewed three of the common extractive summarization libraries that can be integrated into our Python code: an NLTK-based summarizer, a Gensim-based approach, and a new package called Sumy with its numerous embedded summarizers. We then compared these approaches by passing the same text sample through the different summarization algorithms to see how their results differed.

In this chapter, we have begun thinking about what makes a sentence important or a word key. In the next chapter, we will be learning about topic modeling, which...
