You're reading from Mastering Data Mining with Python - Find patterns hidden in your data

Product type: Book
Published in: Aug 2016
Reading level: Intermediate
ISBN-13: 9781785889950
Edition: 1st Edition
Languages: Python
Tools: NLTK, Scikit-learn
Concepts: Data Mining
Author: Megan Squire

Chapter 7. Automatic Text Summarization

In an era of information overload, the objective of text summarization is to write a program that can reduce the size of a text while preserving the main points of its meaning. The task is somewhat similar to the way an architect might create a scale model of a building. The scale model gives the viewer a sense of the important parts of the structure, but does so with a smaller footprint, fewer details, and without the same expense in time or materials.

Consider Reddit, a news-oriented social media site, with its thousands of news articles posted daily by users. Is it possible to generate a short summary of a news article that preserves the key facts and general meaning of the original story? A few Reddit users created summary bots to do exactly this. These so-called TLDR bots (too long; didn't read) post summaries of user-submitted news stories, usually including a link to the original story and statistics to show by what percentage they reduced...

What is automatic text summarization?

In the academic literature, text summarization is often proposed as a solution to information overload, and we in the 21st century like to think that we are uniquely positioned in history in having to deal with this problem. However, even in the 1950s, when automatic text summarization techniques were in their infancy, the stated goal was similar. H.P. Luhn's 1958 paper, The automatic creation of literature abstracts, available in a number of places online, including at http://altaplana.com/ibm-luhn58-LiteratureAbstracts.pdf, describes a text summarization method intended to save a prospective reader time and effort in finding useful information in a given article or report. The paper also observes that the problem of finding information is being aggravated by the ever-increasing output of technical literature.

Luhn proposed a text summarization method where the computer would read each sentence in a paper, extract the frequently occurring words, which he calls significant words...
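The idea of scoring sentences by their frequently occurring words can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not Luhn's exact procedure: the stop word list, the `min_count` threshold, the sentence-splitting heuristic, and the function names `split_sentences`, `tokenize`, and `summarize` are all hypothetical choices made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
              "that", "on", "for", "with", "as", "are", "was"}


def split_sentences(text):
    """Split text after ., ! or ? followed by whitespace (a rough heuristic)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def summarize(text, num_sentences=2, min_count=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)

    # Words that occur at least min_count times are treated as "significant".
    counts = Counter(w for s in sentences for w in tokenize(s))
    significant = {w for w, c in counts.items() if c >= min_count}

    def score(sentence):
        # Fraction of a sentence's content words that are significant.
        words = tokenize(sentence)
        if not words:
            return 0.0
        return sum(1 for w in words if w in significant) / len(words)

    # Rank sentence indices by score; ties keep document order (stable sort).
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]


if __name__ == "__main__":
    sample = ("Cats chase mice. Mice fear cats. "
              "The weather is nice today. Dogs bark loudly.")
    for sentence in summarize(sample, num_sentences=2):
        print(sentence)
```

In the sample text, only "cats" and "mice" occur more than once, so the two sentences containing them score highest. A real summarizer would need a fuller stop word list and a more careful sentence tokenizer than the regular expression assumed here.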

Tools for text summarization

Since this book is about data mining with Python, we will concentrate on the tools, libraries, and applications designed for text summarization in a Python environment. However, if you ever find yourself in a non-Python environment, or if you have a special case that calls for an off-the-shelf solution, you will be glad to know that there are dozens of text summarization tools for other programming environments, many of which require no programming at all. In fact, the autotldr bot we discussed at the beginning of this chapter uses a package called SMMRY, which has an API that is accessible via REST and returns JSON. You can read more about SMMRY at http://smmry.com/api.

Here we will discuss three Python solutions: a simple NLTK-based method, a Gensim-based method, and a Python summarization package called Sumy.

Naive text summarization using NLTK

So far in this book, we have used NLTK for a variety of tasks including...

Summary

Automatic text summarization is a field that is growing in importance as the volume of data in the world increases. There are numerous approaches to text summarization, but all of them rely on constructing mathematical representations of the words and sentences in a document and then, through extractive or abstractive methods, reducing the document to its most important parts. We reviewed three common extractive summarization approaches that can be integrated into our Python code: an NLTK-based summarizer, a Gensim-based approach, and a new package called Sumy with its numerous embedded summarizers. We then compared these approaches by passing the same text sample through different summarization algorithms to see how the results differed.

In this chapter, we have begun thinking about what makes a sentence important or a word a key word. In the next chapter, we will be learning about topic modeling, which...

Author (1)

Megan Squire

Megan Squire is a professor of computing sciences at Elon University. Her primary research interest is in collecting, cleaning, and analyzing data about how free and open source software is made. She is one of the leaders of the FLOSSmole.org, FLOSSdata.org, and FLOSSpapers.org projects.