Reader small image

You're reading from  Natural Language Processing with Python Quick Start Guide

Product typeBook
Published inNov 2018
Reading LevelIntermediate
PublisherPackt
ISBN-139781789130386
Edition1st Edition
Languages
Right arrow
Author (1)
Nirant Kasliwal
Nirant Kasliwal
author image
Nirant Kasliwal

Nirant Kasliwal maintains an awesome list of NLP natural language processing resources. GitHub's machine learning collection features this as the go-to guide. Nobel Laureate Dr. Paul Romer found his programming notes on Jupyter Notebooks helpful. Nirant won the first ever NLP Google Kaggle Kernel Award. At Soroco, image segmentation and intent categorization are the challenges he works with. His state-of-the-art language modeling results are available as Hindi2vec.
Read more about Nirant Kasliwal

Right arrow

Cleaning a corpus with FlashText

But what about a web-scale corpus with millions of documents and a few thousand keywords? Regex can take several days to run over such exact searches because of its linear time complexity. How can we improve this?

We can use FlashText for this very specific use case:

  • A few million documents with a few thousand keywords
  • Exact keyword matches either by replacing or searching for the presence of those keywords

Of course, there are several different possible solutions to this problem. I recommend this for its simplicity and focus on solving one problem. It does not require us to learn new syntax or set up specific tools such as ElasticSearch.

The following table gives you a comparison of using Flashtext versus compiled regex for searching:

The following tables gives you a comparison of using FlashText versus compiled regex for substitutions...

lock icon
The rest of the page is locked
Previous PageNext Page
You have been reading a chapter from
Natural Language Processing with Python Quick Start Guide
Published in: Nov 2018Publisher: PacktISBN-13: 9781789130386

Author (1)

author image
Nirant Kasliwal

Nirant Kasliwal maintains an awesome list of NLP natural language processing resources. GitHub's machine learning collection features this as the go-to guide. Nobel Laureate Dr. Paul Romer found his programming notes on Jupyter Notebooks helpful. Nirant won the first ever NLP Google Kaggle Kernel Award. At Soroco, image segmentation and intent categorization are the challenges he works with. His state-of-the-art language modeling results are available as Hindi2vec.
Read more about Nirant Kasliwal