You're reading from Natural Language Processing with Python Quick Start Guide

Product type: Book
Published in: Nov 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789130386
Edition: 1st
Author (1)
Nirant Kasliwal

Nirant Kasliwal maintains a curated list of natural language processing (NLP) resources, which GitHub's machine learning collection features as a go-to guide. Nobel Laureate Dr. Paul Romer found his programming notes on Jupyter Notebooks helpful. Nirant won the first-ever NLP Google Kaggle Kernel Award. At Soroco, he works on challenges such as image segmentation and intent categorization. His state-of-the-art language modeling results are available as Hindi2vec.
Text Representations - Words to Numbers

Computers today cannot act on words or text directly; text must first be represented as meaningful sequences of numbers. These long sequences of decimal numbers are called vectors, and this step is often referred to as the vectorization of text.
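As a toy illustration of what vectorization means (a sketch of the idea, not the book's code), consider mapping a sentence to a vector of word counts over a fixed vocabulary:

```python
# Minimal sketch: represent a sentence as a vector of word counts
# over a fixed vocabulary (the simplest text vectorization).
def vectorize(text, vocabulary):
    """Return one count per vocabulary word."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

vocab = ["the", "cat", "sat", "mat"]
print(vectorize("The cat sat on the mat", vocab))  # [2, 1, 1, 1]
```

Real vectorizers such as TF-IDF or word2vec refine this idea considerably, but the end product is the same: text in, numbers out.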

So, where are these word vectors used? Consider the following:

  • In text classification and summarization tasks
  • During similar word searches, such as synonyms
  • In machine translation (for example, when translating text from English to German)
  • When understanding similar texts (for example, Facebook articles)
  • During question and answer sessions, and general tasks (for example, chatbots used in appointment scheduling)

Very frequently, we see word vectors used in some form of categorization task. For instance, using a machine learning or deep learning model for sentiment analysis, with the following text vectorization methods:

  • TF...
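Among these, TF-IDF (term frequency-inverse document frequency) is the classic first choice. A minimal by-hand sketch of the standard formula follows (illustrative only; in practice you would use a library such as scikit-learn or Gensim's TfidfModel):

```python
import math

# tf-idf(t, d) = (count of t in d / len(d)) * log(N / df(t)),
# where df(t) is the number of documents containing t.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "meowed"],
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

# "the" occurs in every document, so it scores zero;
# "cat" is more distinctive and scores higher.
print(tf_idf("the", docs[0], docs))  # 0.0
print(tf_idf("cat", docs[0], docs))
```

Words that appear everywhere are down-weighted to zero, while rarer, more informative words dominate the vector.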

Vectorizing a specific dataset

This section focuses almost exclusively on word vectors and how we can leverage the Gensim library to work with them.

Some of the questions we want to answer in this section include these:

  • How do we use pre-trained embeddings, such as GloVe?
  • How do we handle out-of-vocabulary (OOV) words? (Hint: fastText)
  • How do we train our own word2vec vectors on our own corpus?
  • How do we train our own fastText vectors?
  • How do we compare the two using similar-word queries?

First, let's get started with some simple imports, as follows:

import gensim
print(f'gensim: {gensim.__version__}')
> gensim: 3.4.0

Please ensure that your Gensim version is at least 3.4.0. This is a very popular package, maintained and developed mostly by the text processing experts at RaRe Technologies. They use the same library in...

Word representations

The most popular names in word embedding are word2vec by Google (Mikolov) and GloVe by Stanford (Pennington, Socher, and Manning). fastText seems to be fairly popular for multilingual sub-word embeddings.
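To see why sub-words help, here is a sketch of fastText's character n-gram idea (a simplification, not fastText's actual implementation): each word is broken into character n-grams, with < and > marking word boundaries, and the word's vector is built from its n-gram vectors. An unseen word still shares n-grams with words seen in training, which is how fastText handles out-of-vocabulary words.

```python
# Sketch of fastText-style character trigrams; "<" and ">" mark
# the word boundaries, as in the fastText papers.
def char_ngrams(word, n=3):
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']
```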

We advise that you don't use word2vec or GloVe. Instead, use fastText vectors, which are generally better and come from the same lead author. word2vec was introduced by T. Mikolov et al. (https://scholar.google.com/citations?user=oBu8kMMAAAAJ&hl=en) when he was with Google, and it performs well on word similarity and analogy tasks.

GloVe was introduced by Pennington, Socher, and Manning from Stanford in 2014 as a statistical approximation for word embedding. The word vectors are created by the matrix factorization of word-word co-occurrence matrices.
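As a toy illustration of the statistics GloVe starts from (a sketch of the counting step only, not GloVe itself), here is a word-word co-occurrence count with a symmetric window of one:

```python
from collections import defaultdict

def cooccurrence(sentences, window=1):
    """Count (word, context_word) pairs within a symmetric window."""
    counts = defaultdict(int)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, sent[j])] += 1
    return counts

sents = [["the", "cat", "sat"], ["the", "cat", "ran"]]
counts = cooccurrence(sents)
print(counts[("the", "cat")])  # 2
print(counts[("cat", "sat")])  # 1
```

GloVe then factorizes a weighted, logarithmic version of this matrix to obtain the word vectors.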

If forced to pick the lesser of two evils, we recommend using GloVe over word2vec. This is because GloVe outperforms...

Document embedding

Document embedding is often an underrated technique. The key idea is to compress an entire document (for example, a patent or a customer review) into one single vector. This vector, in turn, can be used for many downstream tasks.

Empirical results show that document vectors outperform bag-of-words models as well as other techniques for text representation.

Among the most useful downstream tasks is the ability to cluster text. Text clustering has several uses, ranging from data exploration to online classification of incoming text in a pipeline.

In particular, we are interested in document modeling using doc2vec on a small dataset. Unlike sequence models such as RNNs, where the word order is captured in the generated sentence vectors, doc2vec sentence vectors are word order independent. This word order independence means...
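The simplest way to see word order independence (a schematic illustration, not Gensim's doc2vec) is with a bag-of-words representation: two sentences with the same words in different orders map to the identical vector, even when their meanings differ.

```python
from collections import Counter

def bow(text):
    """Order-independent bag-of-words representation."""
    return Counter(text.lower().split())

a = bow("the movie was not good it was bad")
b = bow("the movie was not bad it was good")
print(a == b)  # True, despite the opposite sentiment
```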

Summary

This chapter was more than an introduction to the Gensim API. We now know how to load pre-trained GloVe vectors, and you can use these vector representations instead of TF-IDF in any machine learning model.

We looked at why fastText vectors are often better than word2vec vectors on a small training corpus, and learned that you can use them with any ML model.

We learned how to build doc2vec models. You can now extend this doc2vec approach to build sent2vec- or paragraph2vec-style models as well. In the paragraph2vec case, only the unit of text changes: each document is a paragraph instead.

In addition, we now know how to quickly perform sanity checks on our doc2vec vectors without an annotated test corpus. We did this by checking the rank dispersal metric.
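The idea behind that sanity check can be sketched as follows (toy vectors and a hypothetical `self_rank` helper, not the book's code): re-infer a vector for a training document, then check where that document ranks in its own most-similar list. If most documents rank themselves at position 0, the model is at least self-consistent.

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

# Toy document vectors standing in for a trained doc2vec model.
doc_vecs = {"d0": [1.0, 0.1], "d1": [0.1, 1.0], "d2": [0.7, 0.5]}

def self_rank(doc_id, inferred):
    """Rank of doc_id among all documents by similarity to `inferred`."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(doc_vecs[d], inferred),
                    reverse=True)
    return ranked.index(doc_id)

# Re-inferring d0 (here: a small perturbation of its vector)
# should rank d0 first, i.e. rank 0.
print(self_rank("d0", [0.95, 0.15]))  # 0
```

Aggregating these self-ranks over the whole training set gives a cheap, annotation-free signal of model quality.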
