You're reading from Natural Language Processing with Python Quick Start Guide

Product type: Book
Published in: Nov 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789130386
Edition: 1st
Author (1)
Nirant Kasliwal

Nirant Kasliwal maintains a curated list of natural language processing (NLP) resources, which GitHub's machine learning collection features as a go-to guide. Nobel Laureate Dr. Paul Romer found his programming notes on Jupyter Notebooks helpful. Nirant won the first-ever NLP Google Kaggle Kernel Award. At Soroco, he works on challenges such as image segmentation and intent categorization. His state-of-the-art language modeling results are available as Hindi2vec.
Text Representations - Words to Numbers

Computers today cannot act on words or text directly; text must first be represented as meaningful sequences of numbers. These long sequences of decimal numbers are called vectors, and this step is often referred to as the vectorization of text.
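As a toy illustration of what vectorization means (a sketch of the idea, not the book's code), consider mapping a sentence to a vector of word counts over a fixed vocabulary:

```python
# Minimal sketch: represent a sentence as a vector of word counts
# over a fixed vocabulary (the simplest text vectorization).
def vectorize(text, vocabulary):
    """Return one count per vocabulary word."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

vocab = ["the", "cat", "sat", "mat"]
print(vectorize("The cat sat on the mat", vocab))  # [2, 1, 1, 1]
```

Real vectorizers such as TF-IDF or word2vec refine this idea considerably, but the end product is the same: text in, numbers out.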

So, where are these word vectors used? Consider the following:

  • In text classification and summarization tasks
  • During similar word searches, such as synonyms
  • In machine translation (for example, when translating text from English to German)
  • When understanding similar texts (for example, Facebook articles)
  • During question and answer sessions, and general tasks (for example, chatbots used in appointment scheduling)

Very frequently, we see word vectors used in some form of categorization task. For instance, using a machine learning or deep learning model for sentiment analysis, with the following text vectorization methods:

  • TF...
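Among these, TF-IDF (term frequency-inverse document frequency) is the classic first choice. A minimal by-hand sketch of the standard formula follows (illustrative only; in practice you would use a library such as scikit-learn or Gensim's TfidfModel):

```python
import math

# tf-idf(t, d) = (count of t in d / len(d)) * log(N / df(t)),
# where df(t) is the number of documents containing t.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "meowed"],
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

# "the" occurs in every document, so it scores zero;
# "cat" is more distinctive and scores higher.
print(tf_idf("the", docs[0], docs))  # 0.0
print(tf_idf("cat", docs[0], docs))
```

Words that appear everywhere are down-weighted to zero, while rarer, more informative words dominate the vector.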

Vectorizing a specific dataset

This section focuses almost exclusively on word vectors and how we can leverage the Gensim library to work with them.

Some of the questions we want to answer in this section include these:

  • How do we use pre-trained embeddings, such as GloVe?
  • How do we handle out-of-vocabulary (OOV) words? (Hint: fastText)
  • How do we train our own word2vec vectors on our own corpus?
  • How do we train our own fastText vectors?
  • How do we compare the two using similar-word queries?

First, let's get started with some simple imports, as follows:

import gensim
print(f'gensim: {gensim.__version__}')
> gensim: 3.4.0

Please ensure that your Gensim version is at least 3.4.0. This is a very popular package, maintained and developed mostly by the text processing experts at RaRe Technologies. They use the same library in...

Word representations

The most popular names in word embedding are word2vec by Google (Mikolov) and GloVe by Stanford (Pennington, Socher, and Manning). fastText seems to be fairly popular for multilingual sub-word embeddings.
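To see why sub-words help, here is a sketch of fastText's character n-gram idea (a simplification, not fastText's actual implementation): each word is broken into character n-grams, with < and > marking word boundaries, and the word's vector is built from its n-gram vectors. An unseen word still shares n-grams with words seen in training, which is how fastText handles out-of-vocabulary words.

```python
# Sketch of fastText-style character trigrams; "<" and ">" mark
# the word boundaries, as in the fastText papers.
def char_ngrams(word, n=3):
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']
```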

We advise that you don't use word2vec or GloVe. Instead, use fastText vectors, which are generally better and come from the same lead author. word2vec was introduced by T. Mikolov et al. (https://scholar.google.com/citations?user=oBu8kMMAAAAJ&hl=en) when he was with Google, and it performs well on word similarity and analogy tasks.

GloVe was introduced by Pennington, Socher, and Manning from Stanford in 2014 as a statistical approximation for word embedding. The word vectors are created by the matrix factorization of word-word co-occurrence matrices.
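As a toy illustration of the statistics GloVe starts from (a sketch of the counting step only, not GloVe itself), here is a word-word co-occurrence count with a symmetric window of one:

```python
from collections import defaultdict

def cooccurrence(sentences, window=1):
    """Count (word, context_word) pairs within a symmetric window."""
    counts = defaultdict(int)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, sent[j])] += 1
    return counts

sents = [["the", "cat", "sat"], ["the", "cat", "ran"]]
counts = cooccurrence(sents)
print(counts[("the", "cat")])  # 2
print(counts[("cat", "sat")])  # 1
```

GloVe then factorizes a weighted, logarithmic version of this matrix to obtain the word vectors.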

If forced to pick the lesser of two evils, we recommend using GloVe over word2vec. This is because GloVe outperforms...

Document embedding

Document embedding is often an underrated technique. The key idea is to compress an entire document (for example, a patent or a customer review) into one single vector. This vector, in turn, can be used for many downstream tasks.

Empirical results show that document vectors outperform bag-of-words models as well as other techniques for text representation.

Among the most useful downstream tasks is the ability to cluster text. Text clustering has several uses, ranging from data exploration to online classification of incoming text in a pipeline.

In particular, we are interested in document modeling using doc2vec on a small dataset. Unlike sequence models such as RNNs, where the word order is captured in the generated sentence vectors, doc2vec sentence vectors are word order independent. This word order independence means...
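The simplest way to see word order independence (a schematic illustration, not Gensim's doc2vec) is with a bag-of-words representation: two sentences with the same words in different orders map to the identical vector, even when their meanings differ.

```python
from collections import Counter

def bow(text):
    """Order-independent bag-of-words representation."""
    return Counter(text.lower().split())

a = bow("the movie was not good it was bad")
b = bow("the movie was not bad it was good")
print(a == b)  # True, despite the opposite sentiment
```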

Summary

This chapter was more than an introduction to the Gensim API. We now know how to load pre-trained GloVe vectors, and you can use these vector representations instead of TF-IDF in any machine learning model.

We looked at why fastText vectors are often better than word2vec vectors on a small training corpus, and learned that you can use them with any ML model.

We learned how to build doc2vec models. You can now extend this doc2vec approach to build sent2vec- or paragraph2vec-style models as well. In the paragraph2vec case, only the unit of text changes: each document is a paragraph instead.

In addition, we now know how to quickly perform sanity checks on our doc2vec vectors without an annotated test corpus. We did this by checking the rank dispersal metric.
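The idea behind that sanity check can be sketched as follows (toy vectors and a hypothetical `self_rank` helper, not the book's code): re-infer a vector for a training document, then check where that document ranks in its own most-similar list. If most documents rank themselves at position 0, the model is at least self-consistent.

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

# Toy document vectors standing in for a trained doc2vec model.
doc_vecs = {"d0": [1.0, 0.1], "d1": [0.1, 1.0], "d2": [0.7, 0.5]}

def self_rank(doc_id, inferred):
    """Rank of doc_id among all documents by similarity to `inferred`."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(doc_vecs[d], inferred),
                    reverse=True)
    return ranked.index(doc_id)

# Re-inferring d0 (here: a small perturbation of its vector)
# should rank d0 first, i.e. rank 0.
print(self_rank("d0", [0.95, 0.15]))  # 0
```

Aggregating these self-ranks over the whole training set gives a cheap, annotation-free signal of model quality.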
