
You're reading from  Natural Language Understanding with Python

Product type: Book
Published in: Jun 2023
Publisher: Packt
ISBN-13: 9781804613429
Edition: 1st Edition
Author: Deborah A. Dahl

Deborah A. Dahl is the principal at Conversational Technologies, with over 30 years of experience in natural language understanding technology. She has developed numerous natural language processing systems for research, commercial, and government applications, including a system for NASA, and speech and natural language components on Android. She has taught over 20 workshops on natural language processing, consulted on many natural language processing applications for her customers, and written over 75 technical papers. This is Deborah's fourth book on natural language understanding topics. Deborah has a PhD in linguistics from the University of Minnesota and completed postdoctoral studies in cognitive science at the University of Pennsylvania.

Selecting Approaches and Representing Data

This chapter will cover the next steps in getting ready to implement a natural language processing (NLP) application. We start with some basic considerations: how much data an application needs, how to handle specialized vocabulary and syntax, and what types of computational resources are required. We then discuss the first steps in NLP – text representation formats that will get our data ready for processing with NLP algorithms. These formats include symbolic and numerical approaches for representing words and documents. To some extent, data formats and algorithms can be mixed and matched in an application, so it is helpful to consider data representation independently from the consideration of algorithms.

The first section will review general considerations for selecting NLP approaches that have to do with the type of application we’re working on, and with the data that...

Selecting NLP approaches

NLP can be done with a wide variety of techniques. When you get started on an NLP application, you have many choices to make, and these choices are affected by many factors. One of the most important factors is the type of application itself and the information that the system needs to extract from the data to perform the intended task. The next section addresses how the application affects the choice of techniques.

Fitting the approach to the task

Recall from Chapter 1 that there are many different types of NLP applications, divided into interactive and non-interactive applications. The type of application you choose will play an important role in choosing the technologies that will be applied to the task. Another way of categorizing applications is in terms of the level of detail required to extract the needed information from the document. At the coarsest level of analysis (for example, classifying documents into two different categories...

Representing language for NLP applications

For computers to work with natural language, it has to be represented in a form that they can process. These representations can be symbolic, where the words in a text are processed directly, or numeric, where the representation is in the form of numbers. We will describe both of these approaches here. Although the numeric approach is the primary approach currently used in NLP research and applications, it is worth becoming somewhat familiar with the ideas behind symbolic processing.

Symbolic representations

Traditionally, NLP has been based on processing the words in texts directly, as words. This approach was embodied in a standard approach where the text was analyzed in a series of steps that were aimed at converting an input consisting of unanalyzed words into a meaning. In a traditional NLP pipeline, shown in Figure 7.1, each step in processing, from input text to meaning, produces an output that adds more structure to its input...
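The step-by-step idea can be sketched in plain Python. The stages, toy lexicon, and noun-phrase rule below are invented for illustration only; real symbolic pipelines use full tools such as NLTK or spaCy and include many more stages (morphological analysis, parsing, semantic interpretation).

```python
# A toy illustration of a traditional symbolic pipeline: each stage
# adds more structure to the output of the previous one.
# (The stage names and tiny lexicon are illustrative, not from the book.)

TOY_LEXICON = {"the": "DET", "cat": "NOUN", "sat": "VERB", "on": "ADP", "mat": "NOUN"}

def tokenize(text):
    """Stage 1: split raw text into word tokens."""
    return text.lower().replace(".", " .").split()

def pos_tag(tokens):
    """Stage 2: attach a part-of-speech tag to each token ('X' = unknown)."""
    return [(tok, TOY_LEXICON.get(tok, "X")) for tok in tokens]

def chunk_noun_phrases(tagged):
    """Stage 3: group DET+NOUN sequences into simple noun phrases."""
    phrases, i = [], 0
    while i < len(tagged):
        if tagged[i][1] == "DET" and i + 1 < len(tagged) and tagged[i + 1][1] == "NOUN":
            phrases.append((tagged[i][0], tagged[i + 1][0]))
            i += 2
        else:
            i += 1
    return phrases

tokens = tokenize("The cat sat on the mat.")
tagged = pos_tag(tokens)
print(chunk_noun_phrases(tagged))  # [('the', 'cat'), ('the', 'mat')]
```

Each function consumes the previous stage's output, which is the defining property of the pipeline architecture: structure accumulates from raw text toward meaning.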

Representing language numerically with vectors

A common mathematical technique for representing language in preparation for machine learning is through the use of vectors. Both documents and words can be represented with vectors. We’ll start by discussing document vectors.

Understanding vectors for document representation

We have seen that texts can be represented as sequences of symbols such as words, which is the way that we read them. However, it is usually more convenient for computational NLP purposes to represent text numerically, especially if we are dealing with large quantities of text. Another advantage is that numerically represented text can be processed with a much wider range of mathematical techniques.

A common way to represent both documents and words is by using vectors, which are basically one-dimensional arrays. Along with words, we can also use vectors to represent other linguistic units, such as lemmas or stemmed words...
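As a minimal sketch of document vectors, binary and count bag-of-words representations can be built with nothing beyond the standard library. The two toy documents here are invented for illustration; in practice you would use a library vectorizer such as scikit-learn's CountVectorizer.

```python
from collections import Counter

# Build count and binary bag-of-words vectors over a shared vocabulary.
docs = ["the cat sat on the mat", "the dog sat on the log"]
tokenized = [d.split() for d in docs]

# Vocabulary: one vector dimension per distinct word, in sorted order.
vocab = sorted({w for toks in tokenized for w in toks})

def count_vector(tokens):
    """Count bag of words: each position holds the word's frequency."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def binary_vector(tokens):
    """Binary bag of words: 1 if the word occurs in the document, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]

print(vocab)                        # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(count_vector(tokenized[0]))   # [1, 0, 0, 1, 1, 1, 2]
print(binary_vector(tokenized[0]))  # [1, 0, 0, 1, 1, 1, 1]
```

Note that word order is discarded: both vectors record only which words occur (and, for the count version, how often), which is what makes the representation a "bag" of words.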

Representing words with context-independent vectors

So far, we have looked at several ways of representing similarities among documents. However, finding out that two or more documents are similar to each other is not very specific, although it can be useful for some applications, such as intent or document classification. In this section, we will talk about representing the meanings of words with word vectors.

Word2Vec

Word2Vec is a popular library for representing words as vectors, published by Google in 2013 (Mikolov, Tomas; et al. (2013). Efficient Estimation of Word Representations in Vector Space. https://arxiv.org/abs/1301.3781). The basic idea behind Word2Vec is that every word in a corpus is represented by a single vector that is computed based on all the contexts (nearby words) in which the word occurs. The intuition behind this approach is that words with similar meanings will occur in similar contexts. This intuition is summarized in a famous quote from the linguist...
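The distributional intuition can be illustrated with a raw co-occurrence sketch. To be clear, this is not the Word2Vec algorithm itself, which trains a shallow neural network to learn dense vectors (in practice via a library such as gensim); the tiny corpus and window size below are invented for illustration.

```python
import math
from collections import defaultdict

# Sketch of the intuition behind Word2Vec: represent each word by counts
# of the words that occur near it, then compare words by cosine similarity.
corpus = [
    "i drank a cup of coffee",
    "i drank a cup of tea",
    "the cat chased the mouse",
]

# Count co-occurrences within a symmetric window of 2 words.
window = 2
cooc = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    toks = sent.split()
    for i, w in enumerate(toks):
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if i != j:
                cooc[w][toks[j]] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = lambda x: math.sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v))

# 'coffee' and 'tea' share contexts ('cup', 'of'), so their vectors are
# more similar to each other than either is to 'mouse'.
print(cosine(cooc["coffee"], cooc["tea"]) > cosine(cooc["coffee"], cooc["mouse"]))  # True
```

Word2Vec improves on raw counts by learning low-dimensional dense vectors, which generalize across contexts instead of memorizing them, but the underlying principle is the same: similar contexts yield similar vectors.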

Representing words with context-dependent vectors

Word2Vec’s word vectors are context-independent in that a word always has the same vector no matter what context it occurs in. However, the meanings of words are strongly affected by nearby words. For example, the meanings of the word film in "We enjoyed the film" and "The table was covered with a thin film of dust" are quite different. To capture these contextual differences, we would like different vector representations of these words, reflecting the differences in meaning that result from the different contexts. This research direction has been extensively explored in the last few years, starting with the BERT (Bidirectional Encoder Representations from Transformers) system (https://aclanthology.org/N19-1423/ (Devlin et al., NAACL 2019)).

This approach has resulted in great improvements in NLP technology, which we will want to discuss in depth. For that reason, we will postpone...

Summary

In this chapter, we’ve learned how to select different NLP approaches, based on the available data and other requirements. In addition, we’ve learned about representing data for NLP applications. We’ve placed particular emphasis on vector representations, including vector representations of both documents and words. For documents, we’ve covered binary bag of words, count bag of words, and TF-IDF. For representing words, we’ve reviewed the Word2Vec approach and briefly introduced context-dependent vectors, which will be covered in much more detail in Chapter 11.

In the next four chapters, we will take the representations that we’ve learned about in this chapter and show how to train models from them that can be applied to different problems such as document classification and intent recognition. We will start with rule-based techniques in Chapter 8, discuss traditional machine learning techniques in Chapter 9, talk about neural networks...
