
You're reading from  Natural Language Understanding with Python

Product type: Book
Published in: Jun 2023
Publisher: Packt
ISBN-13: 9781804613429
Edition: 1st Edition
Author: Deborah A. Dahl

Deborah A. Dahl is the principal at Conversational Technologies, with over 30 years of experience in natural language understanding technology. She has developed numerous natural language processing systems for research, commercial, and government applications, including a system for NASA, and speech and natural language components on Android. She has taught over 20 workshops on natural language processing, consulted on many natural language processing applications for her customers, and written over 75 technical papers. This is Deborah's fourth book on natural language understanding topics. Deborah has a PhD in linguistics from the University of Minnesota and completed postdoctoral studies in cognitive science at the University of Pennsylvania.

Selecting Approaches and Representing Data

This chapter will cover the next steps in getting ready to implement a natural language processing (NLP) application. We start with some basic considerations: how much data an application needs, how to handle specialized vocabulary and syntax, and what types of computational resources are required. We then discuss the first steps in NLP – text representation formats that will get our data ready for processing with NLP algorithms. These formats include symbolic and numerical approaches for representing words and documents. To some extent, data formats and algorithms can be mixed and matched in an application, so it is helpful to consider data representation independently from the consideration of algorithms.

The first section will review general considerations for selecting NLP approaches that have to do with the type of application we’re working on, and with the data that...

Selecting NLP approaches

NLP can be done with a wide variety of techniques. When you get started on an NLP application, you have many choices to make, and these choices are affected by many factors. One of the most important factors is the type of application itself and the information that the system needs to extract from the data to perform the intended task. The next section addresses how the application affects the choice of techniques.

Fitting the approach to the task

Recall from Chapter 1 that there are many different types of NLP applications, divided into interactive and non-interactive applications. The type of application you choose will play an important role in choosing the technologies that will be applied to the task. Another way of categorizing applications is in terms of the level of detail required to extract the needed information from the document. At the coarsest level of analysis (for example, classifying documents into two different categories...

Representing language for NLP applications

For computers to work with natural language, it has to be represented in a form that they can process. These representations can be symbolic, where the words in a text are processed directly, or numeric, where the representation is in the form of numbers. We will describe both of these approaches here. Although the numeric approach is the primary approach currently used in NLP research and applications, it is worth becoming somewhat familiar with the ideas behind symbolic processing.

Symbolic representations

Traditionally, NLP has been based on processing the words in texts directly, as words. This approach was embodied in a standard approach where the text was analyzed in a series of steps that were aimed at converting an input consisting of unanalyzed words into a meaning. In a traditional NLP pipeline, shown in Figure 7.1, each step in processing, from input text to meaning, produces an output that adds more structure to its input...
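The step-by-step idea can be sketched in plain Python. The stages, toy lexicon, and noun-phrase rule below are invented for illustration only; real symbolic pipelines use full tools such as NLTK or spaCy and include many more stages (morphological analysis, parsing, semantic interpretation).

```python
# A toy illustration of a traditional symbolic pipeline: each stage
# adds more structure to the output of the previous one.
# (The stage names and tiny lexicon are illustrative, not from the book.)

TOY_LEXICON = {"the": "DET", "cat": "NOUN", "sat": "VERB", "on": "ADP", "mat": "NOUN"}

def tokenize(text):
    """Stage 1: split raw text into word tokens."""
    return text.lower().replace(".", " .").split()

def pos_tag(tokens):
    """Stage 2: attach a part-of-speech tag to each token ('X' = unknown)."""
    return [(tok, TOY_LEXICON.get(tok, "X")) for tok in tokens]

def chunk_noun_phrases(tagged):
    """Stage 3: group DET+NOUN sequences into simple noun phrases."""
    phrases, i = [], 0
    while i < len(tagged):
        if tagged[i][1] == "DET" and i + 1 < len(tagged) and tagged[i + 1][1] == "NOUN":
            phrases.append((tagged[i][0], tagged[i + 1][0]))
            i += 2
        else:
            i += 1
    return phrases

tokens = tokenize("The cat sat on the mat.")
tagged = pos_tag(tokens)
print(chunk_noun_phrases(tagged))  # [('the', 'cat'), ('the', 'mat')]
```

Each function consumes the previous stage's output, which is the defining property of the pipeline architecture: structure accumulates from raw text toward meaning.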

Representing language numerically with vectors

A common mathematical technique for representing language in preparation for machine learning is through the use of vectors. Both documents and words can be represented with vectors. We’ll start by discussing document vectors.

Understanding vectors for document representation

We have seen that texts can be represented as sequences of symbols such as words, which is the way that we read them. However, it is usually more convenient for computational NLP purposes to represent text numerically, especially if we are dealing with large quantities of text. Another advantage is that numerically represented text can be processed with a much wider range of mathematical techniques.

A common way to represent both documents and words is by using vectors, which are basically one-dimensional arrays. Along with words, we can also use vectors to represent other linguistic units, such as lemmas or stemmed words...
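As a minimal sketch of document vectors, binary and count bag-of-words representations can be built with nothing beyond the standard library. The two toy documents here are invented for illustration; in practice you would use a library vectorizer such as scikit-learn's CountVectorizer.

```python
from collections import Counter

# Build count and binary bag-of-words vectors over a shared vocabulary.
docs = ["the cat sat on the mat", "the dog sat on the log"]
tokenized = [d.split() for d in docs]

# Vocabulary: one vector dimension per distinct word, in sorted order.
vocab = sorted({w for toks in tokenized for w in toks})

def count_vector(tokens):
    """Count bag of words: each position holds the word's frequency."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def binary_vector(tokens):
    """Binary bag of words: 1 if the word occurs in the document, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]

print(vocab)                        # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(count_vector(tokenized[0]))   # [1, 0, 0, 1, 1, 1, 2]
print(binary_vector(tokenized[0]))  # [1, 0, 0, 1, 1, 1, 1]
```

Note that word order is discarded: both vectors record only which words occur (and, for the count version, how often), which is what makes the representation a "bag" of words.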

Representing words with context-independent vectors

So far, we have looked at several ways of representing similarities among documents. However, finding out that two or more documents are similar to each other is not very specific, although it can be useful for some applications, such as intent or document classification. In this section, we will talk about representing the meanings of words with word vectors.

Word2Vec

Word2Vec is a popular library for representing words as vectors, published by Google in 2013 (Mikolov, Tomas; et al. (2013). Efficient Estimation of Word Representations in Vector Space. https://arxiv.org/abs/1301.3781). The basic idea behind Word2Vec is that every word in a corpus is represented by a single vector that is computed based on all the contexts (nearby words) in which the word occurs. The intuition behind this approach is that words with similar meanings will occur in similar contexts. This intuition is summarized in a famous quote from the linguist...
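The distributional intuition can be illustrated with a raw co-occurrence sketch. To be clear, this is not the Word2Vec algorithm itself, which trains a shallow neural network to learn dense vectors (in practice via a library such as gensim); the tiny corpus and window size below are invented for illustration.

```python
import math
from collections import defaultdict

# Sketch of the intuition behind Word2Vec: represent each word by counts
# of the words that occur near it, then compare words by cosine similarity.
corpus = [
    "i drank a cup of coffee",
    "i drank a cup of tea",
    "the cat chased the mouse",
]

# Count co-occurrences within a symmetric window of 2 words.
window = 2
cooc = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    toks = sent.split()
    for i, w in enumerate(toks):
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if i != j:
                cooc[w][toks[j]] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = lambda x: math.sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v))

# 'coffee' and 'tea' share contexts ('cup', 'of'), so their vectors are
# more similar to each other than either is to 'mouse'.
print(cosine(cooc["coffee"], cooc["tea"]) > cosine(cooc["coffee"], cooc["mouse"]))  # True
```

Word2Vec improves on raw counts by learning low-dimensional dense vectors, which generalize across contexts instead of memorizing them, but the underlying principle is the same: similar contexts yield similar vectors.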

Representing words with context-dependent vectors

Word2Vec’s word vectors are context-independent in that a word always has the same vector no matter what context it occurs in. However, the meanings of words are strongly affected by nearby words. For example, the meanings of the word film in "We enjoyed the film" and "The table was covered with a thin film of dust" are quite different. To capture these contextual differences, we would like different vector representations of these words, reflecting the differences in meaning that result from the different contexts. This research direction has been extensively explored in the last few years, starting with the BERT (Bidirectional Encoder Representations from Transformers) system (https://aclanthology.org/N19-1423/ (Devlin et al., NAACL 2019)).

This approach has resulted in great improvements in NLP technology, which we will want to discuss in depth. For that reason, we will postpone...

Summary

In this chapter, we’ve learned how to select different NLP approaches, based on the available data and other requirements. In addition, we’ve learned about representing data for NLP applications. We’ve placed particular emphasis on vector representations, including vector representations of both documents and words. For documents, we’ve covered binary bag of words, count bag of words, and TF-IDF. For representing words, we’ve reviewed the Word2Vec approach and briefly introduced context-dependent vectors, which will be covered in much more detail in Chapter 11.

In the next four chapters, we will take the representations that we’ve learned about in this chapter and show how to train models from them that can be applied to different problems such as document classification and intent recognition. We will start with rule-based techniques in Chapter 8, discuss traditional machine learning techniques in Chapter 9, talk about neural networks...
