Introduction to NLP
“Why do we need NLP?” You may ask this question as you've witnessed the advancement of natural language processing (NLP) in recent years. Let’s see how NLP helped a well-established investment firm named “Harmony Investments.” For decades, Harmony Investments had been renowned for its astute financial strategies and portfolio management, ranging from stocks and bonds to real estate and alternative investments. However, the sheer volume and variety of data sources, including news articles, earnings reports, social media posts, and financial statements, made it nearly impossible to manually analyze all the information. The firm's analysts were spending an excessive amount of time collecting and reviewing data. Recognizing the need for a more efficient and data-driven approach, the firm partnered with a leading AI solutions provider to implement NLP-driven solutions into their business operations. They used NLP algorithms to review news articles, press releases, and social media platforms in real time. This analysis enabled the firm to react swiftly. They used NLP tools that automatically summarized lengthy earnings reports. This reduced the time the analysts spent on manual document review. They used NLP-powered sentiment analysis to gauge public sentiment surrounding specific stocks or market segments. Analysts had more time for strategic research and developing innovative investment strategies. As a result, Harmony Investments not only retained its reputation as a leading investment firm but also attracted new clients and expanded its portfolio.
Joe is a data scientist who is new to NLP. He and his data analyst colleague, Jacob, are interested in learning NLP techniques. They want to acquire the NLP techniques that can deliver the NLP benefits as discussed. They have certainly heard of ChatGPT and all the news about large language models (LLMs). They want to learn NLP systematically, from concepts to practice, and want to find a textbook that can bridge them to LLMs without diving into LLMs first. If you are like Joe or Jacob, then this book is for you.
A fundamental step in NLP for computers to understand texts is text representation, which converts a collection of text documents into numerical values. In the simplest schemes, each document is represented as a vector in a high-dimensional space, where each dimension corresponds to a unique word in the entire corpus. This helps computers understand what words mean and how they relate to each other in sentences. This book starts with bag-of-words (BoW), bag-of-N-grams, and term frequency-inverse document frequency (TF-IDF). An advance in text representation is the word embedding technique. Word embeddings are dense vector representations of words that capture semantic relationships between words based on their context in a large dataset. Word embeddings, like Word2Vec, create continuous vector representations where words with similar meanings have similar vectors, so they capture both semantic and syntactic relationships.
Topic modeling is a significant NLP subject. It classifies documents into topics for document retrieval, categorization, tagging, or annotation. This book gives more insight into the milestone topic modeling technique, Latent Dirichlet Allocation (LDA). Another milestone topic modeling technique is BERTopic. Let me briefly describe the development history of Bidirectional Encoder Representations from Transformers (BERT). The seminal paper “Attention Is All You Need” by Vaswani et al. enabled many transformer-based word embeddings and LLMs. One of these word embeddings is BERT. Can we do topic modeling to classify documents based on BERT word embeddings? That’s the origin of BERTopic. I have included BERTopic in this book together with LDA so you get to see the differences. This will provide a bridge to the transformer-based NLP techniques.
This book is a practical handbook with code snippets. I will cover many techniques in the Gensim library. Gensim is an open source Python library for topic modeling, document clustering, and other unsupervised learning tasks on collections of textual documents. It provides a high-level interface for building and training a variety of models. The name Gensim stands for “generate similar.” It finds the similarities between documents to summarize texts or to classify documents into topics.
In this chapter, we will cover the following topics:
- Introduction to natural language processing
- NLU + NLG = NLP
- Gensim and its NLP modeling techniques
- Topic modeling with BERTopic
- Common NLP Python modules included in this book
After completing this chapter, you will get to know the development history of NLP. You will be able to explain the key NLP techniques that Gensim covers. You will also understand other popular NLP Python libraries that are often used together.
Introduction to natural language processing
NLP is based on 50 years of rich research into linguistics and processing algorithms. It is a branch of computer science or artificial intelligence (AI) that uses computer algorithms to analyze, understand, and generate human language data. The algorithms process human language to “understand” its full meaning. NLP has a wide range of applications that include the following:
- Text mining: Extracting information from large amounts of text data, such as documents, emails, and social media posts.
- Information retrieval: Searching for relevant information in large text databases. In this book, you will learn many techniques for information retrieval.
- Question answering: Answering questions posed in natural language.
- Machine translation: Translating text from one language to another.
- Sentiment analysis: Identifying the tone and emotion of text data.
- Natural language generation (NLG): Generating text that mimics human language.
As I said before, NLP has a long development history. Let’s look into it briefly.
NLU + NLG = NLP
NLP is an umbrella term that covers natural language understanding (NLU) and NLG. We’ll go through both in the next sections.
Many languages, such as English, German, and Chinese, have been developing for hundreds of years and continue to evolve. Humans can use languages artfully in various social contexts. Now, we are asking a computer to understand human language. What’s very rudimentary to us may not be so apparent to a computer. Linguists have contributed much to the development of computers’ understanding in terms of syntax, semantics, phonology, morphology, and pragmatics.
NLU focuses on understanding the meaning of human language. It extracts text or speech input and then analyzes the syntax, semantics, phonology, morphology, and pragmatics in the language. Let’s briefly go over each one:
- Syntax: This is about the study of how words are arranged to form phrases and clauses, as well as the use of punctuation, order of words, and sentences.
- Semantics: This is about the possible meanings of a sentence based on the interactions between words in the sentence. It is concerned with the interpretation of language, rather than its form or structure. For example, the word “table” as a noun can refer to “a piece of furniture having a smooth flat top that is usually supported by one or more vertical legs” or a data frame in a computer language.
Let’s elaborate more on semantics with two jokes. The first example is as follows:
Patient: “Doctor, doctor! I’ve broken my arm in three places!”
Doctor: “Well, stop going to those places, then.”
The patient uses the word “places” to mean the spots on the arm; the doctor uses “places” to mean physical locations.
The second example is as follows:
My coworker: “Do you ever think about working from home?”
Me: “I don’t even think about work at work!”
The first “work” in my reply means the tasks of my job. The second “work,” in “at work,” means one’s place of employment.
NLU can understand the two meanings of a word in such jokes through a technique called word embedding. We will learn more about this in Chapter 2, Text Representation.
- Phonology: This is about the study of the sound system of a language, including the sounds of speech (phonemes), how they are combined to form words, and how they are organized into larger units such as syllables and stress patterns. For example, the sounds represented by the letters “p” and “b” in English are distinct phonemes. A phoneme is the smallest unit of sound in a language that can change the meaning of a word. Consider the words “pat” and “bat.” The only difference between these two words is the initial sound, but their meanings are different.
- Morphology: This is the study of the structure of words, including the way in which they are formed from smaller units of meaning called morphemes. It originally comes from “morph,” the shape or form, and “ology,” the study of something. Morphology is important because it helps us understand how words are formed and how they relate to each other. It also helps us understand how words change over time and how they are related to other words in a language. For example, the word “unkindness” consists of three separate morphemes: the prefix “un-,” the root “kind,” and the suffix “-ness.”
- Pragmatics: This is the study of how language is used in a social context. Pragmatics is important because it helps us understand how language works in real-world situations, and how language can be used to convey meaning and achieve specific purposes. For example, if you offer to buy your friend a McDonald’s burger, large fries, and a large drink, your friend may reply “no” because he is concerned about gaining weight. Your friend may simply mean that the burger meal is high in calories, but in a social context the same reply can also imply that he worries about being overweight.
Now, let’s understand NLG.
While NLU is about a computer reading text to comprehend it, NLG is about a computer writing text. The term “generation” in NLG refers to an NLP model generating meaningful words or even articles. Today, when you compose an email or type a sentence in an app, it presents possible words to complete your sentence or performs automatic correction. These are applications of NLG. The broader term “generative AI” was coined for generative models that cover language, voice, image, and video generation. ChatGPT and GPT-4 from OpenAI are probably the most famous examples of generative AI. I will briefly introduce ChatGPT, GPT-4, and other open source products, such as gpt.h2o.ai, and present the HuggingFace.co open source community.
ChatGPT and GPT-4
You can enter a prompt for ChatGPT to generate a poem, a prompt, a story, and so on. With its use of the reinforcement learning from human feedback (RLHF) technique, ChatGPT is designed to respond to questions in a way that sounds natural and human-like, making it easier for people to communicate with the model. It is also able to remember the earlier turns of the current conversation. If you ask ChatGPT, “Should I wear a blue shirt or a white shirt for tomorrow’s outdoor company meeting when it is likely to be hot and sunny?”, it will formulate a response by inferring a sequence of words that are likely to come next. It answers, “When it is hot and sunny, you may want to wear a white shirt.” If you reply, “How about on a gloomy day?”, ChatGPT understands that you are continuing your prior question about what to wear for tomorrow’s outdoor company meeting. It does not take the second question as an independent question and answer it randomly. This ability to remember and contextualize inputs is what lets ChatGPT carry on some semblance of a human conversation rather than give naïve, one-off answers. Hence, memory for long context is key for next-generation language models such as GPT-4.
Generative Pre-trained Transformer 4 (GPT-4) was released by OpenAI on March 14, 2023. It is a transformer-based model that can predict the next word or token. It can answer questions, summarize text, translate text to other languages, and generate code, blog posts, stories, conversations, and other content types. The ability to remember and contextualize inputs depends on the context window, which is key for a language model. The context window of GPT-4 at launch was increased to 8,192 tokens, or roughly 6,000 English words. Given that English conversation runs at about 120 words per minute, or 7,200 words per hour, a context window of 6,000 words holds close to an hour of conversation.
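The arithmetic that relates a context window to conversation length can be checked with a short sketch. The function name and the word counts passed in are illustrative assumptions; only the speaking rate of 120 words per minute comes from the text.

```python
def conversation_minutes(words, words_per_minute=120):
    """Return how many minutes of speech a given number of words covers."""
    return words / words_per_minute


# Illustrative window sizes, expressed in words rather than tokens.
for window_in_words in (6000, 8000):
    minutes = conversation_minutes(window_in_words)
    print(f"{window_in_words} words is about {minutes:.1f} minutes of conversation")
```

Note that language models measure the context window in tokens, and a token is often shorter than a word, so a window of a given token count holds fewer words than the same number suggests.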
What are the token and the transformer mentioned in the previous paragraph? A token is the NLP term for a unit of text, typically a word or a piece of a word, that has been separated from a sentence, or tokenized. A transformer is the milestone model architecture introduced in 2017 by Vaswani et al. In Chapter 2, Text Representation, I will explain how tokenization works in text preprocessing. In Chapter 14, LDA and BERTopic, I will give high-level explanations of a transformer model and others including the BERTopic model that builds on it.
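To make the idea of a token concrete, here is a minimal sketch of word-level tokenization using only Python's standard library. The `tokenize` helper and its regular expression are assumptions chosen for illustration; they are not the tokenizer used by Gensim, ChatGPT, or GPT-4, which typically split text into subword pieces.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


sentence = "I don't even think about work at work!"
tokens = tokenize(sentence)
print(tokens)
print("Number of tokens:", len(tokens))
```

Even this simple splitter turns a string of characters into a list of units that later steps can count or convert into numbers.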
gpt.h2o.ai is an LLM by H2O.ai. It aims to offer enterprise-grade capabilities similar to those of ChatGPT, but it is open source. I let gpt.h2o.ai define itself by asking “What is gpt.h2o.ai?” It answered: “gpt.h2o.ai is a large language model that was trained on a combination of web crawl data and Reddit comments. The model can generate human-like text and has been used for a variety of tasks, including generating code, answering questions, and summarizing text.”
Let’s look at its storytelling capability. I entered: “Write a fable on a lion that rescued a boy when the boy was under the attack of aliens.” It returned the following interesting story:
“Once upon a time, there was a young boy named Jack who loved to explore the woods near his home. One day, while he was out on one of his adventures, he came across a strange sight. There, in the middle of the forest, was a spaceship. It was unlike anything Jack had ever seen before. As he approached the ship, he heard a strange noise coming from inside. Jack peered through the window and saw a group of aliens inside. They were holding a young boy captive and seemed to be doing something to him. Jack knew he had to act quickly. Just then, a lion appeared out of nowhere. The lion roared loudly, scaring the aliens away. The boy was freed and the spaceship took off into the sky. The boy thanked the lion for saving him and asked how he knew to come to his rescue. The lion explained that it had been watching over him for some time and knew that he was in danger. The boy and the lion became friends. Jack never forgot the day that he saw the lion save the boy from the aliens.”
New research on LLMs has burgeoned over the past few years. New LLMs, datasets, and solutions become hard to keep up with and create a new knowledge barrier for anyone who wants to enter the world of generative AI. Hugging Face provides a platform called HuggingFace.co that allows everyone to access open source LLMs, academic papers, and various datasets that trained the LLMs on its platform. You can also share your LLMs and datasets.
You may ask why LLMs can answer questions, write book reports, draft notes, or summarize documents. An important factor is the data on which they were trained or with which they were fine-tuned. These large-scale datasets include “News/Wikipedia,” “Web crawling,” “Questions and Answers (Q&A),” “Books,” and “Reading comprehension.” In “Large Language Model Datasets,” I have given a more detailed review of the datasets that trained prominent LLMs such as GPT-2, GPT-3, GPT-4, and so on.
I trust you have gained a better understanding of NLU and NLG. Next, I will introduce the NLP techniques covered by Gensim.
Gensim and its NLP modeling techniques
Gensim is actively maintained and supported by a community of developers and is widely used in academic research and industry applications. It covers many important NLP techniques that make up the workforce of today’s NLP. That’s one of the reasons why I have developed this book to help data scientists.
Last year, I was at a company’s year-end party. The ballroom was filled with people standing in groups with their drinks. I walked around and listened for conversation topics where I could chime in. I heard one group talking about the FIFA World Cup 2022 and another group talking about stock markets. I joined the stock markets conversation. In that short moment, my mind had performed “word extractions,” “text summarization,” and “topic classifications.” These tasks are the core tasks of NLP and what Gensim is designed to do.
We perform serious text analyses in professional fields including legal, medical, and business. We organize similar documents into topics. Such work also demands “word extractions,” “text summarization,” and “topic classifications.” In the following sections, I will give you a brief introduction to the key models that Gensim offers so you will have a good overview. These models include the following:
- BoW and TF-IDF
- Latent semantic analysis/indexing (LSA/LSI)
- Word2Vec
- Doc2Vec
- Text summarization
- LDA
- Ensemble LDA
BoW and TF-IDF
Let's take the following two phrases as an example:
- Phrase 1: All the stars we steal from the night sky
- Phrase 2: Will never be enough, never be enough, never be enough for me
BoW encodes each phrase by its word counts, as shown in Figure 1.1. For example, the word “the” appears twice in the first phrase, so it is coded as 2; the word “be” appears three times in the second phrase, so it is coded as 3:
Figure 1.1 – BoW encoding (also presented in the next chapter)
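The BoW encoding of the two phrases can be reproduced with a minimal sketch in plain Python. It deliberately uses only the standard library rather than Gensim, and the `tokenize` helper and variable names are assumptions made for illustration.

```python
import re
from collections import Counter

phrases = [
    "All the stars we steal from the night sky",
    "Will never be enough, never be enough, never be enough for me",
]


def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


# The vocabulary holds every unique word across both phrases, in sorted order.
vocabulary = sorted({word for phrase in phrases for word in tokenize(phrase)})

# Each phrase becomes one vector with one count per vocabulary word.
bow_vectors = []
for phrase in phrases:
    counts = Counter(tokenize(phrase))
    bow_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in bow_vectors:
    print(vector)
```

Each vector has the same length as the vocabulary, and most entries are zero because a phrase uses only a few of the words in the corpus.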
BoW uses the word count to reflect the significance of a word. However, this is not very intuitive. Frequent words may not carry special meanings depending on the type of document. For example, in clinical reports, the words physician, patient, doctor, and nurse appear frequently. The high frequency of these words may overshadow specific words such as bronchitis or stroke in a patient’s document. A better encoding system is to compare the relative word appearance in a document to its appearance throughout the corpus. TF-IDF is designed to reflect the importance of a word in a document by calculating its relevance to a corpus. We will learn the details of this in Chapter 2, Text Representation. At this moment, you just need to know that both BoW and TF-IDF are variations of text representation. They are the building blocks of NLP.
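The idea of weighing a word's frequency in a document against its spread across the corpus can be sketched as follows. The three clinical notes are made up for illustration, and the formula is one common textbook variant (term frequency times the natural log of inverse document frequency); Gensim's own TF-IDF model uses different defaults and normalization.

```python
import math
from collections import Counter

# Hypothetical toy corpus of clinical notes, already lowercased.
corpus = [
    "patient saw doctor patient has bronchitis",
    "patient saw nurse and doctor",
    "doctor says patient had a stroke",
]
docs = [text.split() for text in corpus]
n_docs = len(docs)

# Document frequency: the number of documents that contain each word.
doc_freq = Counter(word for doc in docs for word in set(doc))


def tf_idf(doc):
    """Score each word in one document by term frequency times log(N / df)."""
    counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }


scores = tf_idf(docs[0])
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word:12s} {score:.3f}")
```

In this toy corpus, “patient” and “doctor” occur in every note, so their scores drop to zero, while “bronchitis” occurs in only one note and rises to the top.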
Although BoW and TF-IDF appear simple, they already have real-world applications in different fields. An important application of BoW and TF-IDF is to prevent spam emails from going to the inbox folder of an email account. Spam emails are ubiquitous, unavoidable, and quickly fill up the spam folder. BoW or TF-IDF will help to distinguish the characteristics of a spam email from regular emails.
Suppose you are a football fan searching a library catalog with the keywords “famous World Cup players,” and the system can only do exact keyword matching. Such an old computer system returns all articles that contain “famous,” “world,” “cup,” or “players.” It also returns a lot of unrelated articles such as “famous singer,” “150 most famous poems,” and “world-renowned scientist.” This is terrible, isn’t it? A simple keyword match cannot serve as a search engine.
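The weakness of naive keyword matching can be seen in a minimal sketch. The list of titles is a made-up catalog for illustration, and `any_keyword_match` is a hypothetical helper, not a function from any library.

```python
import re

# Hypothetical catalog; the last title shares no word with the query.
titles = [
    "World Cup players of the century",
    "famous singer",
    "150 most famous poems",
    "world-renowned scientist",
    "Introduction to gardening",
]
query = "famous World Cup players"


def words(text):
    """Return the set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def any_keyword_match(query, title):
    """Return True if the title shares at least one word with the query."""
    return bool(words(query) & words(title))


hits = [title for title in titles if any_keyword_match(query, title)]
print(hits)
```

Only the first title is about football, yet every title that happens to contain “famous” or “world” is returned as well.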
Latent semantic analysis (LSA) was developed in the 1990s. It's an NLP solution that far surpasses naïve keyword matching and has become an important search engine algorithm. Prior to that, in 1988, an LSA-based information retrieval system was patented (US Patent #4839853, now expired) and named “latent semantic indexing,” so the technique is also called latent semantic indexing (LSI). Gensim and many other reports name LSA as LSI so as not to confuse LSA with LDA. In this book, I will adopt the same naming convention. In Chapter 6, Latent Semantic Indexing with Gensim, I will show you the code example to build an LSI model.
You can search with keywords such as the following:
This can return relevant news articles. One of the results is as follows:
Notice it searches by meaning but not by word matching.
The Word2Vec technique developed by Mikolov et al. in 2013 was a significant milestone in NLP. Its idea was ground-breaking: it embeds words or phrases from a text corpus as dense, continuous-valued vectors, hence the name word-to-vector. These vector representations capture semantic relationships and contextual information between words. Its applications are prevalent in many recommendation systems. Figure 1.2 shows words that are close to the word “iron,” including “gunpowder,” “metals,” and “steel”; words far from “iron” are “organic,” “sugar,” and “grain”:
Figure 1.2 – An overview of Word2Vec (also presented in Chapter 7)
Also, the relative distance of words measures the similarity of meanings. Word2Vec enables us to measure and visualize the similarities or dissimilarities of words or concepts. This is a fantastic innovation.
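How the distance between word vectors expresses similarity can be sketched with cosine similarity. The three-dimensional vectors below are hand-made numbers chosen only for illustration; they are not the output of a trained Word2Vec model, whose vectors usually have a hundred or more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not learned from any corpus.
vectors = {
    "iron": [0.90, 0.80, 0.10],
    "steel": [0.85, 0.90, 0.15],
    "sugar": [0.10, 0.20, 0.90],
}

for other in ("steel", "sugar"):
    similarity = cosine_similarity(vectors["iron"], vectors[other])
    print(f"iron vs {other}: {similarity:.3f}")
```

A value near 1 means the two vectors point in almost the same direction, which is how words that appear in similar contexts end up close to each other.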
Can you see how this idea can also apply to movie recommendations? Each movie can be considered a word in someone’s watching history. I Googled the words movie recommendations, and it returned many movies under “Top picks for you”:
Figure 1.3 – An overview of Word2Vec as a movie recommendation system (also presented in Chapter 7)
Word2Vec represents a word with a vector. Can we represent a sentence or a paragraph with a vector? Doc2Vec is designed to do so. Doc2Vec transforms articles into vectors and enables semantic search for related articles. This kind of document embedding has enabled many commercial products. For example, when you search for a job on LinkedIn.com or Indeed.com, you see similar job postings presented next to your target job posting. Such a feature can be built with Doc2Vec. In Chapter 8, Doc2Vec with Gensim, you will build a real Doc2Vec model with code examples.
When documents are tagged by topic, we can retrieve the documents easily. In the old days, if you went to a library for books of a certain genre, you used the indexing system to find them. Now, with all the digital content, documents can be tagged systematically by topic modeling techniques.
The preceding library example may be an easier one, if compared to all sorts of social media posts, job posts, emails, news articles, or tweets. Topic models can tag digital content for effective searching or retrieving. LDA is an important topic modeling technique and has many commercial use cases. Figure 1.4 shows a snapshot of the LDA model output that we will build in Chapter 11, LDA Modeling:
Figure 1.4 – pyLDAvis (also presented in Chapter 11)
Each bubble represents a topic. The distance between any two bubbles represents the difference between the two topics. The red bars on top of the blue bars represent the estimated frequency of a word for a chosen topic. Documents on the same topic are similar in their content. If you are reading a document that belongs to Topic 75 and want to read more related articles, LDA can return other articles on Topic 75.
The goal of topic modeling for a set of documents is to find topics that are reliable and reproducible. If you replicate the same modeling process for the same documents, you expect to produce the same set of topics. However, past experiments have shown that while repeated runs of the same model mostly produce the same set of topics, some runs produce extra topics. This creates a serious issue in practice: Which model outcome is the correct one to use? This issue seriously limits the applications of LDA. The ensemble method in machine learning offers a way out. In machine learning, an ensemble is a technique that combines multiple individual models to improve predictive accuracy and generalization performance. Ensemble LDA builds many models to identify a core set of topics that is reliable and reproducible. In Chapter 13, The Ensemble LDA for Model Stability, I will explain the algorithm with visual aids in more detail. We will also build our own model with code examples.
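The intuition of keeping only the topics that recur across runs can be sketched with a toy example. This is a deliberately simplified illustration, not the Ensemble LDA algorithm implemented in Gensim: the topics are hard-coded sets of top words standing in for three hypothetical model runs, and the Jaccard threshold of 0.5 is an arbitrary assumption.

```python
def jaccard(a, b):
    """Return the overlap between two sets of words, from 0 to 1."""
    return len(a & b) / len(a | b)


# Hypothetical top words per topic from three runs on the same documents.
# The first run contains an extra "weather" topic that the others lack.
runs = [
    [{"stock", "market", "shares"}, {"goal", "match", "cup"}, {"weather", "rain", "storm"}],
    [{"stock", "market", "trading"}, {"goal", "match", "cup"}],
    [{"stock", "shares", "market"}, {"match", "cup", "team"}],
]


def stable_topics(runs, threshold=0.5):
    """Keep topics from the first run that have a close match in every other run."""
    reference, others = runs[0], runs[1:]
    stable = []
    for topic in reference:
        if all(any(jaccard(topic, other) >= threshold for other in run) for run in others):
            stable.append(topic)
    return stable


core = stable_topics(runs)
for topic in core:
    print(sorted(topic))
```

The stock market and football topics survive because every run contains a close match, while the weather topic is dropped because it shows up in only one run.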
Topic modeling with BERTopic
BERTopic is a topic modeling algorithm that is based on the BERT word embeddings. In Chapter 14, LDA and BERTopic, we will learn the key components of BERTopic and build our own model. In addition, the BERTopic modeling has its own visualization functions that are similar to pyLDAvis, as seen in Figure 1.4. We will learn to use all the visualization functions as well.
Figure 1.5 shows the top words for eight topics:
Figure 1.5 – An overview of topic modeling results by BERTopic (also presented in Chapter 14)
I trust these introductions have given you a strong appetite to dive into each chapter and apply the models discussed in your future work. Now, let's get familiar with the Python modules commonly used in NLP.
Common NLP Python modules included in this book
This book includes a few Python modules for the best learning outcomes. If an NLP task can be performed by other libraries, such as NLTK, I will show you the code examples for comparison. The libraries included in this book are detailed in the following sections.
spaCy is one of the most widely used production-level, open source libraries for NLP. It makes many processing tasks easy with reliable code and outcomes. If you work with a large volume of texts for text preprocessing, spaCy is an excellent choice. It is written in Python and Cython, so its interface stays simple and concise while its performance-critical parts are compiled for speed.
It can perform a wide range of NLP operations well. These NLP operations include the following tasks:
- Tokenization: This breaks text into individual words or tokens. To a computer, a sentence is just a string of characters. The string has to be separated into words.
- Part-of-speech (PoS) tagging: This assigns grammatical labels to each word in a sentence. For example, the sentence “She loves the beautiful flower” has a pronoun (“she”), a verb (“loves”), an adjective (“beautiful”), and a noun (“flower”). The labeling for the pronoun, verb, adjective, and noun is called PoS tagging.
- Named entity recognition (NER): This identifies named entities such as names, organizations, locations, and so on. For example, in the sentence “I went to New York City on July 4th,” the named entities would be “New York City” (a place), and “July 4th” (a date). It is worth mentioning that spaCy’s transformer-based pipelines, such as en_core_web_trf, build their NER component on a BERT-style transformer architecture, while its smaller pipelines use lighter neural networks. As we will learn about BERT in this book, it is helpful to be aware of this.
- Lemmatization: This reduces words to their base or dictionary form. We will learn more about lemmatization in Chapter 3, Text Wrangling and Preprocessing.
- Rule-based matching: This can find sequences of words based on user-defined rules.
- Word vectors: These represent words as numerical vectors. When two words become vectors, they can be compared in the vector space. Word embedding and vectorization is an important step in NLP. spaCy provides the functions to do so. We will learn about the concept and practice of word vectorization in Chapter 7, Using Word2Vec.
These are just some of the main capabilities of spaCy, and it offers many more features and functionalities for NLP tasks.
NLTK is an open source Python library for natural language processing. It provides a suite of tools for working with text data, including tokenization, PoS tagging, NER, and text classification. It provides interfaces to over 50 corpora and lexical resources, such as WordNet. NLTK also includes pre-trained models for tasks such as sentiment analysis. It is widely used in academia and industry for research and development in NLP.
This chapter provided a landscape view of the NLP topics covered in this book. We learned that the development of NLP was due to the success of NLU and NLG. Then, we surveyed the NLP techniques that are covered by Gensim. The main techniques include BoW, TF-IDF, LSA/LSI, Word2Vec, Doc2Vec, LDA, and Ensemble LDA. We were also introduced to BERTopic modeling. We then learned about the other two popular NLP Python libraries, spaCy and NLTK.
As we all know, a computer operates on zeros and ones but cannot comprehend the great works of Shakespeare. So, how do ChatGPT and other language models understand language? The very first step is to convert words to numerical values. The next chapter will teach you about text representation.
- Describe natural language processing (NLP).
- What is natural language understanding (NLU)?
- What is natural language generation (NLG)?
- List some of the NLP modeling techniques used by Gensim.
- List some of the most used NLP Python modules.
- Once you have answered the previous questions, let’s access https://chat.openai.com/ to search for answers. This time, let’s key in a question, called a “prompt,” to get answers. You are encouraged to experiment with ChatGPT with variations of the questions. For example, you can test the following for Question 1:
- “Please describe natural language processing.”
- “Please describe NLP to a high schooler.”
- “Please describe NLP in one paragraph.”
- “Please describe NLP with an analogy.”
- Wei, Low De (2022, December 2). This AI Chatbot Is Blowing People’s Minds. Here’s What It’s Been Writing. Bloomberg.com. https://www.bloomberg.com/news/articles/2022-12-02/chatgpt-openai-s-new-essay-writing-chatbot-is-blowing-people-s-minds
- Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. ArXiv, abs/1706.03762.
- Kuo, C. (2023, May 9). Large Language Model Datasets. Medium. https://dataman-ai.medium.com/large-language-model-datasets-95df319a110
- Mikolov, T., Chen, K., Corrado, G.S., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. International Conference on Learning Representations. https://arxiv.org/abs/1301.3781
- Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv, abs/1810.04805.
- Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T.J., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language Models are Few-Shot Learners. ArXiv, abs/2005.14165.