Natural language processing, or NLP, is quite popular in the scientific community, but the value of using this Artificial Intelligence (AI) technique to gain business benefits is not immediately obvious to mainstream users. Our focus will be to raise awareness and educate you on the business context of NLP, provide examples of the proliferation of data in unstructured text, and show how NLP can help derive meaningful insights to inform strategic decisions within an enterprise.
In this introductory chapter, we will be establishing the basic context to familiarize you with some of the underlying concepts of AI and Machine Learning (ML), the types of challenges that NLP can help solve, common pitfalls when building NLP solutions, and how NLP works and what it's really good at doing, with examples.
In this chapter, we will cover the following:
- Introducing NLP
- Overcoming the challenges in building NLP solutions
- Understanding why NLP is becoming mainstream
- Introducing the AWS ML stack
Language is as old as civilization itself and no other communication tool is as effective as the spoken or written word. In their childhood days, the authors were enamored with The Arabian Nights, a centuries-old collection of stories from India, Persia, and Arabia. In one famous story, Ali Baba and the Forty Thieves, Ali Baba is a poor man who discovers a thieves' den containing hordes of treasure hidden in a cave that can only be opened by saying the magic words open sesame. In the authors' experience, this was the first recollection of a voice-activated application. Though purely a work of fiction, it was indeed an inspiration to explore the art of the possible.
In the last two decades, the popularity of the internet and the proliferation of smart devices have fueled significant technological advancements in digital communications. In parallel, the long-running research effort to develop AI made rapid strides with the advent of ML. Arthur Lee Samuel coined the term machine learning in 1959 and helped make it mainstream in the field of computer science by creating a checkers-playing program that demonstrated how computers can be taught.
The concept that machines can be taught to mimic human cognition, though, was popularized a little earlier, in 1950, by Alan Turing in his paper Computing Machinery and Intelligence. This paper introduced the Turing Test, a variation of a common party game of the time. The purpose of the test was for an interrogator to ask questions and compare responses from a human participant and a computer. The trick was that the interrogator was not aware which was which, as all three were isolated in different rooms. If the interrogator was unable to differentiate the two participants because the responses matched closely, the computer was deemed to have passed the Turing Test and, therefore, to possess AI.
Of course, the field of AI has progressed leaps and bounds since then, largely due to the success of ML algorithms in solving real-world problems. An algorithm, at its simplest, is a programmatic function that converts inputs to outputs based on conditions. In contrast to conventionally programmed algorithms, ML algorithms alter their processing based on the data they encounter. There are different ML algorithms to choose from based on requirements, for example, Extreme Gradient Boosting (XGBoost), a popular algorithm for regression and classification problems; Exponential Smoothing (ETS), for statistical time series forecasting; Single Shot MultiBox Detector (SSD), for computer vision problems; and Latent Dirichlet Allocation (LDA), for topic modeling in NLP problems.
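To make this distinction concrete, here is a minimal sketch in plain Python, with toy data of our own invention rather than any real system, contrasting a conventionally programmed rule with a parameter that is learned from labeled examples:

```python
# A conventionally programmed algorithm: the decision rule is fixed
# by the programmer and never changes.
def rule_based_spam_filter(message):
    return "free money" in message.lower()

# A (very) simplified ML-style algorithm: the decision threshold is
# learned from labeled examples instead of being hard-coded.
def learn_threshold(lengths, labels):
    # Place the threshold halfway between the mean length of each class.
    spam = [n for n, y in zip(lengths, labels) if y == 1]
    ham = [n for n, y in zip(lengths, labels) if y == 0]
    return (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2

# Toy training data: message lengths with spam (1) / not-spam (0) labels.
lengths = [120, 140, 130, 20, 25, 30]
labels = [1, 1, 1, 0, 0, 0]

threshold = learn_threshold(lengths, labels)

def classify(length):
    return 1 if length > threshold else 0

print(threshold)      # → 77.5
print(classify(150))  # → 1
```

Feed the second algorithm different training data and its behavior changes, with no code change; that, in miniature, is the property that distinguishes ML algorithms from conventional ones.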
For more complex problems, ML has evolved into deep learning with the introduction of Artificial Neural Networks (ANNs), which have the ability to solve highly challenging tasks by learning from massive volumes of data. For example, AWS DeepComposer (https://aws.amazon.com/deepcomposer/), an ML service from Amazon Web Services (AWS), educates developers with music as a medium of instruction. One of the ML models that DeepComposer uses is trained with a type of neural network called the Convolutional Neural Network (CNN) to create new and unique musical compositions from a simple input melody using AutoRegressive (AR) techniques:
A piano roll is an image representation of music, and AR-CNN considers music generation as a sequence of these piano roll images:
While there is broad adoption of ML across organizations of all sizes and industries spurred by the democratization of advanced technologies, the potential to solve many types of problems, and the breadth and depth of capabilities in AWS, ML is only a subset of what is possible today with AI. According to one report (https://www.gartner.com/en/newsroom/press-releases/2019-01-21-gartner-survey-shows-37-percent-of-organizations-have, accessed on March 23, 2021), AI adoption grew by 270% in the period 2015 to 2019. And it is continuing to grow at a rapid pace. AI is no longer a peripheral technology only available to those enterprises that have the economic resources to afford high-performance computers. Today, AI is a mainstream option for organizations looking to add cognitive intelligence to their applications to accelerate business value. For example, ExxonMobil in partnership with Amazon created an innovative and efficient way for customers to pay at gas stations. The Alexa pay for gas skill uses the car's Alexa-enabled device or your smartphone's Alexa app to communicate with the gas pump to manage the payment. The authors paid a visit to a local ExxonMobil gas station to try it out, and it was an awesome experience. For more details, please refer to https://www.exxon.com/en/amazon-alexa-pay-for-gas.
AI addresses a broad spectrum of tasks similar to human intelligence, both sensory and cognitive. Typically, these are grouped into categories, for example, computer vision (mimics human vision), NLP (mimics human speech, writing, and auditory processes), conversational interfaces (such as chatbots, mimics dialogue-based interactions), and personalization (mimics human intuition). For example, C-SPAN, a broadcaster that reports on proceedings at the US Senate and the House of Representatives, uses Amazon Rekognition (a computer vision-based image and video analysis service) to tag who is speaking/on camera at any given time. With Amazon Rekognition, C-SPAN was able to index twice as much content compared to what they were doing previously. In addition, AWS offers AI services for intelligent search, forecasting, fraud detection, anomaly detection, predictive maintenance, and much more, which is why AWS was named the leader in the first Gartner Magic Quadrant for Cloud AI.
While language is inherently structured and well defined, the usage or interpretation of language is subjective and may carry unintended influences that you need to be cognizant of when building natural language solutions. Consider, for example, the Telephone Game, which shows how conversations are involuntarily embellished, resulting in an entirely different version compared to how they began. Each participant repeats exactly what they think they heard, not what they actually heard. It is fun when played as a party game but may have more serious repercussions in real life. Computers, too, will repeat what they heard, based on how their underlying ML model interprets language.
To understand how small incremental changes can completely change the meaning, let's look at another popular game: Word Ladder (https://en.wikipedia.org/wiki/Word_ladder). The objective is to convert one word into a different word, often one with the opposite meaning, in as few steps as possible, changing only one letter at each step.
An example is illustrated in the following table:
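The one-letter-per-step constraint also lends itself to a programmatic treatment. The following sketch, a breadth-first search over a tiny made-up dictionary (standard library Python only; a real solver would load a full word list), finds the shortest ladder between two words:

```python
from collections import deque
from string import ascii_lowercase

def word_ladder(start, end, dictionary):
    """Breadth-first search for the shortest ladder from start to end,
    changing exactly one letter per step; every rung must be a word in
    the dictionary. Assumes both words have the same length."""
    words = set(dictionary) | {end}
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        word = path[-1]
        if word == end:
            return path
        # Generate every word reachable by changing a single letter.
        for i in range(len(word)):
            for c in ascii_lowercase:
                candidate = word[:i] + c + word[i + 1:]
                if candidate in words and candidate not in seen:
                    seen.add(candidate)
                    queue.append(path + [candidate])
    return None  # no ladder exists with this dictionary

# A tiny, invented dictionary just for illustration.
dictionary = {"cold", "cord", "card", "ward", "warm", "wart", "cart"}
print(word_ladder("cold", "warm", dictionary))
# → ['cold', 'cord', 'card', 'ward', 'warm']
```

Each intermediate word is valid on its own, yet four single-letter steps are enough to turn cold into its opposite, which is exactly the kind of small, cumulative drift in meaning that NLP systems must be robust to.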
Adapting AI to work with natural language resulted in a group of capabilities that primarily deal with computational emulation of cognitive and sensory processes associated with human speech and text. There are two main categories that applications can be grouped into:
- Natural Language Understanding (NLU), for voice-based applications such as Amazon Alexa, and speech-to-text/text-to-speech conversions
- NLP, for the interpretation of context-based insights from text
With NLU, applications that hitherto needed multiple and sometimes cumbersome interfaces, such as a screen, keyboard, and a mouse to enable computer-to-human interactions, can work as efficiently with just voice.
In Stanley Kubrick's 1968 movie 2001: A Space Odyssey (spoiler alert!!), an artificially intelligent computer known as the HAL 9000 uses vision and voice to interact with the humans on board, and in the course of the movie, develops a personality, refuses to accept that it is in error, and attempts to kill the humans when it discovers their plot to shut it down. Fast forward to now, 20 years after the future depicted in the movie, and we have made significant progress in language understanding and processing, but thankfully not to the extreme extent dramatized in the movie's plot.
Now that we have a good understanding of the context in which NLP has developed and how it can be used, let's examine some of the common challenges you might face while developing NLP solutions.
Overcoming the challenges in building NLP solutions
We read earlier that the main difference between the algorithms used for regular programming and those used for ML is the ability of ML algorithms to modify their processing based on the input data fed to them. In the NLP context, as in other areas of ML, these differences add significant value and accelerate enterprise business outcomes. Consider, for example, a book publishing organization that needs to create an intelligent search capability displaying book recommendations to users based on topics of interest they enter.
In a traditional world, you would need multiple teams to go through the entire book collection, read books individually, identify keywords, phrases, topics, and other relevant information, create an index to associate book titles, authors, and genres with these keywords, and link this with the search capability. This is a massive effort that takes months or years to set up based on the size of the collection, the number of people, and their skill levels, and the accuracy of the index is prone to human error. As books are updated to newer editions, new books are added, and old ones are removed, this effort would have to be repeated incrementally. This is also a significant cost and time investment that may deter many unless that time and those resources have already been budgeted for.
To bring in a semblance of automation in our previous example, we need the ability to digitize text from documents. However, this is not the only requirement, as we are interested in deriving context-based insights from the books to power a recommendations index for a reader. And if we are talking about, for example, a publishing house such as Packt, with 7,500+ books in its collection, we need a solution that not only scales to process large numbers of pages, but also understands relationships in text, and provides interpretations based on semantics, grammar, word tokenization, and language to create smart indexes. We will cover a detailed walkthrough of this solution, along with code samples and demo videos, in Chapter 5, Creating NLP Search.
Today's enterprises are grappling with leveraging meaningful insights from their data primarily due to the pace at which it is growing. Until a decade or so ago, most organizations used relational databases for all their data management needs, and some still do even today. This was fine because data volumes were in single-digit terabytes or less. In the last few years, the technology landscape has witnessed a significant upheaval with smartphones becoming ubiquitous, the large-scale proliferation of connected devices (in the billions), the ability to dynamically scale infrastructure in size and into new geographies, and storage and compute costs becoming cheaper due to the democratization offered by the cloud. All of this means applications get used more often, have much larger user bases and more processing power and capabilities, can accelerate their pace of innovation with faster go-to-market cycles, and as a result, have a need to store and manage petabytes of data. This, coupled with application users demanding faster response times and higher throughput, has put a strain on the performance of relational databases, fueling a move toward purpose-built databases such as Amazon DynamoDB, a key-value and document database that delivers single-digit millisecond latency at any scale.
While this move signals a positive trend, what is more interesting is how enterprises utilize this data to gain strategic insights. After all, data is only as useful as the information we can glean from it. We see many organizations, while accepting the benefits of purpose-built tools, implementing these changes in silos. So, there are varying levels of maturity in properly harnessing the advantages of data. Some departments use an S3 data lake (https://aws.amazon.com/products/storage/data-lake-storage/) to source data from disparate sources and run ML to derive context-based insights, others are consolidating their data in purpose-built databases, while the rest are still using relational databases for all their needs.
You can see a basic explanation of the main components of a data lake in the following Figure 1.5, An example of an Amazon S3 data lake:
Let's see how NLP can continue to add business value in this situation by referring back to our book publishing example. Suppose we successfully built our smart indexing solution, and now we need to update it with book reviews received via Twitter feeds. The searchable index should provide book recommendations based on review sentiment (for example, don't recommend a book if reviews are negative > 50% in the last 3 months). Traditionally, business insights are generated by running a suite of reports on behemoth data warehouses that collect, mine, and organize data into marts and dimensions. A tweet may not even be under consideration as a data source. These days, things have changed and mining social media data is an important aspect of generating insights. Setting up business rules to examine every tweet is a time-consuming and compute-intensive task. Furthermore, since a tweet is unstructured text, a slight change in semantics may impact the effectiveness of the solution.
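Once each review carries a sentiment label, though, the recommendation rule itself is simple to express. Here is a hypothetical sketch with invented review records; in practice, the sentiment labels would come from an NLP model run over the tweets:

```python
from datetime import date, timedelta

def should_recommend(reviews, today, window_days=90, max_negative_ratio=0.5):
    """Apply the rule from the text: don't recommend a book if more than
    half of its reviews in the recent window are negative."""
    cutoff = today - timedelta(days=window_days)
    recent = [r for r in reviews if r["date"] >= cutoff]
    if not recent:
        return True  # no recent signal; fall back to recommending
    negative = sum(1 for r in recent if r["sentiment"] == "NEGATIVE")
    return negative / len(recent) <= max_negative_ratio

# Invented review records for one book.
today = date(2021, 6, 1)
reviews = [
    {"date": date(2021, 5, 20), "sentiment": "NEGATIVE"},
    {"date": date(2021, 5, 22), "sentiment": "NEGATIVE"},
    {"date": date(2021, 5, 25), "sentiment": "POSITIVE"},
    {"date": date(2020, 1, 10), "sentiment": "POSITIVE"},  # outside the window
]
print(should_recommend(reviews, today))  # → False (2 of 3 recent are negative)
```

The business logic is a few lines; the hard part, as the rest of this section argues, is reliably producing the sentiment label for each piece of unstructured text.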
Now, if you consider model training, the infrastructure required to build accurate NLP models typically uses a deep learning architecture called the Transformer (please see https://www.packtpub.com/product/transformers-for-natural-language-processing/9781800565791), which performs sequence-to-sequence processing without needing to process the tokens in order, resulting in a higher degree of parallelization. Transformer model families use billions of parameters, with the training architecture relying on clusters of instances for distributed learning, which adds to time and costs.
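To illustrate why Transformers parallelize so well, here is a minimal, illustrative sketch of scaled dot-product self-attention in plain Python, using toy two-dimensional token vectors of our own choosing rather than a real model. Each output position attends to all input positions independently, so the rows can be computed in parallel rather than token by token:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention. Each output row is a weighted mix
    of all value vectors; no row depends on another output row, which
    is why they can all be computed at once."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Three toy token vectors; queries = keys = values for simplicity.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(tokens, tokens, tokens)
print([[round(x, 2) for x in row] for row in out])
```

Contrast this with a recurrent network, where position t cannot be processed until position t-1 is finished; removing that dependency is what lets Transformer training spread across large clusters, at the cost sketched above.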
AWS offers AI services that allow you, with just a few lines of code, to add NLP to your applications for the sentiment analysis of unstructured text at an almost limitless scale and immediately take advantage of the immense potential waiting to be discovered in unstructured text. We will cover AWS AI services in more detail from Chapter 2, Introducing Amazon Textract, onward.
In this section, we reviewed some challenges organizations encounter when building NLP solutions, such as complexities in digitizing paper-based text, understanding patterns from structured and unstructured data, and how resource-intensive these solutions can be. Let's now understand why NLP is an important mainstream technology for enterprises today.
Understanding why NLP is becoming mainstream
According to this report (https://www.marketsandmarkets.com/Market-Reports/natural-language-processing-nlp-825.html, accessed on March 23, 2021), the global NLP market is expected to grow to USD 35.1 billion by 2026, at a Compound Annual Growth Rate (CAGR) of 20.3% during the forecast period. This is not surprising considering the impact ML is making across every industry (such as finance, retail, manufacturing, energy, utilities, real estate, healthcare, and so on) in organizations of every size, primarily driven by the advent of cloud computing and the economies of scale available.
This article about the Emergence Cycle (https://blogs.gartner.com/anthony_bradley/2020/10/07/announcing-gartners-new-emergence-cycle-research-for-ai/), research into emerging technologies in NLP (based on patents submitted, looking at technology still in labs or recently released), shows that the most mature usage of NLP is multimedia content analysis. This rings true in our experience: content analysis to gain strategic insights is a common NLP requirement, based on our discussions with a number of organizations across industries:
For example, in 2020, when the world was struggling with the effects of the pandemic, a number of organizations adopted AI and specifically NLP to power predictions on the virus spread patterns, assimilate knowledge on virus behavior and vaccine research, and monitor the effectiveness of safety measures, to name a few. In April 2020, AWS launched an NLP-powered search site called https://cord19.aws/ using an AWS AI service called Amazon Kendra (https://aws.amazon.com/kendra/). The site provides an easy interface to search the COVID-19 Open Research Dataset using natural language questions. As the dataset is constantly updated based on the latest research on COVID-19, CORD-19 Search, due to its support for NLP, makes it easy to navigate this ever-expanding collection of research documents and find precise answers to questions. The search results provide not only specific text that contains the answer to the question but also the original body of text in which these answers are located:
Fred Hutchinson Cancer Research Center is a research institute focused on curing cancer by 2025. Matthew Trunnell, Chief Information Officer of Fred Hutchinson Cancer Research Center, has said the following:
For more details and usage examples of Amazon Comprehend and Amazon Comprehend Medical, please refer to Chapter 3, Introducing Amazon Comprehend.
So, how can AI and NLP help us cure cancer or prepare for a pandemic? It's about recognizing patterns where none seem to exist. Unstructured text, such as documents, social media posts, and email messages, is similar to the treasure waiting in Ali Baba's cave. To understand why, let's briefly look at how NLP works.
NLP models train by learning what are called word embeddings, which are vector representations of words in large collections of documents. These embeddings capture semantic relationships and word distributions in documents, thereby helping to map the context of a word based on its relationship to other words in the document. The two common training architectures for learning word embeddings are Skip-gram and Continuous Bag of Words (CBOW). In Skip-gram, the embedding of the input word is used to predict the surrounding context words, while in CBOW, the embeddings of the surrounding words are used to predict the word in the middle. Both are neural network-based architectures and work well for context-based analytics use cases.
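To make the Skip-gram/CBOW distinction concrete, the following sketch (plain Python, with a toy sentence and window size of our own choosing) generates the training examples each architecture would learn from:

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: the center word is the input, and each surrounding
    word within the window is a separate prediction target."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """CBOW: the surrounding words (as a bag) are the input, and the
    center word is the prediction target."""
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, center))
    return pairs

tokens = "open sesame said ali baba".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

A neural network trained on the first set of pairs learns to predict context from a word (Skip-gram); trained on the second, it learns to predict a word from its context (CBOW). The embeddings are the network's learned weights, which is how words end up as the numeric vectors described above.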
Now that we understand the basics of NLP (analyzing patterns in text by converting words to their vector representations), consider what happens when we train models using text data from disparate data sources: because we are using numbers to find relationships in text, unique insights often emerge from patterns that previously appeared hidden when looked at within a narrower context. For example, The Rubber Episode in the Amazon Prime TV show This Giant Beast That Is The Global Economy shows how a fungal disease has the potential to devastate the global economy, even though at first it might appear there is no link between the two. According to the US National Library of Medicine, natural rubber accounts for 40% of the world's rubber consumption, and the South American Leaf Blight (SALB) fungal disease has the potential to spread worldwide and severely inhibit rubber production. Airplanes can't land without rubber, and its uses are so myriad that a shortage would have unprecedented implications for the economy. This is an example of the kind of hidden pattern, connecting specific items of interest across vast text corpora, that ML and NLP models are so good at finding.
Building NLP solutions in-house, however, comes with its own set of challenges:

- Lack of skills: Identifying data, feature engineering, building models, training, and tuning are all tasks that require a unique combination of skills, including software engineering, mathematics, statistics, and data engineering, that only a few practitioners have.
- Initial infrastructure setup cost: ML training is an iterative process, often requiring a trial-and-error approach to tune the models to get the desired accuracy. Furthermore, training and inference may require GPU acceleration based on the volume of data and the number of requests, requiring a high initial investment.
- Scalability with the current on-premises environment: Running ML training and inference from on-premises servers constrains the elasticity required to scale compute and storage based on model size, data volumes, and inference throughput needed. For example, training large-scale transformer models may require massively parallel clusters, and capacity planning for such scenarios is challenging.
- Availability of tools to help orchestrate the various moving parts of NLP training: As mentioned before, the ML workflow comprises many tasks, such as data discovery, feature engineering, algorithm selection, model building, which includes training and fine-tuning the models several times, and finally deploying those models into production. Furthermore, getting an accurate model is a highly iterative process. Each of these tasks requires purpose-built tools and expertise to achieve the level of efficiency needed for good models, which is not easy.
Not anymore. The AWS AI services for natural language capabilities enable adding speech and text intelligence to applications using API calls rather than needing to develop and train models. NLU services provide the ability to convert speech to text with Amazon Transcribe (https://aws.amazon.com/transcribe/) or text to speech with Amazon Polly (https://aws.amazon.com/polly/). For NLP requirements, Amazon Textract (https://aws.amazon.com/textract/) enables applications to read and process handwritten and printed text from images and PDF documents, and with Amazon Comprehend (https://aws.amazon.com/comprehend/), applications can quickly analyze text and find insights and relationships with no prior ML training. For example, Assent, a supply chain data management company, used Amazon Textract to read forms, tables, and free-form text, and Amazon Comprehend to derive business-specific entities and values from the text. In this book, we will be walking you through how to use these services for some popular workflows. For more details, please refer to Chapter 4, Automating Document Processing Workflows.
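To give a flavor of what a few lines of code means here, the following is a hypothetical sketch of calling Amazon Comprehend's DetectSentiment API through the AWS SDK for Python (boto3). The live call assumes boto3 is installed and AWS credentials, a region, and Comprehend permissions are configured, so it is kept inside main(); the response-parsing helper runs on a sample response with invented scores:

```python
def dominant_sentiment(response):
    """Pick the highest-scoring label out of a DetectSentiment-style
    response dictionary."""
    scores = response["SentimentScore"]
    return max(scores, key=scores.get).upper()

def main():
    # Live call -- requires boto3 plus configured AWS credentials/region.
    import boto3
    comprehend = boto3.client("comprehend")
    response = comprehend.detect_sentiment(
        Text="This book made NLP on AWS finally click for me!",
        LanguageCode="en",
    )
    print(response["Sentiment"], response["SentimentScore"])

# A sample response shape (scores invented) to show what comes back:
sample = {
    "Sentiment": "POSITIVE",
    "SentimentScore": {"Positive": 0.97, "Negative": 0.01,
                       "Neutral": 0.01, "Mixed": 0.01},
}
print(dominant_sentiment(sample))  # → POSITIVE
```

No model is trained, hosted, or tuned by the caller; the entire sentiment analysis step is the detect_sentiment call, which is the point of the managed AI services discussed in the rest of this book.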
In this section, we saw some examples of NLP's significance in solving real-world challenges, and what exactly it means. We understood that finding patterns in data can bring new meaning to light, and NLP models are very good at deriving these patterns. We then reviewed some technology challenges in NLP implementations and saw a brief overview of the AWS AI services. In the next section, we will introduce the AWS ML stack, and provide a brief overview of each of the layers.
Introducing the AWS ML stack
The AWS ML services and features are organized into three layers of the stack, keeping in mind that some developers and data scientists are expert ML practitioners who are comfortable working with ML frameworks, algorithms, and infrastructure to build, train, and deploy models.
For these experts, the bottom layer of the AWS ML stack offers powerful CPU and GPU compute instances (the https://aws.amazon.com/ec2/instance-types/p4/ instances offer the highest performance for ML training in the cloud today) and support for major ML frameworks including TensorFlow, PyTorch, and MXNet, which customers can use to build models with Amazon SageMaker as a managed experience, or with deep learning AMIs and containers on Amazon EC2 instances.
You can see the three layers of the AWS ML stack in the next figure. For more details, please refer to https://aws.amazon.com/machine-learning/infrastructure/:
To make ML more accessible and expansive, at the middle layer of the stack, Amazon SageMaker is a fully managed ML platform that removes the undifferentiated heavy lifting at each step of the ML process. Launched in 2018, SageMaker is one of the fastest-growing services in AWS history and is built on Amazon's two decades of experience in building real-world ML applications. With SageMaker Studio, developers and data scientists have the first fully integrated development environment designed specifically for ML. To learn how to build ML models using Amazon SageMaker, refer to Julien Simon's book, Learn Amazon SageMaker, also published by Packt (https://www.packtpub.com/product/learn-amazon-sagemaker/9781800208919):
For customers who are not interested in dealing with models and training, at the top layer of the stack, the AWS AI services provide pre-trained models with easy integration by means of API endpoints for common ML use cases including speech, text, vision, recommendations, and anomaly detection:
Alright, it's time that we started getting technical. Now that we understand how cloud computing played a major role in bringing ML and AI to the mainstream and how adding NLP to your application can accelerate business outcomes, let's deep dive into the NLP services Amazon Textract for document analysis and Amazon Comprehend for advanced text analytics.
Ready? Let's go!!
In this chapter, we introduced NLP by tracing the origins of AI, how it evolved over the last few decades, and how the application of AI became mainstream with the significant advances made with ML algorithms. We reviewed some examples of these algorithms, along with an example of how they can be used. We then pivoted to AI trends and saw how AI adoption grew exponentially over the last few years and has become a key technology in accelerating enterprise business value.
We read a cool example of how ExxonMobil uses Alexa at their gas stations and delved into how AI was created to mimic human cognition, and the broad categories of its applicability, such as text, speech, and vision. We saw how AI in natural language has two main areas of usage: NLU for voice-based uses and NLP for deriving insights from text.
In analyzing how enterprises are building NLP models today, we reviewed some of the common challenges and how to mitigate them, such as digitizing paper-based text, collecting data from disparate sources, and understanding patterns in data, and how resource-intensive these solutions can be.
We then reviewed NLP industry trends and market segmentation and saw with an example how important NLP was and still continues to be during the pandemic. We dove deep into the philosophy of NLP and realized it was all about converting text to numerical representations and understanding the underlying patterns to decipher new meanings. We looked at an example of this pattern with how SALB could impact the global economy.
Finally, we reviewed the technology implications in setting up NLP training and the associated challenges. We reviewed the three layers of the AWS ML stack and introduced AWS AI services that provided pre-built models and ready-made intelligence.
In the next chapter, we will introduce Amazon Textract, a fully managed ML service that can read both printed and handwritten text from images and PDFs without having to train or build models and can be used without the need for ML skills. We will cover the features of Amazon Textract, what its functions are, what business challenges it was created to solve, what types of user requirements it can be applied to, and how easy it is to integrate Amazon Textract with other AWS services such as AWS Lambda for building business applications.
- This Giant Beast That Is The Global Economy (https://www.amazon.com/This-Giant-Beast-Global-Economy/dp/B07MJDD22F)
- Learn Amazon SageMaker by Julien Simon (https://www.packtpub.com/product/learn-amazon-sagemaker/9781800208919)
- AWS CORD-19 Search: A Neural Search Engine for COVID-19 Literature (https://arxiv.org/abs/2007.09186)
- Transformers and Natural Language Processing (https://aws.amazon.com/blogs/machine-learning/aws-and-hugging-face-collaborate-to-simplify-and-accelerate-adoption-of-natural-language-processing-models/)
- Training NLP models with Amazon SageMaker's model parallelism library (https://aws.amazon.com/blogs/machine-learning/how-latent-space-used-the-amazon-sagemaker-model-parallelism-library-to-push-the-frontiers-of-large-scale-transformers/)
- Oppy, Graham and David Dowe, "The Turing Test", The Stanford Encyclopedia of Philosophy (Winter 2021 Edition), Edward N. Zalta (ed.) (https://plato.stanford.edu/archives/win2021/entries/turing-test/)