Intelligent Document Processing with AWS AI and ML
It was a Wednesday evening, and I was busy collecting all my receipts and filling out my insurance claim document. I wanted my health insurance to reimburse me for the COVID-19 test kits I had purchased. The next day, I went to the post office to send the documents through postal mail to my insurance provider. This made me think about how we are still working with physical documents in the 21st century. By my approximate math, we will use 650 million documents this month alone, assuming that 2% of the entire US population buys a test kit and applies for reimbursement using a paper-based application. That is a ton of documents for this one use case. In addition to physical copies, we may have tons of documents that exist only as scans – and we are looking at manual processing for those documents too. Can we do any better in the 21st century to automate the processing of these documents?
Besides this particular instance, we use documents for many other use cases across industries, such as claims processing in the insurance industry, loan and mortgage documents in the financial industry, and legal and contract documents. If you have bought or refinanced a house, you will already be aware of the number of documents you need for loan processing. IDC predicts that worldwide data will exceed 175 zettabytes by 2025, growing by 23% every year. On top of the sheer volume, we are talking about data in different formats, much of it unstructured – some documents are forms, as with insurance claims, and some are dense text, as with legal contracts. The volume and varying formats of documents make manual processing time-consuming, error-prone, and expensive. Moreover, legacy or traditional document extraction technologies can work well for pristine documents, but when document quality varies, the performance of those early-generation systems frequently does not meet customer needs. Manual document extraction carried out by a human workforce introduces variability into the process, since people make mistakes and double-checking all work is not cost-effective. What matters most is getting the key information from documents into your decision-making systems, so that you can make high-quality decisions more quickly, based on accurate information. Hence, we are all looking for efficient, less time-consuming, cost-effective ways to process our documents for better insights.
In this introductory chapter, we will be establishing the basic context to familiarize you with some of the underlying concepts of document processing, the challenges in document processing, and how AWS Artificial Intelligence (AI)/Machine Learning (ML) services can help solve these problems.
We will be covering the following topics in this chapter:
- Understanding common document processing use cases across industries
- Understanding the AWS ML and AI stack
- Introducing the Intelligent Document Processing pipeline
Understanding common document processing use cases across industries
We started with a simple claims processing use case in the healthcare industry, but document processing challenges occur across multiple use cases and industries. For example, with a single patient generating nearly 80 megabytes of imaging and Electronic Medical Record (EMR) data each year, according to 2017 estimates, RBC Capital Markets projects that the compound annual growth rate of healthcare data will reach 36% by 2025. When a patient visits a physician, an immense amount of data is generated. Equally, when I speak with customers, they say they have petabytes of data in their archives – sitting on a disk or tape drive for legal or regulatory reasons without being processed further, and most of it unstructured. For example, some healthcare providers in the US store medical history records for at least 7 years, as required by regulation. If we could analyze a patient's historical data, we could build a predictive model for chronic diseases. This data is a gold mine, but because there is no efficient, cost-effective mechanism for document processing, it sits there unused. Most of this data is currently stored as archived data and retired after the 7-year period is over. Can we use this data to derive insights for better healthcare outcomes?
Similarly, in the financial industry, there is a need for document processing – for example, when processing mortgage documents. Anyone who has bought a new home or refinanced their home must know the number of documents and different document types that we deal with for mortgage processing. McKinsey’s report emphasizes that mortgage providers should get things right the first time to reduce any delay in processing. To address the timely verification of these documents, we need to empower loan officers with the right tools, automation, and insights. The immense volume and format of documents and the need to derive insights from them require automation with the right indexing, categorization, and extraction, with human reviews as needed to detect anomalies and get the mortgage documents right the first time for timely processing.
It is not only the healthcare and financial industries that require document processing. Industries across verticals, with use cases such as legal documents and contracts, insurance, ID handling, and enrollments, want to automate document processing with advanced AI and ML technologies. Intelligent Document Processing uses AI-powered automation and ML to classify, extract, transform, and enrich our documents for consumption. Before discussing advanced technologies and solutions, it is always good to start with the basics, so let's first set the foundation of AI and ML.
Understanding the AWS ML and AI stack
Just five decades ago, ML was still a thing of science fiction, but today it has proven to be an integral part of our everyday lives. It helps us drive our cars, recommends personalized shopping experiences, and powers voice-enabled technologies such as Alexa. The early days of AI and ML began with simple calculators and chessboard games, but by the 21st century, this had evolved into diagnosing cancer and more. The initial theory of ML lived in research labs, and it has now moved from the lab into real-life applications across industries. This is a remarkable change in the adoption of AI and ML.
Figure 1.1 – AI and ML
What is AI? AI is a broad branch of computer science concerned with building smart machines, and ML is a subset or application of AI, as shown in Figure 1.1. The goal of ML is to let the machine learn automatically, without explicit programming or human assistance. We want the machine to learn from its own experience and provide results: you gather data, and the model learns and corrects itself based on that data. One of the famous historical achievements of AI and ML is Alan Turing's paper and the subsequent development of the Turing Test in the 1950s, which established the fundamental goal and vision for AI. It focused on one main question – can machines learn like humans? Two years later, Arthur Samuel, another pioneer in the computer science and gaming industry, wrote the very first computer learning program, for playing the game of checkers. It was programmed to learn from the moves that allowed it to win and then improve its own play. Among more recent AI and ML accomplishments, in 2015, AWS launched its own ML platform to make its models and ML infrastructure more accessible.
Now, we see AI and ML in our everyday usage. If you have used any e-commerce, online media, or entertainment platform, you will be familiar with receiving personalized recommendations or using conversational chatbots and virtual assistants built with AI services. These personalized recommendations and experiences drive user engagement. Similarly, helpdesk calls at contact centers can be automated with AI, reducing the burden on human agents and lowering costs. Moreover, AI can be used for automatic document processing, enabling accurate extraction and analysis and instant insights, as in loan processing or claims processing.
We now see a wide presence of ML and AI across industries, which are busy building newer models that learn better and more quickly to give accurate predictions and accelerate business value. But the main question is – can we share the experience and knowledge gained when building these models? Can a builder reuse an already trained model for their own business, focusing on their business needs without spending the time and effort to train another model?
The answer is yes, and for that reason, AWS has divided its ML stack into three broad layers. Let's discuss each layer in detail, along with its core goals in solving user requirements, starting with the following figure:
Figure 1.2 – The ML framework and infrastructure at the bottom of the AWS stack
At the bottom of the AWS AI/ML stack, we see services and features targeted at expert ML practitioners who are comfortable working with ML frameworks, algorithms, and deploying their own ML infrastructure. Some of the AWS ML frameworks and infrastructure are shown in Figure 1.2. AWS offers users their framework of choice, supporting ML frameworks such as PyTorch, Apache MXNet, and TensorFlow so that they run optimally on the AWS platform. The bottom layer also includes CPU and GPU instances. Decades ago, obtaining GPU resources to accelerate an ML workload was a wild dream for most ML builders – you might have had to reach out to a supercomputing center to get ahold of GPU resources. Today, you can access GPUs at your fingertips with AWS, with options to select instances customized by memory, vCPUs, architecture, and more. AWS also added Trainium, a second ML chip optimized for deep learning training.
Not only that, but to democratize the ML infrastructure, AWS offers Inferentia to drive high-performance deep learning inference on the cloud at a fraction of the cost:
Figure 1.3 – ML services in the middle of the AWS stack
The middle layer of the AI/ML stack is targeted at ML builders who want to build, train, and deploy their own ML models. Some of the AWS offerings for ML services are shown in Figure 1.3. This layer makes ML more accessible and expansive. Amazon SageMaker helps data scientists and developers prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML. Amazon SageMaker offers JumpStart to help you quickly get started with a solution that automatically extracts, processes, and analyzes documents for accelerated and accurate business outcomes. It offers an integrated Jupyter notebook environment for authoring your models with pre-built, optimized algorithms, while also letting ML users bring their own algorithms and frameworks. It provides a managed, scalable, and secure training and deployment platform for your ML process. To learn more about Amazon SageMaker, you can also refer to the book The Machine Learning Solutions Architect Handbook, written by David Ping and published by Packt.
You can find this book here: https://www.packtpub.com/product/the-machine-learning-solutions-architect-handbook/9781801072168.
Figure 1.4 – AI services at the top of the AWS stack
AWS designed the top layer to put ML in the hands of every developer. These are the AI services. Drawing on Amazon.com's experience with ML, AWS offers highly accurate, API-driven AI services. You do not need to be an ML expert to call the pre-trained models through these APIs; rather, you can use AI services to enhance your customer experience, improve productivity, and reach the market faster with ready-made ML models. At the core of the AI services, we have vision services such as Amazon Rekognition and AWS Panorama. For speech, we have services such as Amazon Polly, Amazon Transcribe, and Transcribe Call Analytics; and for chatbots, Amazon Lex. You can leverage these speech and bot services for use cases such as call center modernization. To share Amazon.com's experience with recommendation systems, AWS offers Amazon Personalize. In this book, we will dive deep into document processing use cases with the text and document services Amazon Textract and Amazon Comprehend. To help customers with industry-specific use cases, AWS AI services are also categorized by industry, with services such as Amazon Monitron and Amazon Lookout for industrial use, and Amazon Comprehend Medical, Amazon HealthLake, and Amazon Transcribe Medical for healthcare. Figure 1.4 shows how AI services can be aligned to specific industry use cases, but in this book, we will dive deeper into IDP use cases in particular.
One of the main benefits of AWS AI services is that the models are fully managed: AWS takes care of the undifferentiated heavy lifting of building, maintaining, patching, and upgrading the servers and hardware required for the models to run. You can customize and interact with the AI models and perform predictions via API calls or directly from the AWS console. AWS AI services enable performant and scalable solutions with serverless technologies: you can call the AI service APIs from a serverless architecture that scales automatically as document processing demand grows or shrinks, delivering low latency and timely results for your business use case:
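Each AI service follows the same request/response pattern: you send a document (or its Amazon S3 location) to an API and parse the JSON response. As a minimal illustrative sketch, here is how you might pull the detected text lines out of a Textract-style `DetectDocumentText` response. Note that `sample_response` below is a trimmed, hypothetical payload for illustration, not real service output; in practice you would obtain the response from a call such as `boto3.client("textract").detect_document_text(...)`.

```python
def extract_text_lines(response):
    """Collect the text of all LINE blocks from a Textract-style
    DetectDocumentText response, in the order they were returned."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]

# A trimmed, hypothetical response payload for illustration only.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Claim Form"},
        {"BlockType": "LINE", "Text": "Insurer ID: AB-1234"},
        {"BlockType": "WORD", "Text": "Claim"},
    ]
}

print(extract_text_lines(sample_response))
# → ['Claim Form', 'Insurer ID: AB-1234']
```

Because the heavy lifting happens behind the API, the calling code stays this small regardless of how large the model behind the service is.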
Figure 1.5 – Accessing AWS AI services with an API call
With AWS AI or ML offerings, we have multiple technologies available to implement the same use case. There are trade-offs when using AI services that are API-driven versus ML services. We will dive deeper into the comparison of AI and ML models for IDP in Chapter 3, Accurate Document Extraction with Amazon Textract, under the Introduction to Textract section.
Alright, it’s time to get started with an overview of IDP. Now that we understand how AWS cloud infrastructure and services will help us accelerate our AI or ML workload, let’s dive into the IDP pipeline and its applications across industries.
Introducing the Intelligent Document Processing pipeline
IDP seems simple, but in reality, it is a complex challenge to solve. Imagine a physical library – racks and racks of books divided and arranged in rows, tagged with the right author and genre. Have you ever wondered about the human workforce that does this diligent, structured work to help us find the right book in a timely and efficient manner?
Similarly, as you know, we deal with documents across industries for various use cases. In the traditional world, you would need many teams to go through the entire set of documents, reading each one manually. They would then identify the category each document belongs to and tag it with the right keywords and topics so that it can be easily identified and searched. After that, the main goal is to extract insights from these documents. This is a massive process and can take months or even years to set up, depending on the volume of data and the skill level of the workforce. Manual operations can be time-consuming, error-prone, and expensive. To onboard a new document type, or to update or remove one, these steps need to be repeated. This is a significant investment of time and effort, and it places a lot of pressure on the manual workforce. Sometimes, the time and effort needed are not budgeted for, which can significantly delay or pause the process. To automate this, we need digitization and digitalization.
Digitization is the conversion of data to a digital format, and it has to happen before digitalization. Digitalization is the process of using these digital copies to derive context-based insights and transform the process. After transformation, there must be a way to consume the information. This entire process is known as the IDP pipeline. Refer to Figure 1.6 for a detailed view of the IDP pipeline and its stages:
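Before we walk through each stage, the overall shape of the pipeline can be sketched as a chain of functions, one per stage. This is a toy skeleton only: the stage names follow the pipeline just described, but every decision inside each function is a hard-coded placeholder standing in for the real services covered in later chapters.

```python
def capture(raw):      # data capture: ingest and store the document
    return {"source": raw, "stage": "captured"}

def classify(doc):     # classification: tag the document type
    doc["doc_type"] = "claim_form"                # placeholder decision
    return doc

def extract(doc):      # extraction: pull out the key fields
    doc["fields"] = {"insurer_id": "AB-1234"}     # placeholder fields
    return doc

def enrich(doc):       # enrichment: add domain-specific context
    doc["fields"]["insurer_name"] = "Example Health"  # placeholder lookup
    return doc

def review(doc):       # post-processing: validate or route to a human
    doc["complete"] = "insurer_id" in doc["fields"]
    return doc

def consume(doc):      # integration: hand the result downstream
    return doc

result = consume(review(enrich(extract(classify(capture("claim.pdf"))))))
print(result["doc_type"], result["complete"])  # → claim_form True
```

The point of the sketch is the ordering: each stage consumes the previous stage's output, which is why a weak classification step degrades everything after it.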
Figure 1.6 – The IDP pipeline and its stages
In our library example, we can go to a library directly, look for books, borrow or return a book, or just sit and read in the library. We want a place where all books are available, well organized, easily accessible when we need them, and affordable. Similarly, at the data capture stage, documents are like our library books: during this stage, we collect and aggregate all our data in a secure, centralized, scalable data store. While building the data capture stage of your IDP pipeline, you have to take data sources, data formats, and the data store into consideration:
- Document sources: Data can come from various sources. It can be as simple as mobile capture, such as submitting receipts for reimbursement or submitting digital pictures of all your applications, transcripts, ID documents, and supporting documents during any registration process. Other sources can be simple fax or mail attachments.
- Document format: The data we speak about comes in different formats and sizes. Some documents are just a single page, such as a driver's license or insurance card, while others run to multiple pages, such as a loan mortgage application or an insurance benefits summary. We categorize data into three broad categories: structured, semi-structured, and unstructured. Structured documents contain structured elements, such as tables. Unstructured documents contain dense text, as in legal and contractual documents. Finally, semi-structured documents contain key-value elements, as in an insurance application form. Most often, though, documents have multiple pages containing a mix of structured, semi-structured, and unstructured elements. There are also different types of digital documents – some are image-based, such as JPEG and PNG files, and others are PDF or TIFF documents with varying resolutions and scanning angles.
- Document store: To store the untransformed and transformed documents, we need a secure data store. At times, we have a requirement to store metadata about documents, such as the author or date, that is not mentioned in the document itself, for future mapping of metadata to extraction results. Industries such as healthcare, the public sector, and finance need to store their documents and results securely, following their security, governance, and compliance requirements. For instantaneous, highly performant access, they need storage that is easy to manage and simple to access from anywhere. The volume of data and documents is vast, so we require a data store that can scale with our needs. Another important factor is high reliability and availability, so that you can access your data whenever you need it. Moreover, given the high volume of documents, we are looking for a cost-effective document store.
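As a small illustration of the capture-stage considerations above, a pipeline often routes incoming files by format before storing them, since image files and PDF/TIFF files may need different ingestion handling. This is a minimal sketch; the extension-to-route mapping is a hypothetical example, not a prescribed design.

```python
from pathlib import Path

# Hypothetical routing table: which ingestion path handles which format.
FORMAT_ROUTES = {
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".pdf": "document", ".tif": "document", ".tiff": "document",
}

def route_document(filename):
    """Return the ingestion route for a file, or 'unsupported'."""
    suffix = Path(filename).suffix.lower()
    return FORMAT_ROUTES.get(suffix, "unsupported")

print(route_document("drivers_license.PNG"))   # → image
print(route_document("mortgage_packet.pdf"))   # → document
print(route_document("notes.docx"))            # → unsupported
```

An explicit "unsupported" route matters in practice: rejecting an unknown format at capture time is much cheaper than discovering the problem downstream at extraction time.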
Let’s now move on to the next IDP phase.
Going back to our book library example, the books are categorized and stacked by category. For example, for any fiction or non-fiction books, you can directly check the label on the rack and go to the section where you can find the books related to that category. Each section can be further subdivided into sub-sections or can be arranged by the first letter of the author’s name. Can you imagine how difficult it would be to locate a book if it were not categorized correctly?
Similarly, at times, you receive a package of documents, or sometimes a single PDF with all the required documents merged into it. A human can preview the documents to categorize them into their specified folders. This later helps with information extraction and metadata extraction from a variety of complex documents, depending on the document type. This process of categorizing the documents is known as document classification; the component that separates a merged file into individual documents is often called a document splitter.
This process is crucial when we try to automate our document extraction process and when we receive multiple documents and don’t have a clear way to identify each document type. If you are dealing with a single document type or have an identifiable way to locate the document, then you can skip this step in the IDP pipeline. Otherwise, classify those documents correctly before proceeding in the IDP pipeline.
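To make the classification step concrete, here is a tiny keyword-based sketch. This is only a stand-in to show the input and output of the step – the keyword lists are hypothetical, and a production pipeline would use a trained classifier (for example, Amazon Comprehend custom classification, covered later in this book) rather than keyword matching.

```python
# Hypothetical keyword lists; a real system would use a trained classifier.
DOC_TYPE_KEYWORDS = {
    "claim_form": ["claim", "insurer id", "reimbursement"],
    "mortgage": ["loan", "mortgage", "borrower"],
    "doctor_note": ["diagnosis", "medication", "patient"],
}

def classify_page(text):
    """Pick the document type whose keywords appear most in the page text."""
    text = text.lower()
    scores = {
        doc_type: sum(kw in text for kw in keywords)
        for doc_type, keywords in DOC_TYPE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(classify_page("Reimbursement claim, Insurer ID: AB-1234"))  # → claim_form
print(classify_page("Hello world"))                               # → unknown
```

Running a classifier like this per page is also the basis of a document splitter: consecutive pages with the same predicted type are grouped into one logical document.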
Again, analogous to our library books, now that all the books are accurately categorized and stacked, we can easily find a book of our choice. When we read a book, we might come across multiple formats of text, such as dense paragraphs interwoven with tables and some structured or semi-structured elements such as key-value pairs. As human beings, we can read and process that information: we know how to read a table or key-value elements in addition to a paragraph of text. Can we automate this step? Can we ask a machine to do the extraction for us?
The process of accurately extracting all elements, including structural elements, is broadly known as document extraction in the IDP pipeline. This helps us to identify key information from documents through extensive, accurate extraction. The intelligent capture of the data elements from documents during the extraction phase of the IDP pipeline helps us derive new insights in an automated manner.
Examples of the extraction stage include Named Entity Recognition (NER), which automatically identifies entities in unstructured data. We will look into the details more deeply in Chapter 4, Accurate Extraction with Amazon Comprehend.
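To give a feel for what entity extraction produces, here is a tiny rule-based sketch. Real NER, such as Amazon Comprehend's, uses trained models rather than regular expressions; the two patterns below are hypothetical stand-ins chosen only to show the shape of the output (entity type plus matched text).

```python
import re

# Hypothetical patterns standing in for a trained NER model.
ENTITY_PATTERNS = {
    "DATE": r"\b\d{2}/\d{2}/\d{4}\b",
    "AMOUNT": r"\$\d+(?:\.\d{2})?",
}

def extract_entities(text):
    """Return (type, match) pairs for each pattern found in the text."""
    entities = []
    for entity_type, pattern in ENTITY_PATTERNS.items():
        for match in re.findall(pattern, text):
            entities.append((entity_type, match))
    return entities

sample = "Test kit purchased on 01/12/2022 for $24.99."
print(extract_entities(sample))
# → [('DATE', '01/12/2022'), ('AMOUNT', '$24.99')]
```

A trained model replaces the hand-written patterns but keeps the same contract: text in, typed entities out, ready for the enrichment stage that follows.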
To get insights and business value out of your document, you will need to understand the dynamic topics and document attributes in your document. During the document enrichment stage, you append or enhance the existing data with additional business or domain-specific context from additional sources.
For example, while processing healthcare claims, we sometimes need to refer to a doctor's note to verify the medical condition mentioned in the claims form, so additional documents such as doctor's notes are requested for further processing. From a raw doctor's note, we derive medical information such as details about medications and medical conditions – being able to get this directly from the document is critical to enabling business value such as improved patient care. To achieve this, we need the medical context, metadata, attributes, and domain-specific knowledge. This is an example of the enrichment stage of the IDP pipeline.
While entity recognition can extract metadata from texts of various document types, we need a process to recognize the non-text elements in our documents. This is where the object detection feature comes in handy. This can be extended further into identifying personal information with Personally Identifiable Information (PII) and Protected Health Information (PHI) detection methods. We can also de-identify our PII or PHI for further downstream processing. We will look into the details in Chapter 5, Document Enrichment in Intelligent Document Processing.
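As an illustration of de-identification, once a PII detection step has returned character offsets for each detected entity, redaction itself can be a simple offset-based replacement. The entity list below is hypothetical sample data shaped like the offsets a detection API might return, not real service output.

```python
def redact(text, entities):
    """Replace each detected span with its entity type, working from the
    end of the string so earlier offsets stay valid."""
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = (
            text[: entity["BeginOffset"]]
            + "[" + entity["Type"] + "]"
            + text[entity["EndOffset"]:]
        )
    return text

note = "Patient John Doe, SSN 123-45-6789, visited on Monday."
# Hypothetical offsets, shaped like a PII detection response.
pii = [
    {"Type": "NAME", "BeginOffset": 8, "EndOffset": 16},
    {"Type": "SSN", "BeginOffset": 22, "EndOffset": 33},
]

print(redact(note, pii))
# → Patient [NAME], SSN [SSN], visited on Monday.
```

Processing the spans from the end of the string backward is the key detail: replacing a span changes the length of the text, so redacting front-to-back would invalidate every later offset.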
Document post-processing (review and verification)
Going back to our book library example, there are certain instances when a library gets a new book and places it in a new-books section instead of categorizing it by genre. This is one of the specific rules the library follows for certain books: some rules and post-processing are required to organize the books in the library.
Similarly, with document processing, you might want to apply your business rules or domain-specific validation to check for completeness. For example, in an insurance claims processing pipeline, you want to validate the insurer ID and other basic information to check that the claims form is complete. This is a type of post-processing in the IDP pipeline.
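Such completeness checks can be expressed as a simple rule set over the extracted fields. The required-field names below are hypothetical examples; the actual rules would come from your business or domain.

```python
# Hypothetical business rules: fields a claims form must contain.
REQUIRED_FIELDS = ["insurer_id", "patient_name", "claim_amount"]

def validate_claim(fields):
    """Return the list of required fields that are missing or empty."""
    return [f for f in REQUIRED_FIELDS if not fields.get(f)]

claim = {"insurer_id": "AB-1234", "patient_name": "John Doe"}
missing = validate_claim(claim)
if missing:
    print("Route for human review; missing:", missing)
# → Route for human review; missing: ['claim_amount']
```

The output of the check doubles as a routing signal: a claim with no missing fields can flow straight through, while an incomplete one is sent to the human review path discussed next.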
Additionally, the extraction or enrichment steps previously discussed may not give you the accuracy required for your business needs, and you may want to include a human workforce for manual review to achieve higher accuracy. Having a human review some documents, or certain fields within them, as part of an automated workflow can also be part of the post-processing phase of IDP. Human review can be expensive, so we route only a limited set of documents or fields for review, as per our business needs and requirements. We will further discuss this phase in Chapter 6, Review and Verification of Intelligent Document Processing.
In our book library example, we would love a centralized, unified portal to track all our library books and their statuses. Nowadays, libraries support online catalogs where, in a centralized portal, you can check all the books in the library, their reservation statuses, the number of copies, and additional information such as the author and ratings. Here, we not only maintain and organize the book library but also integrate the library's information with our portals – and we might maintain several different portals or tracking systems to manage our books. This is the integration and consumption stage for our library books.
Similarly, in our IDP pipeline, we collect our documents and categorize them during the data capture and classification stages, then accurately extract all the required information from them. With the enrichment stage, we derive additional information and transform our data for our business needs. Now it is time to consume that information: we need to integrate with our existing systems to consume the information and insights derived from our documents. Most of the time, I come across customers who already use an existing portal or tracking system and want to integrate the insights derived from their documents with that system; this also helps them build a 360-degree view of their product from the consumer perspective. At other times, a customer just wants a data dump into their database for better, faster queries. There can be many different ways and channels through which you consume the extracted information. This stage is known as the consumption or integration stage of our IDP pipeline.
Let’s now summarize the chapter.
In this chapter, we discussed the current challenges in document processing and how IDP can help overcome those challenges. We introduced IDP by tracing the origins of AI, how it has evolved over the last few decades, and how AI became an integral part of our everyday lives.
We then reviewed industry trends and market segmentation and saw, with examples, how important it is to automate document processing. We also discussed IDP use cases across industries, including an example of how patient data can be collected and enriched to improve patient outcome prediction.
Finally, we reviewed the stages of the IDP pipeline: data capture, classification, extraction, enrichment, and post-processing. This chapter gave you an understanding of IDP and the various stages involved in automating the end-to-end pipeline.
In the next chapter, we will go through the details of the data capture stage and document classification with AWS AI services. We will also look into the details of AWS AI services such as Amazon Comprehend custom classification and Amazon Rekognition for document classification.