RAG from First Principles

Data Import

Whether you are building a RAG system with low-code tools like Coze or Dify, or coding with open-source frameworks such as LangChain or LlamaIndex, parsing files and reading their content is the crucial first step in the entire retrieval-augmented generation (RAG) pipeline of embedding, retrieval, and generation. If the system cannot accurately recognize and parse files, it cannot construct a knowledge base for the corresponding domain, and the subsequent RAG tasks become impossible.

Figure 1.1: Two people sit at a desk discussing RAG systems

Does the parsing process depend on file type?

In enterprise environments, file types are often diverse: alongside common formats such as PDF and .doc/.docx, there may also be uncommon or proprietary formats. To handle this complexity, it is necessary to establish a clear parsing strategy for each file type and select suitable tools. Priority should be given to universal parsing libraries or tools that cover the common file formats. For example, the Unstructured tool can handle multiple file formats in the same directory in a unified manner.

Figure 1.2: The document import process using LangChain and LlamaIndex with various file types

For common and important file types (such as PDF), you can choose specialized tools (such as PyPDF2, PyMuPDF, Marker, MinerU, and PDFPlumber) to parse and extract the required content. The specific choice depends on the application scenario (such as structured data extraction, text content search, OCR, or image extraction).

For less common file types, you need to find out the tool that generated the file, its data structure, and its purpose. For example, if a directory contains a product_sales.jsonl file, you might infer from the name that it is a product sales log in JSON Lines format, which is a text file format where each line contains a single independent JSON object without commas or other delimiters between lines. This must be considered during parsing.
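
To make the JSON Lines point concrete, here is a minimal sketch that uses only Python's standard json module. The read_jsonl helper and the sample records are hypothetical and stand in for the real contents of product_sales.jsonl, which are not shown here.

```python
import json


def read_jsonl(lines):
    """Parse an iterable of JSON Lines text into a list of Python objects."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        try:
            # Each line is parsed on its own; the file as a whole is NOT one JSON document
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid JSON on line {line_number}: {exc}") from exc
    return records


# Hypothetical sample lines standing in for product_sales.jsonl
sample_lines = [
    '{"product": "staff", "units": 3}',
    '{"product": "armor", "units": 1}',
]
sales = read_jsonl(sample_lines)
print(len(sales), sales[0]["product"])
```

Because an open file object yields one line at a time, the same helper can be called as read_jsonl(f) inside a with open(...) block. Calling json.load on the whole file would fail, since the lines together do not form a single JSON value.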

For complex documents, you can first convert them to an intermediate format that is easier to parse. For instance, you can convert PPT files to PDF format before extracting information (since PDF extraction tools are more plentiful and offer more options). Similarly, sometimes it is best to first convert PDF files uniformly into Markdown documents and then process them, because the Markdown format is more standardized and thus makes it easier to extract headings, paragraphs, and lists as pure text. Moreover, since large models are trained on web resources, they are naturally more familiar with Markdown format, which is more conducive to their understanding.
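
As a rough illustration of why Markdown is convenient as an intermediate format, the following sketch pulls a heading outline out of a Markdown string with a regular expression. The extract_headings function and the sample text are hypothetical, and the sketch only recognizes "#"-style headings; a real pipeline would more likely rely on a Markdown parser.

```python
import re

# Matches headings such as "## Setting": one to six '#' characters, then the title
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def extract_headings(markdown_text):
    """Return (level, title) pairs for '#'-style headings outside code fences."""
    headings = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # skip '#' comments inside code blocks
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


# Hypothetical Markdown produced by a PDF-to-Markdown conversion step
sample_md = "# Black Myth: Wukong\n\nIntro text.\n\n## Setting\n\n- Shanxi Province\n"
outline = extract_headings(sample_md)
print(outline)
```

The resulting outline can later be attached to chunks as metadata, which is much harder to do reliably from raw PDF text.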

In addition, large models can also be used to assist in parsing structured, semi-structured, and unstructured data, especially in certain complex scenarios (such as multimodal situations), where they can achieve very good results.

In the next section, let’s delve into how to parse various types of documents into plain text and import them into the RAG system.

Reading simple text with a DataLoader

Let’s start by explaining how to read documents with the simplest and most intuitive TXT file format.

Figure 1.3: A text file in Notepad describes game story chapters and cultural elements in detail

Frameworks such as LangChain and LlamaIndex offer various data loaders to parse documents into data objects of specific formats. If you do not wish to use these frameworks, you can choose standalone document parsing tools like Unstructured.

Using LangChain to read TXT files and generate document objects

LangChain’s TextLoader class can be used to read TXT files and parse them into LangChain Document data objects. First, ensure that the langchain, langchain-core, and langchain-community packages are installed. The following code can then be used to read a TXT file using LangChain (refer to https://github.com/PacktPublishing/RAG-from-First-Principles for the complete code):

from langchain_community.document_loaders import TextLoader
loader = TextLoader("data/black myth/The setting of Black Myth Wukong.txt")
documents = loader.load()
print(documents)

[Document(metadata={'source': 'data/black myth/The setting of Black Myth Wukong.txt'}, page_content='The story of "Black Myth: Wukong" can be divided into 6 sections, namely "Fire Illuminates Black Clouds", "Wind Rises at Dusk", "Night Gives Birth to White Dew", "Curved Purple Mandarin Ducks", "Sunset in the Mortal World", and "Unfinished", and features two endings. The player's choices and experiences will affect the final outcome.\nAt the end of each section, two-dimensional and three-dimensional animated cutscenes are attached, showcasing and exploring the narrative and thematic elements in "Black Myth: Wukong".\nThe game's setting blends Chinese culture and natural landmarks. For example, the Dazu Rock Carvings in Chongqing, Xiaoxitian, Nanchan Temple, Tiebusi Temple, Guangsheng Temple, and the Stork Tower in Shanxi Province, etc., all appear in the game. The game also incorporates philosophical elements of Buddhism and Taoism.')]

In LangChain, the Document object is a core data structure that represents text content loaded from external files or other data sources. As you can see in the preceding code, the Document object primarily contains the following two attributes:

metadata: Stores metadata related to the document, such as the document’s source path, author, and date.
page_content: Stores the actual text content, which is the main data part of the document.

Alex: Why do we need to use this Document object?

Lewis: First of all, the metadata recorded in the Document object is very important. Although this metadata is not the actual content of the document, it contains rich information and plays an important role in RAG systems. The data of a document may come from various formats, such as TXT, PDF, HTML, or database records. Relying on the raw string alone makes it impossible to track its source or obtain related additional information (such as date, category, etc.). In many natural language processing or information retrieval tasks, combining metadata for filtering, sorting, or analysis is an essential step. In some advanced indexing techniques, metadata can also be used to store summaries of the text content, associated parent document IDs, etc.

Moreover, since different data sources have their own characteristics (such as chunking strategies, paragraph breaks, etc.), it is necessary to perform abstraction and standardization through a unified data structure (like the Document object). The Document object can abstract diverse data sources into a unified and structured form, making it convenient for seamless processing within LangChain, ensuring that documents can be smoothly delivered to embedding models, classifiers, or Q&A systems.

The following code example demonstrates how to directly create a LangChain Document object:

from langchain_core.documents import Document
documents = [
    Document(page_content="Black clouds lit by fire", metadata={"source": "scene_list.txt"}),
    Document(page_content="Wind rises at dusk", metadata={"source": "scene_list.txt"}),
]
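
To show why carrying metadata alongside the text is useful, here is a minimal sketch of metadata-based filtering. SimpleDocument is a hypothetical stand-in that exposes the same two attributes as a LangChain Document, so the example runs without LangChain installed; the filter_by_metadata helper and the category values are also assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in with the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(documents, **criteria):
    """Keep only the documents whose metadata matches every given key/value pair."""
    return [
        doc for doc in documents
        if all(doc.metadata.get(key) == value for key, value in criteria.items())
    ]


docs = [
    SimpleDocument("Black clouds lit by fire", {"source": "scene_list.txt", "category": "scene"}),
    SimpleDocument("Wind rises at dusk", {"source": "scene_list.txt", "category": "scene"}),
    SimpleDocument("Sun Wukong", {"source": "characters.txt", "category": "character"}),
]
scenes = filter_by_metadata(docs, category="scene")
print([doc.page_content for doc in scenes])
```

The helper only reads the metadata attribute, so it should work unchanged on real LangChain Document objects. With raw strings, this kind of filtering would not be possible at all.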

Data loaders in LangChain

LangChain’s data import tools are not limited to just TextLoader. For example, CSVLoader can load data in CSV table format, JSONLoader can import JSON files, while PyPDFLoader or PyMuPDFLoader can be used to parse PDF files.

It is worth noting that for the same document format, LangChain may provide several different loaders. Take PDF files as an example; LangChain offers more than 10 different loaders, which can sometimes be overwhelming and make it hard to determine which one is the most suitable.

Figure 1.4: A code editor shows a Python dictionary mapping loader names to module paths for document loaders

Although LangChain’s vast ecosystem adds to its complexity, which is a point it is often criticized for, it also gives LangChain powerful document processing capability, enabling it to easily handle a wide range of data sources and support complex natural language processing workflows.

In LangChain’s official documentation, we can find detailed explanations of loaders for common file types.

Figure 1.5: A LangChain web page shows a table listing document loaders and their supported file types

Using LangChain to read all files in a directory

Usually, you may want to read all different types of files in a directory at once and convert them into Document objects for unified management. In LangChain, this can be achieved using the DirectoryLoader (directory file loader).

Figure 1.6: A folder named contains files in formats like txt, md, jpg, csv, pdf, and settings.txt

To use DirectoryLoader, you can follow these steps:

Install the unstructured package using the following commands:

pip install unstructured
pip install 'unstructured[image]'
pip install 'unstructured[md]'
sudo apt-get install tesseract-ocr # Here we take Ubuntu as an example
pip install pytesseract

The following code shows how to use DirectoryLoader to load various types of files from a specified directory and generate an object for each document (refer to https://github.com/PacktPublishing/RAG-from-First-Principles for the complete code):

from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader('./data/black myth')
docs = loader.load()
print(f'Number of documents: {len(docs)}')  # Print the total number of documents

Number of documents: 7

Specific file types in a particular directory can be loaded by specifying the file path and a file matching pattern (such as wildcards):
```
loader = DirectoryLoader('./', glob='**/*.md') # Load only Markdown files in the directory
```
If you want to see a progress bar during the loading process, you can install the tqdm library and enable the show_progress parameter:
```
loader = DirectoryLoader('./', show_progress=True) # Display a progress bar during loading
```
By default, DirectoryLoader uses a single thread to load files. To increase loading speed, you can enable multithreading:
```
loader = DirectoryLoader('./', use_multithreading=True) # Enable multithreading for document loading
```
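
To see which files a pattern such as '**/*.md' selects, the following sketch builds a small temporary directory and applies the pattern with Python's pathlib. It assumes that the glob parameter follows pathlib's glob semantics; the file names are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes").mkdir()
    # Hypothetical files: two Markdown files at different depths, plus other types
    for name in ["intro.md", "settings.txt", "notes/versions.md", "notes/data.csv"]:
        (root / name).write_text("placeholder", encoding="utf-8")

    # '**/*.md' matches Markdown files in the root and in any subdirectory
    markdown_files = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.md"))
    # '*.md' matches only Markdown files directly in the root
    top_level_only = sorted(p.relative_to(root).as_posix() for p in root.glob("*.md"))

print(markdown_files)
print(top_level_only)
```

Checking a pattern this way before handing it to a loader is a cheap way to confirm that no unexpected files will be parsed.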

Under the hood, DirectoryLoader uses UnstructuredLoader by default (this is LangChain’s integration of the document parsing tool unstructured) to load and parse files. However, we can also specify other loaders through the loader_cls parameter. For example, in the following code we use TextLoader to load TXT files (including Markdown and similar files):

from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader
loader = DirectoryLoader('data/black myth',
    glob='**/*.md',
    loader_cls=TextLoader  # Specify a particular loader
)
docs = loader.load()
print(docs[0].page_content[:100])  # Print the first 100 characters of the first document's content

TextLoader preserves the formatting of headings when processing Markdown files, whereas UnstructuredLoader does not retain it. The following figure illustrates this difference:

Figure 1.7: Image compares UnstructuredLoader and TextLoader results for text formatting with emojis

The difference shown here is mainly to illustrate that different loaders may produce different results when parsing the same type of file. This does not mean that TextLoader is superior to UnstructuredLoader, or vice versa. In fact, each loader has its own appropriate scenarios: TextLoader is more suitable for loading pure text files with simple structure, including but not limited to Markdown files; while UnstructuredLoader has broader applicability, is suitable for more types of file formats, and is able to extract richer structured information during processing.

If you attempt to use TextLoader to import all types of files without specifying the file type, it will throw an error when it encounters a file type it does not support:
```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
```

To prevent the program from being interrupted by these errors, you can set the parameter silent_errors=True to allow the loader to skip files that cannot be loaded and continue processing the remaining files:

loader = DirectoryLoader('data/black myth',
    silent_errors=True,  # Skip files that cannot be loaded
    loader_cls=TextLoader
)
docs = loader.load()

Error loading file data/black myth/black myth wukong.pdf
Error loading file /data/black myth/Black Myth in English.jpg

With this configuration, when TextLoader attempts to load all files in the directory, it will skip unsupported file types such as PDFs or images and log the error message instead of raising an exception that interrupts program execution.
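
The skip-on-error behavior can be sketched with the standard library alone. The load_text_files function below is a hypothetical illustration of the idea, not LangChain's implementation: it tries to decode every file as UTF-8 and, when silent_errors is enabled, records the failures instead of raising.

```python
import tempfile
from pathlib import Path


def load_text_files(directory, silent_errors=False):
    """Read every file as UTF-8 text; optionally skip files that fail to decode."""
    loaded, skipped = {}, []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        try:
            loaded[path.name] = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            if not silent_errors:
                raise  # same failure mode as the error shown above
            skipped.append(path.name)
    return loaded, skipped


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "setting.txt").write_text("Wind rises at dusk", encoding="utf-8")
    # Hypothetical binary content: these bytes are not valid UTF-8
    Path(tmp, "cover.jpg").write_bytes(b"\xff\xd8\xff\xe0\xb5\x00")
    loaded, skipped = load_text_files(tmp, silent_errors=True)

print(list(loaded), skipped)
```

Returning the list of skipped files makes it easy to route them to a more suitable parser afterwards, rather than silently losing them.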

Using LlamaIndex to read all documents in a directory

Similar to LangChain, LlamaIndex provides powerful tools to load documents from a directory and parse these documents into LlamaIndex’s Document objects. In LlamaIndex, these kinds of tools are known as data connectors or readers.

One simple and easy-to-use Reader is SimpleDirectoryReader. It can load various types of files from a specified directory, including Markdown, PDF, PPT, Word, and audio/video files.

The following code loads the directory and prints the number of Document objects generated:

from llama_index.core import SimpleDirectoryReader
dir_reader = SimpleDirectoryReader('data/black myth')
documents = dir_reader.load_data()
print(f'Number of documents: {len(documents)}')

Number of documents: 11

Next, let’s print one of the Document objects to see its overall structure:

print(documents[1])

Document (id_='d48c275b-c62b-450b-a575-a6ff45ca9a91', embedding=None, metadata={'file_path': '/home/huangjia/Documents/08_RAG/Book2411/rag_240917/data/black myth/ Black Myth Version Introduction.md', 'file_name': 'Black Myth Version Introduction.md', 'file_type': 'text/markdown', 'file_size': 1418, 'creation_date': '2024-11-26', 'last_modified_date': '2024-11-26'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='\n\nBlack Myth: Wukong \n\n> Black Myth: Wukong is a highly anticipated action-adventure game developed by a Chinese game development team. Based on Journey to the West, it reinterprets the classic story and delivers an impactful visual and gameplay experience.\n', mimetype='text/plain', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

You can see that the metadata generated in the Document object by LlamaIndex is richer than that generated by LangChain, including file path, file type, file size, creation date, modification date, etc. This metadata provides more contextual support for document management and subsequent analysis. In addition, LlamaIndex also offers the excluded_embed_metadata_keys and excluded_llm_metadata_keys options, which can be used to specify which metadata should not be included in information embedding or large model processing. This is particularly useful when you need to streamline the context or improve retrieval efficiency.
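
The effect of these exclusion lists can be sketched in plain Python. The render_for_embedding function below is a simplified, hypothetical illustration and not LlamaIndex's own code; it reuses the text_template, metadata_template, and separator values visible in the printed Document, and the category key is an assumed extra metadata field.

```python
def render_for_embedding(text, metadata, excluded_keys,
                         text_template="{metadata_str}\n\n{content}",
                         metadata_template="{key}: {value}",
                         metadata_separator="\n"):
    """Combine text with the metadata entries that are not excluded."""
    visible = {k: v for k, v in metadata.items() if k not in excluded_keys}
    metadata_str = metadata_separator.join(
        metadata_template.format(key=k, value=v) for k, v in visible.items()
    )
    return text_template.format(metadata_str=metadata_str, content=text)


metadata = {
    "file_name": "Black Myth Version Introduction.md",
    "file_size": 1418,
    "category": "Game Scene",  # hypothetical extra field
}
rendered = render_for_embedding(
    "Black Myth: Wukong",
    metadata,
    excluded_keys=["file_name", "file_size"],
)
print(rendered)
```

Only the category line reaches the rendered text here, which shows how excluding bookkeeping fields such as file size keeps the embedded context focused on meaningful information.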

Overall, LlamaIndex excels in the structuring and granular management of Document objects, meeting the needs of enterprises for handling diverse and complex data.

Alex: I also noticed that when importing the same directory using LlamaIndex, the number of generated Document objects is higher than that generated by LangChain.

Lewis: Yes. By default, LangChain generates one Document object for each original file when importing, without performing chunking. LlamaIndex, in principle, also works this way. However, for certain specific file types, such as CSV files, LlamaIndex will automatically split them into multiple parts, with each part treated as an independent Document object. This means that during the import process, LlamaIndex performs chunking for CSV files, resulting in a higher number of Document objects.

You can read specific files in a directory using the following code (refer to https://github.com/PacktPublishing/RAG-from-First-Principles for the complete code):

file_reader = SimpleDirectoryReader(input_files=['data/black myth/The setting of Black Myth Wukong.txt'])
documents = file_reader.load_data()

The following code demonstrates how to generate a LlamaIndex Document object directly and manually add metadata:

from llama_index.core import Document
documents = [
Document(
text='An underground cave filled with flames and the scent of sulfur, where fire jets continuously erupt from below, illuminating the whole abyss. Wukong must use his jumping ability and golden staff to make his way through the lava.',
metadata={
'filename': 'Blazing Abyss.md',
'category': 'Game Scene',
'author': 'Ka Ge AI',
'creation_date': '2024-11-20',
},
), ... ]

Now that you know how to load various types of files from a specific directory, the next step is to read the data from the files.

Connecting Readers with LlamaHub and reading database entries

For file types that cannot be processed by SimpleDirectoryReader, LlamaIndex supports downloading and installing more advanced Readers through LlamaHub.

Next, we introduce the MySQL database Reader as an example. Before usage, you should first install the Database Reader connector.

First, execute the following commands to perform the necessary installation:

pip install llama-index-readers-database
sudo apt-get install libmysqlclient-dev
sudo apt-get install python3-dev
pip install mysqlclient

Then use the following code to load data from a MySQL database:

from llama_index.readers.database import DatabaseReader
reader = DatabaseReader(
scheme='mysql',
host='localhost',
port=3306,
user='username',
password='password',
dbname='example_db'
)
query = 'SELECT * FROM game_scenes' # Select all game scenes
documents = reader.load_data(query=query)
print(f'Number of documents loaded from the database: {len(documents)}')
print(documents)

[Document(id_='43594ec8-2751-496f-b0eb-dbfa183d20a4', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='id: 1, scene_name: Zhu Jia Village, description: The first village where the game starts, full of strong ancient Chinese rural flavor, region: Eastern Plains, environment_type: village, main_enemies: Bandits, monster minions, special_features: Important NPC blacksmith shop, weapons can be upgraded', path=None, url=None, mimetype=None), image_resource=None, audio_resource=None, video_resource=None, text_template='{metadata_str}\n\n{content}'),  ... ]

This approach can directly convert the database query results into Document objects, while defining the {key}: {value} data structure pattern in the metadata_template field.
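
To illustrate the row-to-text conversion without a running MySQL server, the following sketch swaps in Python's built-in sqlite3 module with an in-memory database. It is not the DatabaseReader implementation; the reduced table schema and the row are assumptions modeled on the output above.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL example_db
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE game_scenes (id INTEGER, scene_name TEXT, region TEXT)")
conn.execute("INSERT INTO game_scenes VALUES (1, 'Zhu Jia Village', 'Eastern Plains')")

cursor = conn.execute("SELECT * FROM game_scenes")
columns = [col[0] for col in cursor.description]  # column names of the result set

# Turn each row into one "key: value, key: value" string, one string per document
row_texts = [
    ", ".join(f"{name}: {value}" for name, value in zip(columns, row))
    for row in cursor.fetchall()
]
conn.close()
print(row_texts[0])
```

Each resulting string carries the column names inline, so the text stays understandable to an embedding model even after it is separated from the table schema.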

Alex, you can spend some time reading the official documentation on LlamaHub at https://developers.llamaindex.ai/python/framework/module_guides/loading/connector/.

Reading various types of documents with the unstructured tool

If you prefer not to use any framework and want to build your own RAG system from scratch, using the Unstructured tool to read various types of documents is a good choice. Unstructured is an open-source document processing tool specifically designed to support parsing multiple types of documents, and it can effectively preserve the original structural information of documents during processing.

Figure 1.8: The Unstructured tool

Similar to LangChain and LlamaIndex, the Unstructured tool generates its own type of data object when it imports documents, called an Element.

First, use the partition_text function to examine the process of reading a text file:

from unstructured.partition.text import partition_text
filename = 'data/black myth/The setting of Black Myth Wukong.txt'
elements = partition_text(filename=filename)
for element in elements:
    print(element)

In fact, the partition_text function is part of the underlying logic of the LangChain directory loader.

Next, let’s see what details are contained in the generated elements data objects:

for i, element in enumerate(elements):
    print(f"\n--- Element {i+1} ---")
    print(f"Element type: {element.__class__.__name__}")
    print(f"Text content: {element.text}")
    if hasattr(element, 'metadata'):
        print('Metadata:')
        metadata = vars(element.metadata)
        valid_metadata = {k: v for k, v in metadata.items()
                          if not k.startswith('_') and v is not None}
        for key, value in valid_metadata.items():
            print(f"  {key}: {value}")

The story of "Black Myth: Wukong" can be divided into 6 sections, named "Fire Illuminates Black Clouds", "Wind Rises at Dusk", "Night Gives Birth to White Dew", "Twisted Purple Mandarins", "Sunset in the Mortal World", and "Unfinished", and has two possible endings. The player's choices and experiences will influence the ultimate ending. At the end of each section, there are 2D and 3D animated cutscenes that present and explore the narrative and thematic elements of "Black Myth: Wukong".

In this way, the Unstructured tool not only imports the document but also splits it into elements according to specific rules.

By using the partition function, you can automatically read files of any type:

from unstructured.partition.auto import partition
filename = 'data/black myth/black myth wukong.pdf'
elements = partition(
filename=filename,
content_type='application/pdf'
)
print('\n\n'.join([str(el) for el in elements][:10]))

Although the partition function is generic and applicable to a variety of file types, its functionality is relatively simple when processing specific files. In contrast, specialized functions like partition_html and partition_pdf can showcase more distinctive features and advantages when dealing with their respective document types.

Although the Unstructured tool is powerful, it is not the only choice. In actual projects, we often need to choose different tools based on specific requirements. For example, to process PDF files, PyMuPDF is also a common choice (of course, there are many other options as well). Before using PyMuPDF, please install it first by using the following command:

pip install pymupdf

The following code example extracts the text of each page of the PDF with PyMuPDF:

import pymupdf
doc = pymupdf.open('data/black myth/black myth wukong.pdf')
text = [page.get_text() for page in doc]

You can compare the differences in the text formats parsed by different tools.
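
One simple way to make such a comparison is with Python's difflib module. In the sketch below, the two strings are hypothetical stand-ins for the text that two different tools might return for the same page, and summarize_difference is an assumed helper name.

```python
import difflib


def summarize_difference(text_a, text_b):
    """Return a similarity ratio and a unified diff for two extracted texts."""
    ratio = difflib.SequenceMatcher(None, text_a, text_b).ratio()
    diff = list(difflib.unified_diff(
        text_a.splitlines(), text_b.splitlines(),
        fromfile="tool_a", tofile="tool_b", lineterm="",
    ))
    return ratio, diff


# Hypothetical outputs: tool A keeps the Markdown heading marker, tool B drops it
text_a = "# Black Myth: Wukong\n\nFacing Destiny"
text_b = "Black Myth: Wukong\n\nFacing Destiny"
ratio, diff = summarize_difference(text_a, text_b)
print(f"Similarity: {ratio:.2f}")
print("\n".join(diff))
```

A ratio close to 1 with a short diff suggests the tools mostly agree, while a long diff points to differences in layout handling that are worth inspecting by eye.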

This concludes the introduction to simple text reading. Although the content introduced in these sections is basic, it covers a variety of technical aspects. The key point is that each tool will generate its own structured data objects, which contain a set of metadata. Hopefully, you can practice these hands-on to deepen your understanding of the various tools.

Parsing specific elements with the JSON loader

Choosing the appropriate loader for specific file types can improve the efficiency of data processing. In this section, we will explore the usage and characteristics of the JSON loader in LangChain.

First, let’s look at a JSON file containing richly structured data:

Figure 1.9: A JSON file for a Journey to the West game showing main and support character details

If we use TextLoader to load a JSON file, the input and output will be as follows:

from langchain_community.document_loaders import TextLoader
text_loader = TextLoader("data/Journey_to_the_West_Characters.json")
text_documents = text_loader.load()
print(text_documents)

[Document(metadata={'source': 'data/Journey_to_the_West_Characters.json'}, page_content='{
  "gameTitle": "Journey to the West",
  "basicInfo": {
    "engine": "Unreal Engine 5",
    "releaseDate": "2024-08-20",
    "genre": "Action Role-Playing",
    "platforms": ["PC", "PS5", "Xbox Series X/S"],
    "supportedLanguages": ["Simplified Chinese", "Traditional Chinese"]
  },
  "mainCharacter": {
    "name": "Sun Wukong",
    "backstory": "At the dawn of chaos... Sun Wukong.",
    "abilities": ["Seventy-Two Transformations", "Golden Hoop Staff", "Cloud Somersault", "Fire Eyes Golden Gaze"],
    "supportCharacters": [
      {
        "name": "White Dragon Horse",
        "identity": "One of the Eight Heavenly Dragons",
        "background": "Originally the Third Prince of the West Sea Dragon King...",
        "abilities": ["Water Escape", "Riding Clouds and Mist", "Transformation"]
      },
      {
        "name": "Red Boy",
        "identity": "Holy Infant King",
        "background": "Son of Bull Demon King and Princess Iron Fan...",
        "abilities": ["Samadhi True Fire", "Fire Eyes", "Combat Form"]
      },
      {
        "name": "Six-Eared Macaque",
        "identity": "Clone of Sun Wukong",
        "background": "A mysterious figure matching the Monkey King's abilities.",
        "abilities": ["Imitation", "Stealth", "Speed"]
      }
    ]
  }
}')]

From the above output, you can see that TextLoader reads the JSON file as plain text. This means the entire JSON content is stored as a string in the page_content field. If you want to use specific field values from page_content, you still need to further parse this string.
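
That extra parsing step can be sketched with the standard json module. The page_content string below is a hypothetical, shortened stand-in for what TextLoader returns for the JSON file.

```python
import json

# Hypothetical, shortened stand-in for the page_content string returned by TextLoader
page_content = (
    '{"gameTitle": "Journey to the West", '
    '"mainCharacter": {"name": "Sun Wukong", "abilities": ["Cloud Somersault"]}}'
)

data = json.loads(page_content)  # turn the raw string back into nested dicts and lists
main_name = data["mainCharacter"]["name"]
abilities = data["mainCharacter"]["abilities"]
print(main_name, abilities)
```

This works, but the parsing logic lives outside the loader, so every consumer of the Document has to repeat it.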

If you use JSONLoader, you can directly extract specific elements from the JSON file through jq query syntax.

First, we need to install the necessary library jq, which is a lightweight JSON processing tool suitable for parsing, manipulating, and formatting JSON data:

pip install jq

The following code example demonstrates how to use JSONLoader to parse a JSON file (refer to https://github.com/PacktPublishing/RAG-from-First-Principles for the complete code):

from langchain_community.document_loaders import JSONLoader
## Extract and Print Main Character Information
print("Main character information:")
main_loader = JSONLoader(
file_path="data/black myth/black mythpersona.json",
jq_schema='.mainCharacter | "Name:" + .name + ",Backstory:" + .backstory',
text_content=True
)
main_char = main_loader.load()
print(main_char)
## Extract and Print Supporting Character Information
print("\nSupporting character information:")
support_loader = JSONLoader(
file_path="data/black myth/black mythpersona.json",
jq_schema='.supportCharacters[] | "Name:" + .name + ",Background:" + .background',
text_content=True
)
support_chars = support_loader.load()
print(support_chars)

In the output, each character is split into its own Document object:

Main character information:
[Document(metadata={'source': '/journey_to_the_west_persona.json', 'seq_num': 1}, page_content='Name: Sun Wukong......')]
Supporting character information:
[Document(metadata={'source': ' journey_to_the_west_persona.json', 'seq_num': 1}, page_content='Name: White Dragon Horse......')
Document(metadata={'source': ' journey_to_the_west_persona.json', 'seq_num': 2}, page_content='Name: Red Boy......', ......)]

As you can see, JSONLoader splits the characters into separate Document objects, one per character, and numbers them using seq_num. Each Document object not only contains the original document’s metadata (such as the source file name), but also reflects the parsed internal data structure of the document, i.e., the specific field information.
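
To clarify what a jq expression such as '.supportCharacters[] | ...' does, here is a rough plain-Python equivalent. It is only an illustration of the splitting and numbering idea, not JSONLoader's implementation; the JSON string is a hypothetical, shortened sample, and plain dictionaries stand in for Document objects.

```python
import json

# Hypothetical, shortened sample with supportCharacters at the top level
raw = """
{
  "supportCharacters": [
    {"name": "White Dragon Horse", "background": "Third Prince of the West Sea Dragon King"},
    {"name": "Red Boy", "background": "Son of Bull Demon King and Princess Iron Fan"}
  ]
}
"""

data = json.loads(raw)
records = [
    {
        # One record per array element, like '.supportCharacters[]' in jq
        "page_content": f"Name: {character['name']}, Background: {character['background']}",
        # Sequence numbers start at 1, matching the seq_num values in the output above
        "metadata": {"seq_num": index},
    }
    for index, character in enumerate(data["supportCharacters"], start=1)
]
for record in records:
    print(record)
```

Iterating over the array is what produces multiple Document objects from a single file, and the string concatenation decides which fields end up in page_content.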

Reading text from images

Lewis: In many real-world AI applications, valuable information is embedded not only in plain text files but also in images, scanned documents, presentations, and PDFs. Let’s explore how modern technologies can be used to read, parse, and process image-based information in practical AI workflows.

Reading text using UnstructuredLoader

Alex: Lewis, the Unstructured tool can read files in various formats. You also mentioned earlier that LangChain’s directory loader uses UnstructuredLoader by default to load documents. Could you explain this tool in detail?

Lewis: Unstructured is a text extraction toolkit provided by Unstructured.IO. It can run locally or be used via the Unstructured API, and it supports parsing multiple types of documents.

If you wish to run the Unstructured tool with minimal installation, you can execute the following command and install the dependencies for different document types as needed:

pip install unstructured

If you want to call the Unstructured API, you need to execute the following command:

pip install unstructured-client

You will then need to apply for and configure the corresponding API Key.

If you want to use this tool in LangChain, you can run the following command to install the related package:

pip install langchain-unstructured

Besides the general UnstructuredLoader, LangChain has also integrated various other Unstructured document loaders for specific file formats, such as UnstructuredExcelLoader, UnstructuredMarkdownLoader, and UnstructuredImageLoader. For a complete list of loaders, you can visit the official LangChain website for more information.

Reading text using UnstructuredImageLoader

In this section, we’ll use UnstructuredImageLoader to try reading an image containing English text.

Figure 1.10: A warrior monkey in ornate armor stands holding a staff with Black Myth Wukong text above

Let’s load the image and extract its text:

from langchain_community.document_loaders import UnstructuredImageLoader
image_path = "data/black myth/Black Myth in English.jpg"
loader = UnstructuredImageLoader(image_path)
data = loader.load()
print(data)

This code yields the following output:

yolox_l0.05.onnx: 100%|██████| 217M/217M [00:01<00:00, 116MB/s]
[Document(metadata={'source': 'data/black myth/Black Myth in English.jpg'}, page_content=',\n\nPons\n\n= ens eens WUKONGY\n\n4')]

From the preceding output, you can see that this process invoked deep learning models (such as YOLO) to analyze the image pixel information in order to recognize and extract the text content, which is one way OCR technology works. Additionally, you can specify other OCR methods such as Tesseract by setting parameters. The extracted text is also encapsulated in a Document object.

However, in this example, “Black Myth WUKONG” was incorrectly recognized as “ens eens WUKONGY,” which shows that the OCR result is far from ideal. This may be due to characteristics of the image itself that make accurate recognition difficult.

Reading text from a PPT

Alex: This is just text extraction. What about analyzing image content?

Lewis: The Unstructured tool specializes in extracting and parsing textual content from files, rather than analyzing or processing images themselves. This means it cannot directly read the content of images, nor can it tell us there’s a majestic monkey in a picture. If you want to understand the specific content of images (or files such as PPTs or PDFs), you need to call the API of a large model or use a local multimodal model capable of analyzing images (such as BLIP).

Alex: LangChain’s loader integrates the capabilities of external tools like Unstructured and generates Document objects. If we directly use these external tools for file parsing, that’s also totally fine, right?

Lewis: That’s right.

The following code example demonstrates how to directly use the partition_ppt function of the Unstructured tool to read text from a PPT (refer to https://github.com/PacktPublishing/RAG-from-First-Principles for the complete code):

from unstructured.partition.ppt import partition_ppt
ppt_elements = partition_ppt(filename="data/black myth wukongPPT.pptx")
for element in ppt_elements:
    print(element.text)

Facing Destiny
Prologue
"Black Myth: Wukong" is a Chinese mythological action RPG adapted from "Journey to the West". Players take on the role of the "Chosen One" and pursue the secrets behind the legend during a perilous Westward adventure.
Adapted from the Chinese fantasy novel "Journey to the West"......

After obtaining the parsed raw text, the next step is to manually convert the contents of ppt_elements into LangChain's Document objects.

from langchain_core.documents import Document
documents = [
    Document(page_content=element.text,
             metadata={"source": "data/black myth wukongPPT.pptx"})
    for element in ppt_elements
]
print(documents[0:3])

This is equivalent to manually implementing the PowerPoint loader that LangChain provides (UnstructuredPowerPointLoader).

Using large models for holistic image-text parsing

In Q&A systems, we often want to upload PDF or PPT files directly to the knowledge base and answer questions based on the images they contain. Tools such as Unstructured cannot yet parse image content holistically, but modern multimodal large models handle this task well.

For example, when we upload a PDF file containing both text and images, the model first parses the image content and generates a description such as “A majestic monkey stands on the mountaintop, surrounded by drifting clouds,” then builds a contextual knowledge base by combining the textual information. This approach makes the integration of text and images more intuitive and complete, enabling cross-modal reasoning. That is, it can provide more complex answers by leveraging both the implicit semantics in the image and the information in the text. For example, it can answer questions like “What background environment might be related to the monkey’s ability to ride the clouds?”

Figure 1.11: A PDF page shows a Chinese fantasy game scene with a warrior character and text overlay

To read both images and texts from a PDF file, the implementation steps begin with invoking a large model to generate a caption for each page, and then converting these captions into LangChain’s required Document objects. (Running this program requires setting an OpenAI API Key in the environment variable.) Let’s look at the code for doing this:

Use pdf2image to extract each page from the PDF file as an image:

from pdf2image import convert_from_path
import base64
import os
output_dir = "temp_images"
os.makedirs(output_dir, exist_ok=True)
# Convert the PDF file to images and save them to the specified directory
images = convert_from_path("data/black myth/black myth wukong.pdf")
image_paths = []
for i, image in enumerate(images):
    image_path = os.path.join(output_dir, f'page_{i+1}.jpg')
    image.save(image_path, 'JPEG')
    image_paths.append(image_path)
print(f"Successfully converted {len(image_paths)} pages")

Next, use a multimodal large model to analyze the extracted images and generate descriptive text:

from openai import OpenAI
client = OpenAI()
print("\nStarting image analysis......")
results = []
for image_path in image_paths:
    with open(image_path, "rb") as image_file:
        base64_image = base64.b64encode(image_file.read()).decode('utf-8')
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Please describe this slide in detail, including its title, main text, and image content."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        }
                    }
                ]
            }
        ],
        max_tokens=300
    )
    results.append(response.choices[0].message.content)
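
The request built inside the loop above can be factored into a small helper, which makes the payload easy to inspect without calling the API. This is a minimal sketch: the function name build_page_message and the three sample bytes are illustrative assumptions, not part of the original program.

```python
import base64

def build_page_message(image_bytes, prompt):
    """Build one user message that carries a text prompt and a JPEG image."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                # The image travels inline as a base64 data URL.
                "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
            },
        ],
    }

# Hypothetical input: the first three bytes of a JPEG file.
message = build_page_message(b"\xff\xd8\xff", "Please describe this slide in detail.")
print(message["content"][1]["image_url"]["url"])
```

The returned dictionary has the same shape as the single entry in the messages list used in the loop.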

Finally, convert the generated descriptive text and related metadata into LangChain Document objects:

from langchain_core.documents import Document
documents = [
    Document(
        page_content=result,
        metadata={'source': 'data/black myth/black myth wukong.pdf', 'page_number': i+1}
    )
    for i, result in enumerate(results)
]
print('\nAnalysis results:')
for doc in documents:
    print(f'Content: {doc.page_content}\nMetadata: {doc.metadata}\n')

Wukong title

The main title is “Wukong,” accompanied by a red seal-style graphic.

Body

The central text is “BLACK MYTH WUKONG,” indicating this slide may be related to a game or project called “Black Myth: Wukong.”

The date below shows “08.20,” possibly hinting at an important release or event date.

There is also a phrase “Face Your Destiny,” which may convey the game’s theme or challenge.

Image content

The background of the slide features a mysterious black patterned design.

In the center is a close-up of a character’s face, featuring dense fur and intense eyes, giving a strong sense of power and aura.

The character’s expression is stern, highlighting its characteristics and creating a striking atmosphere.

Overall, this slide appears to showcase game information related to “Wukong,” delivering a fusion of ancient mythology and modern gaming.

Prologue title

Section (“Prologue”)

Body

The body describes a game called Black Myth: Wukong, a Chinese mythology action role-playing game adapted from Journey to the West. It highlights the player’s role as an adventurer in the game environment and explores background stories related to the Destined One.

Image content

The image shows a circle of blazing halo, seemingly glowing on a rock, creating a mysterious and fantastical atmosphere. The background is dark-toned, adorned with intricate patterns, adding visual appeal to the image.

Overall, the slide aims to introduce the game’s theme and background, using visual elements to enhance the feeling of exploring mythological stories.

Adaptations from Journey to the West

Adapted from the Chinese mythological novel “Journey to the West.”

Body

The slide contains no other text except the title.

Image content

The background is dark-toned, seemingly with a blurred human figure, possibly relevant to the plot or characters of Journey to the West, depicting a mysterious and classical atmosphere.

Overall, the design style of this slide likely aims to create a sense of history and mystery, focusing on the classic work Journey to the West.

Through the above steps, we have successfully converted the images and their contained text content from each page of the PDF file into structured text descriptions. Next, by combining frameworks such as LangChain and LlamaIndex, we can convert this textual information and corresponding image descriptions into Document objects. After saving these Document objects to the knowledge base, they can become part of the RAG system, enabling the question-answering engine in subsequent processes to utilize this text information to answer user queries.

Importing table data in CSV format

When processing and parsing data, importing CSV files is a common requirement. The CSVLoader tool provided by LangChain can fulfill this need.

Importing data using CSVLoader

When loading CSV files, CSVLoader automatically generates page_content and metadata for each row of data. Among these, metadata includes the data source (source) and row number (row), which is very useful for subsequent data processing and querying.

Figure 1.12: Spreadsheet table with category names, descriptions, and power levels in numbers

The following code example demonstrates a simple operation process (refer to https://github.com/PacktPublishing/RAG-from-First-Principles for the complete code):

from langchain_community.document_loaders import CSVLoader
file_path = 'data/BlackMyth/BlackMythWukong.csv'
loader = CSVLoader(file_path=file_path)
data = loader.load()
for record in data[:2]:
    print(record)

page_content='Category: Equipment Name: Bronze Cloud Staff Description: A sturdy bronze staff that emits a sharp sound when swung, suitable for melee attacks. PowerLevel: 85'
metadata={'source': 'data/BlackMyth/BlackMythWukong.csv', 'row': 0}
page_content='Category: Equipment Name: Hundred Tricks Undergarment Armor Description: A finely crafted battle armor that provides strong defense and resists potent poison damage. PowerLevel: 90'
metadata={'source': 'data/BlackMyth/BlackMythWukong.csv', 'row': 1}

In this example, page_content contains the detailed contents of each row, while metadata provides the file path and row number of the data source, which is helpful for subsequent data querying and processing. It is important to note that the first row of the CSV file is considered the header row, and its contents are used as the field names by default, i.e., column names.
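
To make the row-to-record mapping concrete, here is a minimal sketch that uses only Python's csv module. It approximates the shape of what CSVLoader returns rather than reproducing the loader itself: the sample rows and the helper name rows_to_records are illustrative assumptions, and plain dictionaries stand in for Document objects.

```python
import csv
import io

# Illustrative sample data (assumption: same columns as the file above).
SAMPLE_CSV = """Category,Name,Description,PowerLevel
Equipment,Bronze Cloud Staff,A sturdy bronze staff,85
Skill,Flame Dance,Perform a fiery dance,92
"""

def rows_to_records(text, source):
    """Turn each CSV row into a dict with page_content and metadata."""
    records = []
    reader = csv.DictReader(io.StringIO(text))  # first line becomes the header
    for row_number, row in enumerate(reader):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        records.append({
            "page_content": content,
            "metadata": {"source": source, "row": row_number},
        })
    return records

records = rows_to_records(SAMPLE_CSV, "sample.csv")
for record in records:
    print(record)
```

Each record pairs the "column: value" text of one row with the file path and the zero-based row number.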

In the code example below, we specify some parameters of the CSV file through csv_args and use custom column names:

loader = CSVLoader(
    file_path=file_path,
    csv_args={
        'delimiter': ',',
        'quotechar': '\'',
        'fieldnames': ['Category', 'Name', 'Description', 'PowerLevel'],
    },
)
data = loader.load()
for record in data[:2]:
    print(record)

page_content='Category: Category Name: Name Description: Description PowerLevel: PowerLevel'
metadata={'source': 'data/black myth/black myth wukong.csv', 'row': 0}
page_content='Category: Equipment Name: Bronze Cloud Staff Description: A sturdy bronze staff that can produce a swooshing sound when swung, suitable for melee attacks. PowerLevel: 85'
metadata={'source': 'data/black myth/black myth wukong.csv', 'row': 1}

After this processing, the custom column names (Category, Name, and so on) replace the original field names in page_content. Because the field names are supplied explicitly, the loader treats the first line of the file as a data row rather than a header row, which is why row 0 in the output repeats the column names.
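
The header behavior can be reproduced with Python's csv.DictReader, which CSVLoader builds on. The sketch below is a standalone illustration with made-up sample data; skipping the first line by hand is one possible workaround, not something the loader does for you.

```python
import csv
import io

SAMPLE_CSV = "Category,Name\nEquipment,Bronze Cloud Staff\n"
FIELDNAMES = ["Category", "Name"]

# Without fieldnames: the first line is consumed as the header.
default_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# With fieldnames: every line, including the header line, is a data row.
custom_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV), fieldnames=FIELDNAMES))

# One way to drop the repeated header: skip the first line before reading.
stream = io.StringIO(SAMPLE_CSV)
next(stream)
skipped_rows = list(csv.DictReader(stream, fieldnames=FIELDNAMES))

print(len(default_rows), len(custom_rows), len(skipped_rows))
print(custom_rows[0])
```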

You can use a specific column from the CSV file to set the value for the metadata source. The content from this column will replace the default CSV filename and become the source identifier for each document entry. This is demonstrated in the following code example:

loader = CSVLoader(file_path=file_path, source_column='Name')
data = loader.load()
for record in data[:2]:
    print(record)

page_content='Category: Equipment Name: Bronze Cloud Rod Description: A sturdy bronze rod that produces a whistling sound when swung, suitable for close combat attacks. PowerLevel: 85'
metadata={'source': 'Bronze Cloud Rod', 'row': 0}
page_content='Category: Equipment Name: Hundred Tricks Coin Armor Description: A finely crafted combat armor that provides strong defense and resists powerful poison damage. PowerLevel: 90'
metadata={'source': 'Hundred Tricks Coin Armor', 'row': 1}

In this example, the source_column parameter specifies the Name column as the data source. Therefore, the source field in metadata takes the value of the corresponding Name column for each row. For example, for the first record, the source value is “Bronze Cloud Rod”; for the second record, it is “Hundred Tricks Coin Armor”.

This newly generated metadata information is very useful when querying specific items. For example, in a Q&A chain, if you want to query only records related to Hundred Tricks Coin Armor, you can filter by the source field.
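
Filtering on the source field can be sketched as follows. Plain dictionaries stand in for Document objects here, and the helper name filter_by_source is hypothetical; with real Document objects the same comparison would read doc.metadata["source"].

```python
# Plain dictionaries stand in for LangChain Document objects (assumption).
records = [
    {"page_content": "Name: Bronze Cloud Rod\nPowerLevel: 85",
     "metadata": {"source": "Bronze Cloud Rod", "row": 0}},
    {"page_content": "Name: Hundred Tricks Coin Armor\nPowerLevel: 90",
     "metadata": {"source": "Hundred Tricks Coin Armor", "row": 1}},
]

def filter_by_source(records, source):
    """Keep only the records whose metadata source equals the given value."""
    return [r for r in records if r["metadata"]["source"] == source]

matches = filter_by_source(records, "Hundred Tricks Coin Armor")
print(matches)
```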

Comparing CSVLoader and UnstructuredCSVLoader

Alex: Lewis, you mentioned earlier that the Unstructured tool can load almost all types of files. Can we compare the results of CSVLoader and UnstructuredCSVLoader?

Lewis: Of course. The following code example shows how to use UnstructuredCSVLoader to load data from a specified path and print it out:

from langchain_community.document_loaders import UnstructuredCSVLoader
loader = UnstructuredCSVLoader(file_path=file_path)
data = loader.load()
print(data)

[Document(metadata={'source': 'data/black myth/black myth wukong.csv'}, page_content='\n\nCategory\nName\nDescription\nPowerLevel\n\nEquipment\nBronze Cloud Staff\nA sturdy bronze staff that makes a whistling sound when swung, suitable for close combat.\n85\n\nEquipment\nHundred Show Lining Armor\nAn exquisite battle armor that provides strong defense and resists poisonous damage.\n90\n\nSkill\nHeavenly Thunder Strike\nSummon heavenly thunder to attack enemies, causing a wide range of lightning damage.\n95\n\nSkill\nFlame Dance\nPerform a fiery dance, surrounding enemies in searing flames.\n92\n\nCharacter\nWukong\nThe protagonist, possesses the abilities of seventy-two transformations and riding clouds and mist, upholding justice.\n100\n\nCharacter\nSilver Horn King\nOne of the powerful demon kings, skilled at wielding various magical artifacts, with extremely high combat power.\n88\n\n')]

Alex: For CSV files, LangChain’s CSVLoader is more practical than UnstructuredCSVLoader because it preserves the document structure better. Each row becomes an independent Document object, and the metadata retains the row number (row), which can serve as a “data source index” during retrieval. Of course, if your task requires treating the entire CSV file as a single text block, that’s a different story.

Lewis: Yes, preserving as much of the original document’s structural information as possible is a constant goal in data ingestion for RAG systems, and it is also a challenge. Row numbers in CSV files, headings and hierarchy in Markdown files, and the placement of images on PDF pages that mix text and images are all factors to consider during the data ingestion process.

Alex: So, Lewis, if I use DirectoryLoader to load multiple types of documents at once, and I want to use the default loader for files like PDFs, but use CSVLoader for CSV files, how should I do it?

Lewis: That’s also very simple. You can refer to the following code to do it (refer to https://github.com/PacktPublishing/RAG-from-First-Principles for the complete code):

from langchain_community.document_loaders import CSVLoader, DirectoryLoader
loader = DirectoryLoader(
    path='data/Black Myth',
    glob='**/*.csv',       # Pattern to match all CSV files
    loader_cls=CSVLoader,  # Specify using CSVLoader for matched files
)
docs = loader.load()

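After this setup, DirectoryLoader uses CSVLoader for every CSV file under the specified directory instead of its default loader, UnstructuredFileLoader. Because the glob pattern matches only CSV files, this loader does not pick up other file types such as PDFs; to keep the default loader for them, create a second DirectoryLoader with a different glob pattern and combine the results.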
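
Routing files to different loaders by type comes down to a lookup on the file suffix. The sketch below shows only that routing logic with the standard library: loader names are plain strings so the example runs without LangChain installed, and the names choose_loader and plan_directory, as well as the temporary files, are hypothetical.

```python
import tempfile
from pathlib import Path

# Suffix-specific loaders; anything else falls back to the default.
LOADER_BY_SUFFIX = {".csv": "CSVLoader"}
DEFAULT_LOADER = "UnstructuredFileLoader"

def choose_loader(path):
    """Return the loader name to use for a given file path."""
    return LOADER_BY_SUFFIX.get(Path(path).suffix.lower(), DEFAULT_LOADER)

def plan_directory(directory):
    """Map every file under the directory to the loader that would handle it."""
    plan = {}
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            plan[path.name] = choose_loader(path)
    return plan

with tempfile.TemporaryDirectory() as tmp:
    for name in ("items.csv", "intro.pdf", "notes.md"):
        (Path(tmp) / name).write_text("placeholder")
    plan = plan_directory(tmp)

print(plan)
```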

Crawling and parsing web documents

In this section, we will explore how to crawl web pages and convert them into LangChain’s Document objects. Web content not only includes textual information but also images and other multimedia elements, which are usually encoded in HTML format and may contain links to other pages or resources.

LangChain provides various web document loaders to accommodate different application scenarios. These loaders are described in Table 1.1:

Document Loader	Description	Package/API	Features
WebBaseLoader	Loads and parses web pages using urllib and BeautifulSoup	Package	Easy to operate, suitable for basic web content crawling
UnstructuredLoader	Loads and parses web pages using the Unstructured tool	Package	Supports complex web structures, suitable for handling heterogeneous content
RecursiveURLLoader	Recursively crawls all sub-links starting from a root URL	Package	Automates the link-crawling process; suitable for large-scale website data collection
SitemapLoader	Crawls all web pages based on the provided sitemap	Package	Efficiently parses website structure and quickly obtains all web content
Firecrawl	Provides a locally deployed API service; hosted version offers a free quota	API	Flexible and scalable, suitable for applications requiring real-time crawling and conversion

Table 1.1: Various document loaders and their descriptions

In the following subsections, we will focus on the specific implementations of WebBaseLoader and UnstructuredLoader.

Parse web pages with WebBaseLoader

You can use WebBaseLoader to quickly load web page files and generate a Document object for each page containing “flattened” string content.

First, you need to install the beautifulsoup4 library:

pip install beautifulsoup4

The following code example demonstrates how to load the Wikipedia page for Black Myth: Wukong into a Document object:

import bs4
from langchain_community.document_loaders import WebBaseLoader
page_url = 'https://zh.wikipedia.org/wiki/black myth,Wukong'
loader = WebBaseLoader(web_paths=[page_url])
docs = loader.load()
print(f"{docs[0].metadata}\n")
print(docs[0].page_content.strip())

The preceding code will extract the text and produce the following output:

{'source': 'https://zh.wikipedia.org/wiki/black myth,Wukong', 'title': 'black myth,Wukong - Wikipedia, the free encyclopedia', 'language': 'zh'}

Black Myth: Wukong - Wikipedia, the free encyclopedia Skip to content Main menu Main menu Move to sidebar Hide navigation Home Category Index Featured Content News......Table of Contents Move to sidebar Hide Preface 1 Gameplay 2 Plot 2.1 Setting 2.2 Story 2.2.1 Prologue 2.2.2 Seeking Root......

The preceding method extracts the complete text of the page, but the result may include unnecessary content such as page titles or navigation bars. If you are familiar with the HTML structure of the web page, you can tell BeautifulSoup which elements to keep (for example, by id or class name) to filter out unwanted content.

The following code example demonstrates how to parse and extract only the main body of the web page content:

loader = WebBaseLoader(
    web_paths=[page_url],
    # Only parse the main part of the webpage content
    bs_kwargs={'parse_only': bs4.SoupStrainer(id='bodyContent')},
    bs_get_text_kwargs={'separator': ' | ', 'strip': True},
)
docs = loader.load()
print(f"{docs[0].metadata}\n")
print(docs[0].page_content)

Here’s how the extracted web page content looks in the output:

{'source': 'https://zh.wikipedia.org/wiki/black myth,Wukong'}
Wikipedia, The Free Encyclopedia | Black Myth: Wukong | Genre | Role-playing | | Platform | Microsoft Windows | PlayStation 5 | Xbox Series X/S | Developer | Game Science | ...

Here, parse_only=bs4.SoupStrainer(id="bodyContent") restricts parsing to the HTML element whose id is bodyContent. This element usually holds the main content section of the webpage, that is, the core information of the article or page, and excludes navigation bars, footers, and other auxiliary elements.

This produces a cleaner result: link text such as Skip to content, Main menu, Move to sidebar, Hide navigation, Home, Category Index, and Featured Content is filtered out, and the output focuses directly on the main body of the article.
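
The idea of keeping only the text inside one element can be illustrated without BeautifulSoup. The sketch below swaps in Python's built-in html.parser as a stand-in for SoupStrainer; the class name IdTextExtractor and the sample HTML are assumptions made for illustration, and the approach assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class IdTextExtractor(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # nesting depth inside the target (0 means outside)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)

SAMPLE_HTML = """
<html><body>
<nav>Main menu | Home</nav>
<div id="bodyContent"><h1>Black Myth: Wukong</h1><p>Genre<br>Role-playing</p></div>
<footer>Privacy policy</footer>
</body></html>
"""

parser = IdTextExtractor("bodyContent")
parser.feed(SAMPLE_HTML)
text = " | ".join(parser.chunks)
print(text)
```

As with the separator setting in bs_get_text_kwargs, the extracted fragments are joined with " | ", and the navigation and footer text is left out.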

Granular webpage parsing with UnstructuredLoader

If you need more granular control over the content, you can choose a more advanced parsing method, such as UnstructuredLoader. This method suits scenarios that require precise indexing of specific webpage content. After processing, each webpage yields multiple Document objects, each representing a different structure on the page, such as a title, main text, a list, or a table.

First, ensure that the langchain-unstructured interface package is installed. Here, we use the Unstructured package via local invocation (later content will also show how to call the Unstructured tool via API):

pip install 'langchain-unstructured[local]'

The following code example demonstrates how to use the Unstructured tool to load the same webpage:

from langchain_unstructured import UnstructuredLoader
page_url = 'https://zh.wikipedia.org/wiki/black myth,Wukong'
loader = UnstructuredLoader(web_url=page_url)
docs = loader.load()
for doc in docs[:5]:
    print(f"{doc.metadata['category']}: {doc.page_content}")

Title: Black Myth: Wukong
ListItem: [language name in Arabic script; shown as an image in the source]
ListItem: [language name in Arabic script; shown as an image in the source]
ListItem: Azerbaijani
ListItem: Belarusian (Taraškievica)
...

Each Document object output here represents one element of the page. The metadata includes the element’s category, such as title or main text.

Tips from Lewis

In this section, we learned that the Unstructured tool is capable of analyzing various unstructured elements in files and parsing them into Element data objects.

With the help of LangChain’s UnstructuredLoader, we can further convert these Element data objects into Document objects.

The parsed page elements may have parent-child relationships. For example, a paragraph might belong to a specific heading or table (where the category is Title or Table). You can extract and combine these page elements using the following code:

from langchain_unstructured import UnstructuredLoader
from typing import List, Tuple, Union
from langchain_core.documents import Document
page_url = 'https://zh.wikipedia.org/wiki/black myth,Wukong'
def _get_setup_docs_from_url(url: str) -> List[Union[Document, Tuple[Document, Document]]]:
    loader = UnstructuredLoader(web_url=url)
    setup_docs = []
    parent_id = None       # element_id of the most recent Title/Table element
    current_parent = None  # The most recent Title/Table element itself
    for doc in loader.load():
        if doc.metadata['category'] in ('Title', 'Table'):
            parent_id = doc.metadata['element_id']
            current_parent = doc  # Update the current parent element
            setup_docs.append(doc)
        elif parent_id is not None and doc.metadata.get('parent_id') == parent_id:
            setup_docs.append((current_parent, doc))  # Store the parent and child elements together
    return setup_docs
docs = _get_setup_docs_from_url(page_url)
for item in docs:
    if isinstance(item, tuple):
        parent, child = item
        print(f"Parent element - {parent.metadata['category']}: {parent.page_content}")
        print(f"Child element - {child.metadata['category']}: {child.page_content}")
    else:
        print(f"{item.metadata['category']}: {item.page_content}")

In the preceding code, the current_parent variable is used to store the current parent element. When a child element is encountered, it will be stored together with the current parent element as a tuple. During output, it checks whether the element is a tuple; if so, it prints the parent and child elements separately. This ensures that each child element and its corresponding parent element can be clearly displayed.

Markdown file titles and structure

Alex, by now you may have noticed that we emphasize preserving the original structure of a document after loading it. Indeed, the inherent format of these documents (such as row numbers in CSV files or the element hierarchy in HTML files) carries structural or relational information that can play an important role in the indexing, retrieval, and generation processes of a RAG system.

Why Markdown?

The Markdown documents we are about to discuss are an extremely important file type when building a RAG system. There are several reasons behind the approach of unifying source data into Markdown format:

Lightweight and easy to read and parse: Markdown is a lightweight markup language. Compared with more complex markup languages like HTML or XML, Markdown syntax is much simpler and clearer, making it easier to parse both manually and automatically. This is beneficial for preprocessing, splitting, and summarizing documents, and the subsequent steps of feature extraction and index building.
Its style is close to the training data used by large models: Most large models (such as ChatGPT and DeepSeek) have already been exposed to a large amount of Markdown-formatted text (including GitHub READMEs, technical documents, blog articles, etc.) during training. This means that when facing Markdown content, these large models can extract useful information more effectively and generate more natural and appropriate responses.
Retains the hierarchical structure and basic formatting information of the text: Markdown can preserve titles, paragraphs, lists, tables, code blocks, and other structural information in a relatively simple way. This capability helps large models in RAG systems understand the logical hierarchy and semantic partitioning of the text, thereby enabling better referencing and organization of information when answering questions.
Unifies and simplifies data formats: Since different data sources (such as HTML, PDF, CSV tables, database texts) have significant differences in format and structure and may include complex HTML tags or different encoding methods, converting all data to Markdown can, to a certain extent, achieve format unification and simplify subsequent processing steps.
Convenient for subsequent presentation: In the final output, the RAG system can directly produce text in Markdown format, allowing answers on the front-end interface (such as chat windows) to have both good readability and visual effect, without the need for additional format conversion.

Therefore, Markdown format is not only beneficial for data preprocessing and for large models’ understanding and parsing of data, but also convenient for presenting information clearly. It is worth noting that Markdown files also carry hierarchical structure: each heading has corresponding content underneath it, and the two should not be separated. This means that, during parsing, it is important to preserve the “heading plus the text under it” structure.
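
Keeping each heading together with the text under it can be sketched with a few lines of standard-library Python. This is a simplified illustration, not the method used by any particular loader: the function name split_by_heading and the sample text are assumptions, and the sketch only recognizes "#"-style headings (it does not skip "#" lines inside code blocks).

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_by_heading(markdown_text):
    """Return sections that each keep a heading together with its body lines."""
    sections = []
    current = {"level": 0, "heading": "", "body": []}
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the previous section, if it has any content.
            if current["heading"] or current["body"]:
                sections.append(current)
            current = {"level": len(match.group(1)),
                       "heading": match.group(2).strip(),
                       "body": []}
        elif line.strip():
            current["body"].append(line.strip())
    if current["heading"] or current["body"]:
        sections.append(current)
    return sections

# Illustrative sample text (assumption).
SAMPLE = """# Game Version Introduction
## Digital Standard Edition
- Includes the base game
## Digital Deluxe Edition
- Includes the base game
- Includes extra digital items
"""

sections = split_by_heading(SAMPLE)
for section in sections:
    print(section)
```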

Figure 1.13: A screenshot shows a Markdown editor with game introduction notes and their preview

Implementing UnstructuredMarkdownLoader

Next, let’s look at UnstructuredMarkdownLoader in detail. In its default mode, UnstructuredMarkdownLoader loads the entire Markdown file as a single Document object. The returned list therefore contains only one Document, whose page_content attribute holds the full text of the file. This mode suits documents that are short or do not need further subdivision, because the content can be read and processed as a whole. Here’s how to use this loader (refer to https://github.com/PacktPublishing/RAG-from-First-Principles for the complete code):

from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_core.documents import Document
markdown_path = "data/black myth/Black Myth Version Introduction.md"
loader = UnstructuredMarkdownLoader(markdown_path)
data = loader.load()
print(data[0].page_content)

The output is as follows:

Game Version Introduction
Digital Standard Edition
Includes the base game
Digital Deluxe Edition

When mode="elements" is set, UnstructuredMarkdownLoader parses the Markdown file into multiple elements. As shown in the following code, each element is treated as an independent Document object representing a separate content block, such as a heading, paragraph, or list item. This allows finer-grained handling of the document content and makes indexing and retrieval easier.

loader = UnstructuredMarkdownLoader(markdown_path, mode="elements")
data = loader.load()
print(f"Number of documents: {len(data)}\n")
for document in data:
    print(f"{document}\n")

The following output shows the result after the parsing operation:

Number of documents: 22
page_content='Black Myth: Wukong ' metadata={'source': 'data/black myth/Black Myth Version Introduction.md', 'category_depth': 0, 'languages': ['zho'], 'file_directory': 'data/black myth', 'filename': 'Black Myth Version Introduction.md', 'filetype': 'text/markdown', 'last_modified': '2024-11-26T12:15:59', 'category': 'Title', 'element_id': 'b89add9386b58a1638e0b96d19f08d0d'}
page_content='Black Myth: Wukong is a highly anticipated action-adventure game developed by a Chinese game development team. Inspired by Journey to the West, it reinterprets the classic story, delivering an impactful visual and gaming experience.' metadata={'source': 'data/black myth/Black Myth Version Introduction.md', 'languages': ['zho'], 'file_directory': 'data/black myth', 'filename': 'Black Myth Version Introduction.md', 'filetype': 'text/markdown', 'last_modified': '2024-11-26T12:15:59', 'parent_id': 'b89add9386b58a1638e0b96d19f08d0d', 'category': 'UncategorizedText', 'element_id': '4d1fd58a257960aafb046fc47605c217'}

When parsing complex documents, the category field in the metadata (such as Title) helps to understand the document structure and provides meaningful context. For example, a title usually indicates the topic or classification of the following content. Therefore, treating the title as a separate Document object helps to locate and organize content for subsequent retrieval or analysis. For instance, in a RAG system, content from specific sections can be filtered according to the title, enabling more precise answers to users’ questions. In addition, through the parent_id field in the metadata, it’s possible to further determine which elements belong to a particular title, thus organizing relevant content as a unified text block.
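
Grouping elements under their title through parent_id can be sketched as follows. Simplified dictionaries stand in for the Document metadata shown above; the short ids, the sample text, and the function name group_by_title are illustrative assumptions.

```python
from collections import defaultdict

# Simplified element records (assumption: only the fields used here are kept).
elements = [
    {"element_id": "t1", "category": "Title", "text": "Black Myth: Wukong"},
    {"element_id": "p1", "category": "UncategorizedText", "parent_id": "t1",
     "text": "An action game inspired by Journey to the West."},
    {"element_id": "t2", "category": "Title", "text": "Digital Standard Edition"},
    {"element_id": "l1", "category": "ListItem", "parent_id": "t2",
     "text": "Includes the base game"},
]

def group_by_title(elements):
    """Map each title's text to the combined text of its child elements."""
    titles = {e["element_id"]: e["text"]
              for e in elements if e["category"] == "Title"}
    children = defaultdict(list)
    for element in elements:
        parent_id = element.get("parent_id")
        if parent_id in titles:
            children[parent_id].append(element["text"])
    return {titles[tid]: "\n".join(children[tid]) for tid in titles}

blocks = group_by_title(elements)
for title, body in blocks.items():
    print(f"{title} -> {body}")
```

Each resulting block keeps a title and its content together, which is the unit one would typically index or retrieve.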

Text formats, layout recognition, and table parsing in PDF files

For most RAG systems, parsing PDF files is a key step in building the system. PDF files not only contain textual information but may also include elements such as tables and images, making their parsing more challenging compared to other types of documents.

Currently, common parsing methods for handling PDF files can be roughly divided into three categories: rule-based parsing, deep learning-based parsing, and multimodal large model-based parsing.

PDF parsers that use these methods may perform the following operations:

Recombine separated text boxes into logical units, such as lines or paragraphs, through heuristic methods or machine learning techniques
Apply OCR technology to images in the file to recognize and extract the text within them
Classify textual content to determine whether it belongs to paragraphs, lists, tables, or other structures
Organize the extracted text into table formats, or present data in key-value pairs
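
As an illustration of the first operation in the list above, the following sketch regroups scattered text boxes into lines with a simple heuristic. It is a toy example under stated assumptions: boxes are (x, y, text) tuples, y grows down the page, and boxes whose y value lies within a tolerance of the first box of a line belong to that line. Real parsers use considerably more elaborate rules.

```python
def group_into_lines(boxes, y_tolerance=3.0):
    """Group (x, y, text) boxes into text lines, top to bottom, left to right."""
    lines = []
    for x, y, text in sorted(boxes, key=lambda box: (box[1], box[0])):
        if lines and abs(y - lines[-1]["y"]) <= y_tolerance:
            lines[-1]["items"].append((x, text))   # same visual line
        else:
            lines.append({"y": y, "items": [(x, text)]})  # start a new line
    # Within each line, order the fragments by their x position.
    return [" ".join(text for _, text in sorted(line["items"])) for line in lines]

# Hypothetical boxes, deliberately out of order.
boxes = [
    (120, 100.5, "Myth:"),
    (50, 100.0, "Black"),
    (200, 101.0, "Wukong"),
    (50, 130.0, "Prologue"),
]

lines = group_into_lines(boxes)
print(lines)
```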

Tips from Lewis

Many modern large models now support multimodal input and can directly process multimedia files such as images and PDFs. We already provided an example earlier.

In some application scenarios, especially those requiring question answering and analysis of PDF documents with complex layouts, charts, or illustrations, the PDF documents can be passed directly to the large model for understanding, without first converting them into a simpler format.

Using loading tools for PDF files

LangChain can integrate with a variety of PDF parsers. Some are simple and relatively basic, suitable for lightweight text parsing, while others support OCR, mathematical formula processing, and image analysis, or can perform advanced document layout analysis. Table 1.2 summarizes these parsers:

Parser	Description	Package/API	Features
PyPDF	Loads and parses PDF files using `pypdf`	Package	Efficient and lightweight, suitable for handling simple PDF files
Unstructured	Loads PDF files using the open-source Unstructured tool library	Package/API	Supports multiple document formats, has content extraction and analysis capabilities
Amazon Textract	Loads PDF files via AWS API	API	Provides cloud service support, suitable for large-scale document OCR processing
Mathpix	Loads and parses PDF files using Mathpix	API	Specially designed for mathematical formulas, can accurately parse complex content
PDFPlumber	Loads PDF files using PDFPlumber	Package	Provides rich PDF content control and processing functions
PyPDFDirectory	Loads PDF files in directories	Package	Supports batch loading, convenient for processing multiple PDF files
PyPDFium2	Loads PDF files using PyPDFium2	Package	Efficient parsing, supports rendering and conversion of PDF pages
PyMuPDF	Loads PDF files using PyMuPDF	Package	Speed-optimized, supports fine-grained processing of complex PDF files
PDFMiner	Loads PDF files using PDFMiner	Package	Suitable for text extraction, especially adept at handling PDFs containing embedded text

Table 1.2: Parser functions and their features

The difficulty of deploying and using these tools can be analyzed from several perspectives, such as the following:

Local deployment type: Tools such as PyPDF, PDFPlumber, and PDFMiner are all Python libraries, so their installation and use are relatively simple. Such tools usually only require installation via pip or other package managers to get started quickly, making them suitable for users who want to avoid complicated configuration processes.
API service type: Amazon Textract and MathPix belong to this category and require applying for an API Key, often involving paid usage. Although these services offer powerful functions, such as batch document processing and mathematical formula parsing, their usage threshold is relatively high.
Hybrid type: Unstructured, as an open-source library, can be used directly for its basic functions, but fully utilizing all its features may require additional service support.

From the perspective of functional characteristics, PyPDF is a lightweight tool that provides basic PDF text extraction; PDFPlumber excels at processing table data and shows strong capability in layout analysis; PyMuPDF offers comprehensive functions, supporting PDF rendering, editing, and fine-grained processing of complex documents; Amazon Textract has OCR capabilities and is especially suitable for scanned documents; MathPix is designed for mathematical formula recognition; PDFMiner has very powerful underlying parsing capabilities, being able to precisely locate text positions.

From the performance perspective, PyPDF is known for its fast processing speed but performs only moderately in terms of accuracy; PyMuPDF excels in both performance and accuracy; Unstructured performs well in handling complex layouts, and thus has been chosen as the default loader for LangChain’s DirectoryLoader. Its API version also provides high-accuracy parsing services, though network conditions may affect processing speed.

From the perspective of application scenarios, for simple document text parsing, PyPDF is sufficient; if you need to process table data in PDF files, PDFPlumber is recommended; when faced with complexly formatted PDF files, PyMuPDF or Unstructured are better choices; for dealing with scanned documents, Amazon Textract is an ideal choice; for math PDF documents with many formulas, MathPix is recommended.

Overall, if you need to process PDF files in bulk at scale, PyMuPDF stands out as particularly well balanced due to its comprehensive features and high efficiency.
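The scenario-based recommendations above can be condensed into a small lookup helper. This is only a sketch: the function name suggest_pdf_parser and its boolean flags are hypothetical, and the function simply encodes the guidance given in this section rather than any LangChain API.

```python
def suggest_pdf_parser(scanned=False, formulas=False, tables=False, complex_layout=False):
    """Return the parser names this section suggests for a given kind of PDF.

    The flags are illustrative assumptions. They are checked from the most
    specialized need (OCR) down to the most general one (plain text).
    """
    if scanned:
        return ["Amazon Textract"]          # scanned documents need OCR
    if formulas:
        return ["MathPix"]                  # formula-heavy documents
    if tables:
        return ["PDFPlumber"]               # table data
    if complex_layout:
        return ["PyMuPDF", "Unstructured"]  # complexly formatted files
    return ["PyPDF"]                        # simple text extraction


print(suggest_pdf_parser())                     # ['PyPDF']
print(suggest_pdf_parser(tables=True))          # ['PDFPlumber']
print(suggest_pdf_parser(complex_layout=True))  # ['PyMuPDF', 'Unstructured']
```

In a real project, the order of the checks would depend on which requirement matters most for your documents.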

PDF parsing is a broad topic. Next, we explore some basic approaches. Once you have worked through the implementation details of each tool, using them will no longer be difficult.

Simple text extraction with PyPDFLoader

If you only need to extract the embedded text of a PDF file as plain strings, you can use the PyPDFLoader class. Its load() method returns a list of Document objects, one per page, and the extracted text is stored in the page_content attribute of each Document object.

Figure 1.14: Two traditional statues are displayed side by side each depicting a robed figure with unique features

The following code example demonstrates how to install the PyPDF tool:

pip install pypdf

PyPDFLoader does not parse images or scanned PDF pages (that is, it does not support OCR). The following code loads a PDF document and prints the text of each page:

from langchain_community.document_loaders import PyPDFLoader
file_path = "data/black myth/ Kang Jinlong and Lou Jingou.pdf"
loader = PyPDFLoader(file_path)
pages = loader.load()
print(f"Loaded {len(pages)} page(s) of PDF document")
for page in pages:
    print(page.page_content)

The preceding code returns standard text content:

Loaded 1 page(s) of PDF document
Some characters in the game, such as Kang Jinlong (left) and Lou Jingou (right), draw their inspiration from the painted sculptures at the Jade Emperor Temple in Jincheng, Shanxi. The character Kang Jinlong appears in both human and dragon form, serving as a boss enemy.

In the next section, we will see how to convert PDF documents to Markdown format.

Using the Marker tool to convert PDF documents to Markdown format

When dealing with PDF documents containing structured content, such as hierarchical headings used to organize and logically structure the content, the ideal approach is to preserve this hierarchy while parsing the PDF document. For such needs, converting PDF documents to Markdown format is a good choice. Standardizing all types of text to Markdown format helps simplify subsequent processing and analysis steps.

Figure 1.15: Wikipedia page about Yungang Grottoes with a photo of large Buddha statues carved in rock

In the process of converting PDF documents to Markdown format, at a minimum, the following key elements should be preserved:

Markdown heading hierarchy: In the document, headings are used to organize the content of different sections. This hierarchy not only helps learners navigate quickly, but also enhances the comprehensibility, overall readability, and organization of the document.
Image and text structure: Given that many PDF documents contain charts or multi-column layouts to further explain textual content, it is necessary to parse images during the conversion process and save them in a dedicated image directory, then embed them in the Markdown file. Meanwhile, appropriate formatting should be used to retain the presentation of tabular data, ensuring the accuracy and completeness of information delivery.
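To see why a preserved heading hierarchy is useful downstream, the following sketch splits Markdown text into sections and records the heading path of each one. It is a minimal illustration, not part of Marker or LangChain: the function name split_markdown_by_headings and the sample text are hypothetical, and the code does not handle "#" characters inside code blocks.

```python
import re


def split_markdown_by_headings(markdown_text):
    """Split Markdown into sections, recording each section's heading path."""
    sections = []
    path = []  # current stack of headings, outermost first
    body = []  # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"path": list(path), "text": text})
        body.clear()

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return sections


sample = """# Yungang Grottoes
Intro text.
## History
History text.
## Conservation
Conservation text.
"""
for section in split_markdown_by_headings(sample):
    print(" > ".join(section["path"]), "->", section["text"])
```

Each section carries its full heading path, so a chunk taken from a subsection still records which chapter it belongs to.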

The following figure shows a document about Yungang Grottoes listing UNESCO site details in a table:

Figure 1.16: Screenshot of a document about Yungang Grottoes listing UNESCO site details in a table

Lewis: Although there are many open-source or commercial tools for converting PDF to Markdown, I personally recommend Marker, a tool with a very good user experience (another highly regarded tool is Docling). Marker can effectively remove headers, footers, and other irrelevant content, supports formatting of tables and code blocks, and converts most formulas into LaTeX format, which is especially useful for scientific papers. In addition, it can accurately extract images. The core of Marker comprises a series of deep learning models designed for text extraction, OCR, page layout detection, and formatting cleanup. Marker selects the most suitable model according to the format of the PDF file being parsed, balancing parsing speed and accuracy.

Alex: Lewis, this is the second type of PDF parsing method you mentioned at the beginning of this section – Deep Learning-based parsing.

The following code example demonstrates how to use Marker to parse PDF documents.

First, install Marker with the following command:

pip install marker-pdf

Next, you can directly use the command line to parse PDF files:

marker_single "data/Shanxi Cultural Tourism/Yungang Grottoes-en.pdf"

In addition, you can also use the following code example to parse PDF files (refer to https://github.com/PacktPublishing/RAG-from-First-Principles for the complete code):

import os # Import os library
import subprocess # Import subprocess library
def convert_pdf_to_markdown(input_pdf_path, output_folder, batch_multiplier=2, max_pages=12):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    command = [
        'marker_single',
        input_pdf_path,
        output_folder,
        f'--batch_multiplier={batch_multiplier}',
        f'--max_pages={max_pages}'
    ]
    try:
        subprocess.run(command, check=True)
        print(f"PDF document successfully converted to Markdown format, files saved to {output_folder}")
    except subprocess.CalledProcessError as e:
        print(f"PDF document conversion failed: {e}")
if __name__ == "__main__":
    input_pdf_path = "data/Shanxi Cultural Tourism/Yungang Grottoes-en.pdf"
    output_folder = "data/marker/output/Yungang Grottoes-en"
    convert_pdf_to_markdown(input_pdf_path, output_folder)

The parsing operation returns the following output:

Loaded detection model vikp/surya_det3 on device cuda with dtype torch.float16
Loaded detection model vikp/surya_layout3 on device cuda with dtype torch.float16
Loaded reading order model vikp/surya_order on device cuda with dtype torch.float16
Loaded recognition model vikp/surya_rec2 on device cuda with dtype torch.float16
Loaded texify model to cuda with torch.float16 dtype
Loaded recognition model vikp/surya_tablerec on device cuda with dtype torch.float16
Detecting bboxes: 100%|█████████████| 1/1 [00:01<00:00,  1.10s/it]
Recognizing Text: 100%|█████████████| 4/4 [03:18<00:00, 49.75s/it]
Detecting bboxes: 100%|█████████████| 1/1 [00:02<00:00,  2.16s/it]
Finding reading order: 100%|██████████| 1/1 [00:00<00:00,  2.05it/s]
Recognizing tables: 100%|████████████| 1/1 [00:07<00:00,  7.89s/it]
Saved markdown to the data/marker/output/Yungang-Grottoes-en folder
Total time: 238.9289002418518
PDF document successfully converted to Markdown format, files saved to data/marker/output/Yungang Grottoes-en

After running the program, a Markdown file will be generated, along with a series of parsed PNG image files and a JSON file containing metadata information.

Figure 1.17: A file explorer shows two folders with images and documents, mainly PNGs and a JSON file

Open the Yungang Grottoes-en.md file, and you can see the expected content.

Figure 1.18: Screenshot compares Markdown code on the left with its formatted preview of the Yungang Grottoes on the right

Marker automatically converts PDF documents to Markdown format while preserving the formatting information of the original PDF document. It also provides configuration options, including a batch multiplier (batch_multiplier) and a page limit (max_pages), that let users adjust performance and resource usage according to their needs.

This Markdown-format document can not only be read as a Document object by frameworks such as LangChain or LlamaIndex, but also serve as the original material for RAG-based knowledge bases.

Structured parsing with UnstructuredLoader

In the previous section, we explored how to use Marker to parse PDF documents into Markdown format. However, in some cases, simply relying on this conversion method may not meet all requirements. For example, when it’s necessary to segment the text at a finer granularity (such as dividing by paragraphs, headings, or table structures), or to extract text from images containing text, more detailed approaches are required.

As we know, the UnstructuredLoader provided by LangChain returns a list composed of Document objects. Each Document object represents an independent structure or element on the page and contains rich metadata, which greatly facilitates subsequent document analysis and processing.

The following figure shows a JSON file containing the description of the Yungang Grottoes’ coordinates and layout info:

Figure 1.19: A screenshot of JSON data describing the Yungang Grottoes coordinates and layout info

In such a data structure, not only is basic metadata (such as page number and text content) preserved, but also detailed layout information (such as the coordinates and types of elements):

Basic metadata: Includes page_number (page number information), category (used to distinguish different types of content, such as “NarrativeText” for narrative text), content (text content, for example, basic information describing the Yungang Grottoes), element_id (the unique identifier of the element), and parent_id (the identifier of the parent element, which helps to understand the hierarchical structure of the document and facilitates the structured processing of document content).
Layout information (coordinates): Includes the coordinates of the top-left, bottom-left, top-right, and bottom-right corners, the coordinate system type, and page width and height information. With these coordinates, precise layout analysis can be performed to determine the exact position of the content, extract content from specific areas, or reconstruct the visual layout of the document. This supports location-based content filtering, sorting, and layout reconstruction and rearrangement.
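The location-based filtering and sorting mentioned above can be sketched with plain dictionaries. The helper names bounding_box and elements_in_region are hypothetical, the element dictionaries only imitate the shape of the coordinates metadata, and the corner points are derived from the positions and sizes shown in the page 1 layout output later in this section.

```python
def bounding_box(points):
    """Return (x_min, y_min, x_max, y_max) for a list of (x, y) corner points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def elements_in_region(elements, region):
    """Keep elements whose box lies fully inside region, ordered top to bottom.

    region is (x_min, y_min, x_max, y_max) in the same coordinate system
    as the element points.
    """
    rx0, ry0, rx1, ry1 = region
    hits = []
    for element in elements:
        x0, y0, x1, y1 = bounding_box(element["coordinates"]["points"])
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1:
            hits.append(element)
    return sorted(hits, key=lambda e: bounding_box(e["coordinates"]["points"])[1])


elements = [
    {"category": "Title",
     "coordinates": {"points": [(1120, 411), (1120, 454), (1446, 454), (1446, 411)]}},
    {"category": "Image",
     "coordinates": {"points": [(98, 104), (98, 246), (525, 246), (525, 104)]}},
    {"category": "Header",
     "coordinates": {"points": [(827, 41), (827, 71), (1131, 71), (1131, 41)]}},
]

# Select everything in the top 300 pixels of a 1700-pixel-wide page
top_band = elements_in_region(elements, (0, 0, 1700, 300))
print([e["category"] for e in top_band])  # ['Header', 'Image']
```

The same idea extends to cropping a column, a sidebar, or a footer band before passing content on for embedding.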

A sufficiently capable large model may be able to use this metadata to reconstruct the layout of a PDF page automatically, which can improve the efficiency and accuracy of document processing.

Alex: Lewis, I don’t understand this part. Since we have to restore it sooner or later, why are we putting so much effort into breaking down the PDF document?

Lewis: This means you haven’t truly understood the RAG system. The purpose of breaking it down is to make the subsequent vectorization process more refined; only by splitting the document into independent elements can we accurately retrieve information in the RAG system according to the user’s query. For example, when a user asks what year the Yungang Grottoes were built, we need to quickly locate the relevant document fragment, rather than passing the entire PDF document to the large language model. That way, it would not only be less accurate but also waste token resources. The restoration during the generation process is for ultimately presenting a complete, visually rich answer to the user.
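Lewis's point can be illustrated with a toy retrieval step. The sketch below is not how a production RAG system retrieves: it uses simple word overlap as a stand-in for vector similarity, and the helper names tokenize and top_fragments are hypothetical. The fragments reuse text that appears elsewhere in this section.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_fragments(query, fragments, k=1):
    """Rank fragments by the number of words they share with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(f)), f) for f in fragments]
    scored = [item for item in scored if item[0] > 0]  # drop unrelated fragments
    scored.sort(key=lambda item: item[0], reverse=True)
    return [fragment for _, fragment in scored[:k]]


fragments = [
    "Yungang Grottoes - Wikipedia",
    "Deterioration and Conservation",
    "Cave 6 is one of the richest of the Yungang sites. "
    "It was constructed between 465 and 494 C.E.",
]
query = "When was Cave 6 constructed?"
print(top_fragments(query, fragments))
```

Only the matching fragment is returned, so only that fragment (rather than the whole PDF document) would be passed to the large language model.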

In addition to the previously installed langchain-unstructured interface package, since we are going to demonstrate how to call the Unstructured tool through an API here, we also need to obtain an Unstructured API Key and set the UNSTRUCTURED_API_KEY environment variable. The following code example shows how to use UnstructuredLoader to parse a PDF document:

from langchain_unstructured import UnstructuredLoader

file_path = "data/Shanxi Cultural Tourism/Yungang Grottoes-en.pdf"
loader = UnstructuredLoader(
    file_path=file_path,
    strategy="hi_res",
    partition_via_api=True,  # If you are calling the local Unstructured tool, comment out this line and the next line
    coordinates=True,  # Call the Unstructured tool via API and return element coordinates
)
docs = []
for doc in loader.lazy_load():
    docs.append(doc)

The resulting Document objects contain not only the text content but also structural information.

Tips from Lewis

strategy="hi_res" means “high-resolution mode.” It attempts to capture more details in the document, especially complex layouts, tables, coordinates, and similar content.

When using this strategy to process documents, it is suitable for PDF files with very complex formats, such as files containing images, tables, graphics, and multiple columns of text. This mode leverages more advanced technologies, such as OCR analysis and high-resolution image parsing.

If the fast mode is used instead (by setting strategy="fast"), processing is faster and consumes fewer resources. It can complete basic parsing tasks, but its support for complex layouts is less detailed, making it more suitable for PDF documents with relatively simple formats. When no strategy is specified, the open-source Unstructured library uses "auto", which selects a strategy based on the document.

Next, extract the document structure of each page using the following function:

def extract_basic_structure(docs):
    """Basic structure extraction: organize content by document type"""
    # Define category mapping
    category_map = {
        'Title': 'title',
        'NarrativeText': 'text',
        'Image': 'image',
        'Table': 'table',
        'Footer': 'footer',
        'Header': 'header'
    }
    # Initialize structure dictionary
    structure = {cat: [] for cat in category_map.values()}
    structure['metadata'] = []  # Add metadata category
    # Iterate over documents and classify
    for doc in docs:
        category = doc.metadata.get('category', 'Unknown')
        content = {
            'content': doc.page_content,
            'page': doc.metadata.get('page_number'),
            'coordinates': doc.metadata.get('coordinates')
        }
        target_category = category_map.get(category)
        if target_category:
            structure[target_category].append(content)
    return structure
# Call the function to extract the document structure
structure = extract_basic_structure(docs)
# Output the elements categorized as Title on page 2
print('Title on page 2:')
for title in [t for t in structure['title'] if t['page'] == 2]:
    print(f'- {title["content"]}')

If you look at this page of the PDF document, you can see the title “Deterioration and Conservation”, so Unstructured classified this title element correctly. However, the Unstructured tool sometimes misclassifies elements, labeling items that do not appear to be titles as titles.

The following code example can be used to display the layout of all elements on a page:

def analyze_layout(docs):
    """Analyze the document's layout structure"""
    layout_analysis = {}
    for doc in docs:
        page = doc.metadata.get('page_number')
        coords = doc.metadata.get('coordinates') or {}
        # Initialize page information
        if page not in layout_analysis:
            layout_analysis[page] = {
                'elements': [],
                'dimensions': {
                    'width': coords.get('layout_width', 0),
                    'height': coords.get('layout_height', 0)
                }
            }
        # Get element position information
        points = coords.get('points', [])
        if points:
            # Only need the top-left and bottom-right coordinate points
            (x1, y1), (_, _), (x2, y2), _ = points
            # Construct element information
            element = {
                'type': doc.metadata.get('category'),
                'content': (doc.page_content[:50] + '...') if len(doc.page_content) > 50 else doc.page_content,
                'position': {
                    'x1': x1, 'y1': y1,
                    'x2': x2, 'y2': y2,
                    'width': x2 - x1,
                    'height': y2 - y1
                }
            }
            layout_analysis[page]['elements'].append(element)
    return layout_analysis
## Call the Function to Analyze the Document Layout
layout = analyze_layout(docs)

Next, output the content of the page layout on page 1:

print("Page 1 layout analysis:")
if 1 in layout:
    page = layout[1]
    print(f"Page size: {page['dimensions']['width']} x {page['dimensions']['height']}")
    print("\nElement distribution:")
    ## Sort and display elements by vertical position
    for elem in sorted(page['elements'], key=lambda x: x['position']['y1']):
        print(f"\nType: {elem['type']}")
        print(f"Position: ({elem['position']['x1']:.0f}, {elem['position']['y1']:.0f})")
        print(f"Size: {elem['position']['width']:.0f} x {elem['position']['height']:.0f}")
        print(f"Content: {elem['content']}")

The preceding code sorts and displays elements by vertical position:

Page 1 layout analysis:
Page size: 1700 x 2200
Element distribution:
Type: Header
Position: (827, 41)
Size: 304 x 30
Content: Yungang Grottoes - Wikipedia
Type: Image
Position: (98, 104)
Size: 427 x 142
Content: 4y WIKIPEDIA [ 1 The Free Encyclopedia WIKIPEDIA
Type: Title
Position: (1120, 411)
Size: 326 x 43
Content: Yungang Grottoes......

Document elements may have parent-child relationships (for example, a paragraph may belong to a section with a heading), and you can determine whether it belongs to the target section by checking each element’s category and content. For instance, to extract content under a specific heading (such as the introduction about “Cave 6” on page 3, as shown in the following figure), you can use the code example that follows.

Figure 1.20: Air pollutant studies at Yungang Grottoes and Cave 6’s art

The following code collects the elements whose parent_id matches the element_id of the “Cave 6” title:

cave6_docs = []
parent_id = -1
for doc in docs:
    if doc.metadata["category"] == "Title" and "Cave 6" in doc.page_content:
        parent_id = doc.metadata["element_id"]
    if doc.metadata.get("parent_id") == parent_id:
        cave6_docs.append(doc)
for doc in cave6_docs:
    print(doc.page_content)

Here’s how the site description would appear in the output:

Cave 6 is one of the richest of the Yungang sites. It was constructed between 465 and 494 C.E. by The entire Emperor Xiao Wen. The cave's surface area is approximately 1,000 square meters. interior of the cave is carved and painted. There is a stupa pillar in the center of the room extending from the floor to the ceiling. The walls are divided into two stories. The walls of the upper stories are host to carvings of standing Buddhas, Bodhisattvas, and monks among other celestial figures. All of the carvings were painted, but because the caves have been repainted evidently up to twelve times, determining the original scheme is difficult.
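The loop above handles one heading at a time. As a hedged generalization, the following sketch groups every element under its parent title in a single pass. The Element class is a hypothetical stand-in for a LangChain Document (only page_content and metadata are imitated), the IDs and the placeholder section text are made up for illustration, and titles with identical text would be merged under one key.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_title(docs):
    """Map each Title's text to the texts of elements whose parent_id points at it."""
    titles = {
        d.metadata["element_id"]: d.page_content
        for d in docs
        if d.metadata.get("category") == "Title"
    }
    children = defaultdict(list)
    for d in docs:
        parent = d.metadata.get("parent_id")
        if parent in titles:
            children[titles[parent]].append(d.page_content)
    return dict(children)


docs = [
    Element("Cave 6", {"category": "Title", "element_id": "t1"}),
    Element("Cave 6 is one of the richest of the Yungang sites.",
            {"category": "NarrativeText", "element_id": "n1", "parent_id": "t1"}),
    Element("Deterioration and Conservation", {"category": "Title", "element_id": "t2"}),
    Element("(section text)",
            {"category": "NarrativeText", "element_id": "n2", "parent_id": "t2"}),
]
for title, texts in group_by_title(docs).items():
    print(title, "->", texts)
```

Each title together with its child texts forms a natural chunk for later embedding.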

Visualizing layout with PyMuPDF and coordinate information

So far, we have obtained detailed element information with coordinates. Next, we can use this data to perform fine-grained layout analysis and visualization.

In this section, we will combine the PyMuPDF library and the coordinate information parsed by UnstructuredLoader to visualize PDF pages and label content regions (such as titles, images, and tables) for easier understanding of the PDF page’s layout structure or for processing information about specific sections.

Tips from Lewis

PyMuPDF is a library widely used for PDF document operations, supporting efficient reading, modification, and rendering of PDF files. PyMuPDF can open and read PDF documents, extract text and images from pages, and access page layout details (such as paragraph coordinates and image positions). At the same time, it also supports converting PDF pages into bitmap formats and allows operations such as scaling and rotation. In addition, PyMuPDF supports modifying PDF documents. You can add text, images, or graphical elements to a PDF and annotate existing documents, such as highlighting text or adding comments, so that PDF layout analysis results can be displayed in other applications.

The following code demonstrates how to use PyMuPDF to read a PDF page and convert it to an image, then use matplotlib to draw the PDF page and add rectangular boxes to mark section regions. Different box colors are set according to the section category (such as “Title”, “Image”, “Table”):

import fitz  # PyMuPDF library, used for processing PDF files
import matplotlib.patches as patches  # Used to draw polygons on images
import matplotlib.pyplot as plt  # Matplotlib library, for plotting
from PIL import Image  # For image processing
def render_pdf_page(file_path, doc_list, page_number):
    # Open the PDF document and load the specified page
    pdf_doc = fitz.open(file_path)
    pdf_page = pdf_doc.load_page(page_number - 1)
    segments = [doc.metadata for doc in doc_list if doc.metadata.get('page_number') == page_number]
    # Convert the PDF page to bitmap format
    pix = pdf_page.get_pixmap()
    pil_image = Image.frombytes('RGB', [pix.width, pix.height], pix.samples)
    # Create a plotting environment
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(pil_image)
    # Define category-to-color mapping
    category_to_color = {'Title': 'orchid', 'Image': 'forestgreen', 'Table': 'tomato'}
    categories = set()
    # Draw section annotation boxes
    for segment in segments:
        points = segment['coordinates']['points']
        layout_width = segment['coordinates']['layout_width']
        layout_height = segment['coordinates']['layout_height']
        category = segment.get('category', 'Other')
        color = category_to_color.get(category, 'gold')
        # Unpack coordinates to box
        x0, y0 = points[0]
        x1, y1 = points[2]
        width = x1 - x0
        height = y1 - y0
        rect = patches.Rectangle((x0, y0), width, height, linewidth=2, edgecolor=color, facecolor='none')
        ax.add_patch(rect)
        # Annotate category
        ax.text(x0, y0 - 5, category, fontsize=10, color=color, weight='bold', backgroundcolor='white')
        categories.add(category)
    plt.axis('off')
    plt.show()

In the preceding code, the function render_pdf_page will open the given PDF file, render the specified page using PyMuPDF, and draw rectangles on regions specified by your element metadata. Each type of region (title, image, table, etc.) receives a different color for visual clarity. Make sure the coordinate information in your doc_list matches the coordinate system of the PDF rendering for accurate annotation.

The following fragment refines the annotation loop of render_pdf_page: it scales the layout coordinates to the size of the rendered bitmap and builds a matching legend. It replaces the loop and the plt.axis('off') call, and the final plt.show() call stays in place:

    for segment in segments:
        points = segment['coordinates']['points']
        layout_width = segment['coordinates']['layout_width']
        layout_height = segment['coordinates']['layout_height']
        # Scale layout coordinates to the pixel size of the rendered page
        scaled_points = [(x * pix.width / layout_width, y * pix.height / layout_height) for x, y in points]
        box_color = category_to_color.get(segment['category'], 'deepskyblue')
        categories.add(segment['category'])
        rect = patches.Polygon(scaled_points, linewidth=1, edgecolor=box_color, facecolor='none')
        ax.add_patch(rect)
    # Add legend
    legend_handles = [patches.Patch(color='deepskyblue', label='Text')]
    for category, color in category_to_color.items():
        if category in categories:
            legend_handles.append(patches.Patch(color=color, label=category))
    ax.axis('off')
    ax.legend(handles=legend_handles, loc='upper right')
    plt.tight_layout()

Since the original paragraph coordinates are based on the layout ratio of the PDF page, they need to be scaled according to the actual pixel width and height of the page. After specifying the page number, the program will filter out the paragraphs that belong to that page from the document list and draw the annotation boxes for those paragraphs on the page.
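As a small worked example of this scaling, the sketch below maps the corner points of the Image element from page 1 (layout size 1700 x 2200) onto a smaller bitmap. The function name scale_points is hypothetical, and the 612 x 792 bitmap size is an assumption (a US Letter page rendered at 72 dpi), not a value taken from the rendering code.

```python
def scale_points(points, layout_width, layout_height, image_width, image_height):
    """Map layout-space (x, y) points onto an image of a different pixel size."""
    scale_x = image_width / layout_width
    scale_y = image_height / layout_height
    return [(x * scale_x, y * scale_y) for x, y in points]


# Image element on page 1: top-left at (98, 104), size 427 x 142
points = [(98, 104), (98, 246), (525, 246), (525, 104)]

# Assumed bitmap size: 612 x 792 pixels
for x, y in scale_points(points, 1700, 2200, 612, 792):
    print(f"({x:.1f}, {y:.1f})")
# (35.3, 37.4)
# (35.3, 88.6)
# (189.0, 88.6)
# (189.0, 37.4)
```

Both axes shrink by the same factor here (0.36), so the box keeps its proportions on the rendered page.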

You can call render_pdf_page to display the layout of page 1 using the following code:

render_pdf_page(file_path, docs, 1)

The annotated page is rendered as follows:

Figure 1.21: A textbook page with photos, diagram, and text discussing the Yungang Grottoes in China

Alex: Oh, I see. According to the layout, we can organize similar types of information together. For example, we can pass all the elements in the green image layout group in the figure to the large model as a whole, in order to generate question and answer content related to the image (Yungang Grottoes). Without precise layout analysis, this would be quite difficult.

Lewis: Really smart.

Using UnstructuredLoader to parse tables in PDF pages

Next, let’s look at reading table information from PDF pages. The PDF document discussed previously did not contain any tables, so we will switch to a PDF document that includes tables. The 12th page of this file contains data about major cities in Shanxi Province (data sourced from Wikipedia).

First, we will use the same method as before to parse the layout elements of page 12 of the PDF document, shown in the following figure, by calling the statement render_pdf_page(file_path,docs, 12) and visualizing them.

Figure 1.22: A table displays urban populations of cities in Shanxi, China, for 2020 and 2010 plus city proper

The highlighted element in the table indicates that the entire table has been successfully parsed and that the element type is Table:

Figure 1.23: Table listing urban populations for cities in Shanxi, China, with 2020 and 2010 data

Next, we display the metadata of all the elements on page 12:

page_number = 12
page_docs = [doc for doc in docs if doc.metadata.get('page_number') == page_number]
for doc in page_docs:
    print('Metadata:')
    for key, value in doc.metadata.items():
        print(f'  {key}: {value}')

A portion of the output is shown here:

Figure 1.24: Screenshot of metadata for three PDF elements showing file path IDs and categories

Although a lot of metadata information is output here, the key point is that we can see the category in the metadata contains Table, and this Table element has a parent_id. The parent_id links to the table’s title. This is very important because a table cannot exist independently from its associated title. The table element may contain only numbers, while the table title might indicate the meaning of these numbers.

For example, when comparing the GDP of two groups of cities in Shanxi Province, if one table is titled “2024 GDP of Each City” and another table is titled “2025 GDP of Each City”, you must link the elements within the table to those of their corresponding titles. This process is a necessary step in your RAG system. Otherwise, simply possessing the figures without knowing their corresponding years will lead to retrieval results that lack accuracy.

Integrating content under the same title using ParentID

If you need to integrate a table with the title text above it, you can achieve this by following these steps:

Filter by page_number: Filter out all elements on a specific page (such as page 12)
Classify by category: Identify elements of types Table and Title, and determine whether the Title is above the Table (by comparing their y coordinate values)
Integrate tables and titles: Combine the table with its nearest title into one structure and output the integrated information
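The three steps above can be sketched as follows. This is an illustration under stated assumptions rather than the book's implementation: elements are plain dictionaries instead of Document objects, the function names top_y and pair_tables_with_titles are hypothetical, the coordinates and placeholder table texts are invented, and y values are assumed to grow downward (as in a pixel coordinate system).

```python
def top_y(element):
    """Smallest y value of the corner points, i.e. the top edge when y grows downward."""
    return min(y for _, y in element["points"])


def pair_tables_with_titles(elements, page_number):
    """For each Table on the page, attach the nearest Title located above it."""
    # Step 1: filter by page number
    on_page = [e for e in elements if e["page_number"] == page_number]
    # Step 2: classify by category
    titles = [e for e in on_page if e["category"] == "Title"]
    tables = [e for e in on_page if e["category"] == "Table"]
    # Step 3: integrate each table with the closest title above it
    pairs = []
    for table in tables:
        above = [t for t in titles if top_y(t) < top_y(table)]
        nearest = max(above, key=top_y) if above else None
        pairs.append({
            "title": nearest["text"] if nearest else None,
            "table": table["text"],
        })
    return pairs


elements = [
    {"category": "Title", "text": "2024 GDP of Each City", "page_number": 12,
     "points": [(100, 100), (100, 130), (600, 130), (600, 100)]},
    {"category": "Table", "text": "(2024 figures)", "page_number": 12,
     "points": [(100, 150), (100, 500), (900, 500), (900, 150)]},
    {"category": "Title", "text": "2025 GDP of Each City", "page_number": 12,
     "points": [(100, 600), (100, 630), (600, 630), (600, 600)]},
    {"category": "Table", "text": "(2025 figures)", "page_number": 12,
     "points": [(100, 650), (100, 1000), (900, 1000), (900, 650)]},
]
for pair in pair_tables_with_titles(elements, 12):
    print(pair["title"], "->", pair["table"])
```

This geometric approach works only when the title sits directly above its table, which is one reason relying on stored parent-child relationships is attractive.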

Alex: However, this logic seems a bit complicated to implement.

Lewis: Indeed. Since the Unstructured tool automatically saves parent-child relationships, a more direct approach is to locate elements with the category “Table”. For each table, find its parent element corresponding to the parent_id, and then output the combined content of the table and its parent element.

Next, use the following function to locate each table together with its parent element and output them as a whole:

def find_tables_and_titles(docs):
    results = []
    for doc in docs:
        # Check if the document is of table type
        if doc.metadata.get('category') == 'Table':
            table = doc
            parent_id = doc.metadata.get('parent_id')
            # Find the title document corresponding to the table (parent_id matches element_id)
            title = next((d for d in docs if d.metadata.get('element_id') == parent_id), None)
            if title:
                results.append({'table': table.page_content, 'title': title.page_content})
    return results
results = find_tables_and_titles(page_docs)
if results:
    for result in results:
        print('Found table and title:')
        print(f"Title: {result['title']}")
        print(f"Table: {result['table']}")
else:
    print('No tables and titles found')

The table content in the output lists the urban areas of Shanxi Province and their populations:

#	Cities	2020 Urban area	2010 Urban area	2020 City proper
1	Taiyuan	4,071,075	3,154,157	5,304,061
2	Datong	1,792,696	1,362,314	3,105,591
3	Changzhi	1,168,042	653,125	3,180,884
4	Jinzhong	900,569	444,002	3,379,498
5	Linfen	696,393	571,237	3,976,481
6	Yuncheng	692,003	432,554	4,774,508
7	Yangquan	647,272	623,671	1,318,505
8	Jincheng	574,665	476,945	2,194,545
9	Shuozhou	420,829	381,566	1,593,444
10	Xinzhou	384,424	279,875	2,689,668
11	Xiaoyi	337,489	268,253	see Lüliang
12	Lüliang	335,285	250,080	3,398,431
13	Jiexiu	291,393	232,269	see Jinzhong
14	Huairen	247,612		see Shuozhou
15	Gaoping	243,544	213,460	see Jincheng
16	Yuanping	227,046	202,562	see Xinzhou
17	Hejin	225,809	175,824	see Yuncheng
18	Fenyang	207,473	149,222	see Lüliang
19	Huozhou	183,575	156,853	see Linfen
20	Yongji	182,248	179,028	see Yuncheng
21	Houma	175,373	137,020	see Linfen
22	Gujiao	159,593	146,161	see Taiyuan

This way, we have successfully associated the data in the table with its header information. With the header information, we can retrieve relevant tables based on the user’s question.

Let’s look at another example of a parent-child relationship combination. In the following screenshot, you can see that there are a total of four child elements under the Title element External links. If you need to combine these four child elements with their parent element (i.e., the title) and output them as an integrated chunk, you can do so through the parent-child relationship.

The following code example demonstrates how to integrate these related pieces of information:

external_docs = [] # Create a list to store child documents of external links
parent_id = -1 # Initialize parent_id as -1
for doc in docs:
    # Check if the document is of type Title and its content contains 'External links'
    if doc.metadata['category'] == 'Title' and 'External links' in doc.page_content:
        parent_id = doc.metadata['element_id']
        external_docs.append(doc)
    # Check if the document's parent_id matches the ID of the title we found
    if doc.metadata.get('parent_id') == parent_id:
        external_docs.append(doc) # Add all child documents belonging to this title to the result list
for doc in external_docs:
    print(doc.page_content)

With that we complete our discussion on parsing, loading, and generating PDF files.
