For most RAG systems, parsing PDF files is a key step in building the system. PDF files not only contain textual information but may also include elements such as tables and images, making their parsing more challenging compared to other types of documents.
Currently, common parsing methods for handling PDF files can be roughly divided into three categories: rule-based parsing, deep learning-based parsing, and multimodal large model-based parsing.
PDF parsers that use these methods may perform the following operations:
Recombine separated text boxes into logical units, such as lines or paragraphs, through heuristic methods or machine learning techniques
Apply OCR technology to images in the file to recognize and extract the text within them
Classify textual content to determine whether it belongs to paragraphs, lists, tables, or other structures
Organize the extracted text into table formats, or present data in key-value pairs
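As a rough illustration of the first operation in the list above, the following sketch regroups separated text boxes into lines with a simple heuristic: boxes whose top edges are close together are treated as one line and read left to right. The TextBox class, the group_into_lines function, and the tolerance value are hypothetical names and settings chosen for this example; real parsers use more elaborate rules or trained models.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical text box: a string plus the position of its top-left corner."""
    text: str
    x: float       # left edge
    y: float       # top edge
    height: float  # box height, used to scale the tolerance


def group_into_lines(boxes, tolerance=0.5):
    """Group boxes with similar vertical positions into lines, read left to right."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        # Same line if the top edge is within a fraction of the box height
        if lines and abs(box.y - lines[-1][0].y) <= tolerance * box.height:
            lines[-1].append(box)
        else:
            lines.append([box])
    return [" ".join(b.text for b in sorted(line, key=lambda b: b.x)) for line in lines]


boxes = [
    TextBox("Grottoes", 120, 100.5, 12),
    TextBox("Yungang", 40, 100, 12),
    TextBox("UNESCO", 40, 130, 12),
    TextBox("site", 110, 129.8, 12),
]
lines = group_into_lines(boxes)
print(lines)
```

The tolerance is expressed relative to the box height so that the same setting can work for both small and large fonts.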
Tips from Lewis
Many modern large models now support multimodal input and can directly process multimedia files such as images and PDFs. We already provided an example earlier.
In some application scenarios, especially those requiring question answering and analysis of PDF documents with complex layouts, charts, or illustrations, the PDF documents can be passed directly to the large model for understanding, without first converting them into a simpler format.
Using loading tools for PDF files
LangChain can integrate with a variety of PDF parsers. Some are simple and relatively basic, suitable for lightweight text parsing scenarios, while others support OCR, mathematical formula processing and image analysis, or advanced document layout analysis.
| Parser | Description | Package/API | Features |
|---|---|---|---|
| PyPDF | Loads and parses PDF files using pypdf | Package | Efficient and lightweight, suitable for handling simple PDF files |
| Unstructured | Loads PDF files using the open-source Unstructured tool library | Package/API | Supports multiple document formats, has content extraction and analysis capabilities |
| Amazon Textract | Loads PDF files via AWS API | API | Provides cloud service support, suitable for large-scale document OCR processing |
| Mathpix | Loads and parses PDF files using MathPix | API | Specially designed for mathematical formulas, can accurately parse complex content |
| PDFPlumber | Loads PDF files using PDFPlumber | Package | Provides rich PDF content control and processing functions |
| PyPDFDirectory | Loads PDF files in directories | Package | Supports batch loading, convenient for processing multiple PDF files |
| PyPDFium2 | Loads PDF files using PyPDFium2 | Package | Efficient parsing, supports rendering and conversion of PDF pages |
| PyMuPDF | Loads PDF files using PyMuPDF | Package | Speed-optimized, supports fine-grained processing of complex PDF files |
| PDFMiner | Loads PDF files using PDFMiner | Package | Suitable for text extraction, especially adept at handling PDFs containing embedded text |
Table 1.2: Parser functions and their features
The difficulty of deploying and using these tools can be analyzed from several perspectives, such as the following:
Local deployment type: Tools such as PyPDF, PDFPlumber, and PDFMiner are all Python libraries, so their installation and use are relatively simple. Such tools usually only require installation via pip or another package manager to get started quickly, making them suitable for users who want to avoid complicated configuration.
API service type: Amazon Textract and MathPix belong to this category. They require applying for an API key and often involve paid usage. Although these services offer powerful functions, such as batch document processing and mathematical formula parsing, the barrier to entry is relatively high.
Hybrid type: Unstructured, as an open-source library, can be used directly for its basic functions, but fully utilizing all its features may require additional service support.
From the perspective of functional characteristics, PyPDF is a lightweight tool that provides basic PDF text extraction; PDFPlumber excels at processing table data and shows strong capability in layout analysis; PyMuPDF offers comprehensive functions, supporting PDF rendering, editing, and fine-grained processing of complex documents; Amazon Textract has OCR capabilities and is especially suitable for scanned documents; MathPix is designed for mathematical formula recognition; PDFMiner has very powerful underlying parsing capabilities, being able to precisely locate text positions.
From the performance perspective, PyPDF is known for its fast processing speed but performs only moderately in terms of accuracy; PyMuPDF excels in both performance and accuracy; Unstructured performs well in handling complex layouts, and thus has been chosen as the default loader for LangChain’s DirectoryLoader. Its API version also provides high-accuracy parsing services, though network conditions may affect processing speed.
From the perspective of application scenarios, for simple document text parsing, PyPDF is sufficient; if you need to process table data in PDF files, PDFPlumber is recommended; when faced with complexly formatted PDF files, PyMuPDF or Unstructured are better choices; for dealing with scanned documents, Amazon Textract is an ideal choice; for math PDF documents with many formulas, MathPix is recommended.
Overall, if you need to process PDF files in bulk at scale, PyMuPDF stands out as particularly well balanced due to its comprehensive features and high efficiency.
PDF parsing is a broad topic. Next, we explore some basic approaches. Once you have practiced with the implementation details of each tool, mastering them will not be difficult.
Simple text extraction with PyPDFLoader
If you only need to extract the embedded text as simple string representations from a PDF file, you can use the PyPDFLoader method. This method returns a list of Document objects, with each page corresponding to a Document object. The extracted text will be stored in the page_content attribute of the Document object.
Figure 1.14: Two traditional statues are displayed side by side each depicting a robed figure with unique features
Install the pypdf package with the following command:
pip install pypdf
Note that this method does not parse images or scanned PDF pages (that is, it has no OCR support). The following code loads a PDF file and prints the text of each page:
from langchain_community.document_loaders import PyPDFLoader

file_path = "data/black myth/Kang Jinlong and Lou Jingou.pdf"
loader = PyPDFLoader(file_path)
pages = loader.load()  # one Document per page
print(f"Loaded {len(pages)} page(s) of PDF document")
for page in pages:
    print(page.page_content)
The preceding code returns standard text content:
Loaded 1 page(s) of PDF document
Some characters in the game, such as Kang Jinlong (left) and Lou Jingou (right), draw their inspiration from the painted sculptures at the Jade Emperor Temple in Jincheng, Shanxi. The character Kang Jinlong appears in both human and dragon form, serving as a boss enemy.
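Because this loader has no OCR support, a scanned page typically comes back with little or no text. The following sketch shows one way to flag such pages so that they can be routed to an OCR-capable parser. PageDoc is a hypothetical stand-in that mimics only the page_content and metadata attributes of a Document object, and the pages_needing_ocr helper, the "page" metadata key, and the 20-character threshold are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in exposing the two attributes used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_needing_ocr(pages, min_chars=20):
    """Return the page indices whose extracted text is too short to be useful."""
    return [
        page.metadata.get("page")
        for page in pages
        if len(page.page_content.strip()) < min_chars
    ]


pages = [
    PageDoc("Some characters in the game draw their inspiration from painted sculptures.", {"page": 0}),
    PageDoc("   \n", {"page": 1}),  # for example, a scanned page with no embedded text
]
flagged = pages_needing_ocr(pages)
print(flagged)
```

With real data, the same function can be applied to the list returned by loader.load(), provided the metadata key matches what your loader version emits.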
In the next section, we will see how to convert PDF documents to Markdown format.
Using the Marker tool to convert PDF documents to Markdown format
When dealing with PDF documents containing structured content, such as hierarchical headings used to organize and logically structure the content, the ideal approach is to preserve this hierarchy while parsing the PDF document. For such needs, converting PDF documents to Markdown format is a good choice. Standardizing all types of text to Markdown format helps simplify subsequent processing and analysis steps.
Figure 1.15: Wikipedia page about Yungang Grottoes with a photo of large Buddha statues carved in rock
In the process of converting PDF documents to Markdown format, at a minimum, the following key elements should be preserved:
Markdown heading hierarchy: In the document, headings are used to organize the content of different sections. This hierarchy not only helps readers navigate quickly, but also enhances the comprehensibility, overall readability, and organization of the document.
Image and text structure: Given that many PDF documents contain charts or multi-column layouts to further explain textual content, it is necessary to parse images during the conversion process and save them in a dedicated image directory, then embed them in the Markdown file. Meanwhile, appropriate formatting should be used to retain the presentation of tabular data, ensuring the accuracy and completeness of information delivery.
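To see why a preserved heading hierarchy is useful downstream, consider the following minimal sketch, which splits Markdown text into sections and tags each one with its heading path. The split_by_headings function and the sample text are illustrative assumptions, not part of Marker or LangChain, and the sketch ignores edge cases such as "#" characters inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown_text):
    """Split Markdown into sections, each tagged with its heading path."""
    sections, path, buffer = [], [], []

    def flush():
        # Store the text collected so far under the current heading path
        body = "\n".join(buffer).strip()
        if body:
            sections.append({"path": " > ".join(path), "text": body})
        buffer.clear()

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return sections


sample = """# Yungang Grottoes
Introductory text.
## History
Text about the history.
## Conservation
### Cave 6
Text about Cave 6.
"""
sections = split_by_headings(sample)
for section in sections:
    print(section["path"], "->", section["text"])
```

Each resulting section carries its full heading path, which can later be stored as metadata alongside the text chunk.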
The following figure shows a document about Yungang Grottoes listing UNESCO site details in a table:
Figure 1.16: Screenshot of a document about Yungang Grottoes listing UNESCO site details in a table
Lewis: Although there are many open-source and commercial tools for converting PDF to Markdown, I personally recommend Marker, a tool with a very good user experience (another highly regarded tool is Docling). Marker can effectively remove headers, footers, and other irrelevant content, supports formatting of tables and code blocks, and converts most formulas into LaTeX format, which is especially useful for scientific papers. In addition, it can accurately extract images. The core of Marker comprises a series of deep learning models, specifically designed for text extraction, OCR, page layout detection, and formatting cleanup. Marker selects the most suitable model according to the specific format of the PDF file to be parsed, aiming for a good balance between parsing speed and accuracy.
Alex: Lewis, this is the second type of PDF parsing method you mentioned at the beginning of this section: deep learning-based parsing.
The following code example demonstrates how to use Marker to parse PDF documents.
First, install Marker with the following command:
pip install marker-pdf
Next, you can directly use the command line to parse PDF files:
marker_single "data/Shanxi Cultural Tourism/Yungang Grottoes-en.pdf"
In addition, you can also use the following code example to parse PDF files (refer to https://github.com/PacktPublishing/RAG-from-First-Principles for the complete code):
import os
import subprocess


def convert_pdf_to_markdown(input_pdf_path, output_folder, batch_multiplier=2, max_pages=12):
    # Create the output folder if it does not exist yet
    os.makedirs(output_folder, exist_ok=True)
    command = [
        'marker_single',
        input_pdf_path,
        output_folder,
        f'--batch_multiplier={batch_multiplier}',
        f'--max_pages={max_pages}',
    ]
    try:
        # check=True raises CalledProcessError if marker_single exits with an error
        subprocess.run(command, check=True)
        print(f"PDF document successfully converted to Markdown format, files saved to {output_folder}")
    except subprocess.CalledProcessError as e:
        print(f"PDF document conversion failed: {e}")


if __name__ == "__main__":
    input_pdf_path = "data/Shanxi Cultural Tourism/Yungang Grottoes-en.pdf"
    output_folder = "data/marker/output/Yungang Grottoes-en"
    convert_pdf_to_markdown(input_pdf_path, output_folder)
The parsing operation returns the following output:
Loaded detection model vikp/surya_det3 on device cuda with dtype torch.float16
Loaded detection model vikp/surya_layout3 on device cuda with dtype torch.float16
Loaded reading order model vikp/surya_order on device cuda with dtype torch.float16
Loaded recognition model vikp/surya_rec2 on device cuda with dtype torch.float16
Loaded texify model to cuda with torch.float16 dtype
Loaded recognition model vikp/surya_tablerec on device cuda with dtype torch.float16
Detecting bboxes: 100%|█████████████| 1/1 [00:01<00:00, 1.10s/it]
Recognizing Text: 100%|█████████████| 4/4 [03:18<00:00, 49.75s/it]
Detecting bboxes: 100%|█████████████| 1/1 [00:02<00:00, 2.16s/it]
Finding reading order: 100%|██████████| 1/1 [00:00<00:00, 2.05it/s]
Recognizing tables: 100%|████████████| 1/1 [00:07<00:00, 7.89s/it]
Saved markdown to the data/marker/output/Yungang-Grottoes-en folder
Total time: 238.9289002418518
PDF document was successfully converted to Markdown format, file has been saved to data/marker/output
After running the program, a Markdown file will be generated, along with a series of parsed PNG image files and a JSON file containing metadata information.
Figure 1.17: A file explorer shows two folders with images and documents, mainly PNGs and a JSON file
Open the Yungang Grottoes-en.md file, and you can see the expected content.
Figure 1.18: Screenshot compares Markdown code on the left with its formatted preview of the Yungang Grottoes on the right
Marker has achieved automatic conversion from PDF documents to Markdown format, accurately preserving the formatting information from the original PDF document. Additionally, it provides flexible configuration options, including batch processing and page number limits, allowing users to adjust performance and resource usage according to their needs.
This Markdown-format document can not only be read as a Document object by frameworks such as LangChain or LlamaIndex, but also serve as the original material for RAG-based knowledge bases.
Structured parsing with UnstructuredLoader
In the previous section, we explored how to use Marker to parse PDF documents into Markdown format. However, in some cases, simply relying on this conversion method may not meet all requirements. For example, when it’s necessary to segment the text at a finer granularity (such as dividing by paragraphs, headings, or table structures), or to extract text from images containing text, more detailed approaches are required.
As we know, the UnstructuredLoader provided by LangChain returns a list composed of Document objects. Each Document object represents an independent structure or element on the page and contains rich metadata, which greatly facilitates subsequent document analysis and processing.
The following figure shows a JSON file containing the description of the Yungang Grottoes’ coordinates and layout info:
Figure 1.19: A screenshot of JSON data describing the Yungang Grottoes coordinates and layout info
In such a data structure, not only is basic metadata (such as page number and text content) preserved, but also detailed layout information (such as the coordinates and types of elements):
Basic metadata: Includes page_number (page number information), category (used to distinguish different types of content, such as “NarrativeText” for narrative text), content (text content, for example, basic information describing the Yungang Grottoes), element_id (the unique identifier of the element), and parent_id (the identifier of the parent element, which helps to understand the hierarchical structure of the document and facilitates the structured processing of document content).
Layout information (coordinates): Includes the coordinates of the top-left, bottom-left, top-right, and bottom-right corners, the coordinate system type, and page width and height information. With these coordinates, precise layout analysis can be performed to determine the exact position of the content, extract content from specific areas, or reconstruct the visual layout of the document. This supports location-based content filtering, sorting, and layout reconstruction and rearrangement.
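As a small illustration of location-based filtering, the following sketch keeps only the elements whose bounding box lies inside a given region of the page. The elements are plain dictionaries that imitate the metadata shape described above; the helper names bounding_box and elements_in_region, the "text" key, and the corner ordering of the sample points are assumptions made for this example.

```python
def bounding_box(points):
    """Return (x_min, y_min, x_max, y_max) for a list of corner points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def elements_in_region(elements, region):
    """Keep elements whose box lies entirely inside region=(x_min, y_min, x_max, y_max)."""
    rx0, ry0, rx1, ry1 = region
    kept = []
    for element in elements:
        x0, y0, x1, y1 = bounding_box(element["coordinates"]["points"])
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1:
            kept.append(element)
    return kept


# Sample elements on a 1700 x 2200 layout (illustrative values)
elements = [
    {"category": "Header", "text": "Yungang Grottoes - Wikipedia",
     "coordinates": {"points": [(827, 41), (827, 71), (1131, 71), (1131, 41)],
                     "layout_width": 1700, "layout_height": 2200}},
    {"category": "Title", "text": "Yungang Grottoes",
     "coordinates": {"points": [(1120, 411), (1120, 454), (1446, 454), (1446, 411)],
                     "layout_width": 1700, "layout_height": 2200}},
]

# Select everything in the top 200 layout units of the page
top_strip = elements_in_region(elements, (0, 0, 1700, 200))
print([element["category"] for element in top_strip])
```

A filter like this could be used, for instance, to drop running headers before the text is chunked.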
Given this information, a sufficiently capable large model may be able to reconstruct the layout of the original PDF, which can improve the efficiency and accuracy of document processing.
Alex: Lewis, I don’t understand this part. Since we have to restore it sooner or later, why are we putting so much effort into breaking down the PDF document?
Lewis: This means you haven’t truly understood the RAG system. The purpose of breaking it down is to make the subsequent vectorization process more refined; only by splitting the document into independent elements can we accurately retrieve information in the RAG system according to the user’s query. For example, when a user asks what year the Yungang Grottoes were built, we need to quickly locate the relevant document fragment rather than pass the entire PDF document to the large language model, which would be both less accurate and wasteful of tokens. The restoration during the generation process serves to present a complete, visually rich answer to the user.
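Lewis’s point can be illustrated with a deliberately simplified sketch. Instead of vector embeddings, it scores each fragment by the number of words it shares with the query and returns the best match, which is enough to show why small, independent fragments are easier to retrieve than a whole document. The tokenize and best_fragment functions and the sample fragments are illustrative assumptions; a real RAG system would use embeddings and a vector store.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_fragment(query, fragments):
    """Return the fragment sharing the most tokens with the query (toy scoring)."""
    query_tokens = tokenize(query)
    return max(fragments, key=lambda fragment: len(query_tokens & tokenize(fragment)))


# Illustrative fragments, standing in for independently parsed elements
fragments = [
    "Yungang Grottoes - Wikipedia",
    "Cave 6 was constructed between 465 and 494 C.E.",
    "All of the carvings were painted.",
]
answer = best_fragment("When was Cave 6 constructed?", fragments)
print(answer)
```

Only the single matching fragment would then be passed to the large language model, rather than the full document.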
In addition to the previously installed langchain-unstructured interface package, since we are going to demonstrate how to call the Unstructured tool through an API here, we also need to obtain an Unstructured API Key and set the UNSTRUCTURED_API_KEY environment variable. The following code example shows how to use UnstructuredLoader to parse a PDF document:
from langchain_unstructured import UnstructuredLoader

file_path = "data/Shanxi Cultural Tourism/Yungang Grottoes-en.pdf"
loader = UnstructuredLoader(
    file_path=file_path,
    strategy="hi_res",       # high-resolution layout analysis
    partition_via_api=True,  # requires the UNSTRUCTURED_API_KEY environment variable
    coordinates=True,        # include element coordinates in the metadata
)
docs = []
for doc in loader.lazy_load():
    docs.append(doc)
The resulting Document objects contain not only the text content but also structural information.
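Since API-based partitioning depends on the UNSTRUCTURED_API_KEY environment variable, it can help to fail early with a clear message when the key is missing. The following sketch builds the same keyword arguments used above and checks for the key first. The build_loader_kwargs helper is a hypothetical convenience function, not part of langchain-unstructured, and the environment is passed in as a plain mapping so that the example runs without any external service.

```python
def build_loader_kwargs(file_path, env):
    """Return the loader keyword arguments, failing early if the API key is missing."""
    if not env.get("UNSTRUCTURED_API_KEY"):
        raise RuntimeError(
            "UNSTRUCTURED_API_KEY is not set; export it before partitioning via the API"
        )
    return {
        "file_path": file_path,
        "strategy": "hi_res",
        "partition_via_api": True,
        "coordinates": True,
    }


# With a key present, the arguments are returned unchanged
kwargs = build_loader_kwargs(
    "data/Shanxi Cultural Tourism/Yungang Grottoes-en.pdf",
    {"UNSTRUCTURED_API_KEY": "dummy-key"},
)
print(kwargs["partition_via_api"])

# Without a key, the helper raises before any network call is attempted
try:
    build_loader_kwargs("example.pdf", {})
except RuntimeError as err:
    missing_key_message = str(err)
    print(missing_key_message)
```

In a real script you would pass os.environ as the second argument and then create the loader with UnstructuredLoader(**kwargs).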
Tips from Lewis
strategy="hi_res" means “high-resolution mode.” It attempts to capture more details in the document, especially complex layouts, tables, coordinates, and similar content.
When using this strategy to process documents, it is suitable for PDF files with very complex formats, such as files containing images, tables, graphics, and multiple columns of text. This mode leverages more advanced technologies, such as OCR analysis and high-resolution image parsing.
If the fast mode is used instead (setting strategy="fast", which is the default mode), processing is faster and consumes fewer resources. It can complete basic parsing tasks, but its support for complex layouts is less detailed, making it more suitable for PDF documents with relatively simple formats.
Next, extract the document structure of each page using the following function:
def extract_basic_structure(docs):
    """Basic structure extraction: organize content by element category."""
    category_map = {
        'Title': 'title',
        'NarrativeText': 'text',
        'Image': 'image',
        'Table': 'table',
        'Footer': 'footer',
        'Header': 'header'
    }
    structure = {cat: [] for cat in category_map.values()}
    structure['metadata'] = []
    for doc in docs:
        category = doc.metadata.get('category', 'Unknown')
        content = {
            'content': doc.page_content,
            'page': doc.metadata.get('page_number'),
            'coordinates': doc.metadata.get('coordinates')
        }
        # Elements with an unmapped category are skipped
        target_category = category_map.get(category)
        if target_category:
            structure[target_category].append(content)
    return structure


structure = extract_basic_structure(docs)
Next, print the content of every element on page 2 whose category is Title:
print('Title on page 2:')
for title in [t for t in structure['title'] if t['page'] == 2]:
    print(f'- {title["content"]}')
If you observe this page of the PDF document, you can see the title “Deterioration and Conservation”. The document’s parsing of title-type elements is accurate in this instance. Of course, sometimes the Unstructured tool may also misclassify elements, categorizing items that do not appear to be titles as titles.
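One lightweight way to guard against such misclassification is a post-processing check on the extracted titles. The following sketch applies a simple heuristic: a plausible heading is short and does not end with sentence punctuation. The looks_like_title function, its 12-word limit, and the second sample entry (a made-up misclassified sentence) are assumptions for illustration; the rule should be tuned to your own documents.

```python
def looks_like_title(text, max_words=12):
    """Heuristic: headings are short and rarely end with sentence punctuation."""
    stripped = text.strip()
    if not stripped:
        return False
    if len(stripped.split()) > max_words:
        return False
    return not stripped.endswith((".", "!", "?", ",", ";", ":"))


# Entries shaped like the items in structure['title']
titles = [
    {"content": "Deterioration and Conservation", "page": 2},
    {"content": "The walls are divided into two stories.", "page": 3},  # hypothetical misclassification
]
confirmed = [t["content"] for t in titles if looks_like_title(t["content"])]
print(confirmed)
```

Entries that fail the check could be reclassified as narrative text instead of being discarded.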
The following code example can be used to display the layout of all elements on a page:
def analyze_layout(docs):
    """Analyze the document's layout structure."""
    layout_analysis = {}
    for doc in docs:
        page = doc.metadata.get('page_number')
        # Fall back to an empty dict when an element has no coordinates
        coords = doc.metadata.get('coordinates') or {}
        # Initialize page information
        if page not in layout_analysis:
            layout_analysis[page] = {
                'elements': [],
                'dimensions': {
                    'width': coords.get('layout_width', 0),
                    'height': coords.get('layout_height', 0)
                }
            }
        # Get element position information
        points = coords.get('points', [])
        if points:
            # Only need the top-left and bottom-right coordinate points
            (x1, y1), (_, _), (x2, y2), _ = points
            # Construct element information
            element = {
                'type': doc.metadata.get('category'),
                'content': (doc.page_content[:50] + '...') if len(doc.page_content) > 50 else doc.page_content,
                'position': {
                    'x1': x1, 'y1': y1,
                    'x2': x2, 'y2': y2,
                    'width': x2 - x1,
                    'height': y2 - y1
                }
            }
            layout_analysis[page]['elements'].append(element)
    return layout_analysis


# Call the function to analyze the document layout
layout = analyze_layout(docs)
Next, print the layout of page 1:
print("Page 1 layout analysis:")
if 1 in layout:
    page = layout[1]
    print(f"Page size: {page['dimensions']['width']} x {page['dimensions']['height']}")
    print("\nElement distribution:")
    # Sort elements from top to bottom by their y1 coordinate
    for elem in sorted(page['elements'], key=lambda x: x['position']['y1']):
        print(f"\nType: {elem['type']}")
        print(f"Position: ({elem['position']['x1']:.0f}, {elem['position']['y1']:.0f})")
        print(f"Size: {elem['position']['width']:.0f} x {elem['position']['height']:.0f}")
        print(f"Content: {elem['content']}")
The preceding code sorts and displays elements by vertical position:
Page 1 layout analysis:
Page size: 1700 x 2200
Element distribution:
Type: Header
Position: (827, 41)
Size: 304 x 30
Content: Yungang Grottoes - Wikipedia
Type: Image
Position: (98, 104)
Size: 427 x 142
Content: 4y WIKIPEDIA [ 1 The Free Encyclopedia WIKIPEDIA
Type: Title
Position: (1120, 411)
Size: 326 x 43
Content: Yungang Grottoes......
Document elements may have parent-child relationships (for example, a paragraph may belong to a section with a heading). You can determine whether an element belongs to the target section by first finding the heading through its category and content, and then comparing each element’s parent_id with the heading’s element_id. The following figure shows an example: the introduction to “Cave 6” on page 3.
Figure 1.20: Air pollutant studies at Yungang Grottoes and Cave 6’s art
To extract content under a specific heading, you can use the code example that follows:
cave6_docs = []
parent_id = -1
for doc in docs:
    # Remember the element_id of the "Cave 6" title element
    if doc.metadata.get("category") == "Title" and "Cave 6" in doc.page_content:
        parent_id = doc.metadata["element_id"]
    # Collect every element whose parent is that title
    if doc.metadata.get("parent_id") == parent_id:
        cave6_docs.append(doc)
for doc in cave6_docs:
    print(doc.page_content)
Here’s how the site description would appear in the output:
Cave 6 is one of the richest of the Yungang sites. It was constructed between 465 and 494 C.E. by The entire Emperor Xiao Wen. The cave's surface area is approximately 1,000 square meters. interior of the cave is carved and painted. There is a stupa pillar in the center of the room extending from the floor to the ceiling. The walls are divided into two stories. The walls of the upper stories are host to carvings of standing Buddhas, Bodhisattvas, and monks among other celestial figures. All of the carvings were painted, but because the caves have been repainted evidently up to twelve times, determining the original scheme is difficult.
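The same parent-child lookup can be generalized to every heading at once. The following sketch groups element texts under the text of their parent Title. It works on plain dictionaries that imitate the metadata fields described earlier; the group_by_parent function, the "text" key, and the sample element_id values are assumptions for illustration.

```python
from collections import defaultdict


def group_by_parent(elements):
    """Map each Title's text to the texts of the elements whose parent_id points at it."""
    titles = {
        element["element_id"]: element["text"]
        for element in elements
        if element["category"] == "Title"
    }
    children = defaultdict(list)
    for element in elements:
        parent = element.get("parent_id")
        if parent in titles:
            children[titles[parent]].append(element["text"])
    return dict(children)


# Illustrative elements with made-up identifiers
elements = [
    {"element_id": "t1", "category": "Title", "text": "Deterioration and Conservation"},
    {"element_id": "t2", "category": "Title", "text": "Cave 6"},
    {"element_id": "n1", "category": "NarrativeText", "parent_id": "t2",
     "text": "Cave 6 is one of the richest of the Yungang sites."},
    {"element_id": "n2", "category": "NarrativeText", "parent_id": "t2",
     "text": "The walls are divided into two stories."},
]
sections = group_by_parent(elements)
print(sections)
```

Note that two headings with identical text would be merged under one key in this sketch; keying the result by element_id avoids that if it matters for your documents.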
Visualizing layout with PyMuPDF and coordinate information
So far, we have obtained detailed element information with coordinates. Next, we can use this data to perform fine-grained layout analysis and visualization.
In this section, we will combine the PyMuPDF library and the coordinate information parsed by UnstructuredLoader to visualize PDF pages and label content regions (such as titles, images, and tables) for easier understanding of the PDF page’s layout structure or for processing information about specific sections.
Tips from Lewis
PyMuPDF is a library widely used for PDF document operations, supporting efficient reading, modification, and rendering of PDF files. PyMuPDF can open and read PDF documents, extract text and images from pages, and access page layout details (such as paragraph coordinates and image positions). At the same time, it also supports converting PDF pages into bitmap formats and allows operations such as scaling and rotation. In addition, PyMuPDF supports modifying PDF documents. You can add text, images, or graphical elements to a PDF and annotate existing documents, such as highlighting text or adding comments, so that PDF layout analysis results can be displayed in other applications.
The following code demonstrates how to use PyMuPDF to read a PDF page and convert it to an image, then use matplotlib to draw the PDF page and add rectangular boxes to mark section regions. Different box colors are set according to the section category (such as “Title”, “Image”, “Table”):
import fitz  # PyMuPDF
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from PIL import Image


def render_pdf_page(file_path, doc_list, page_number):
    pdf_doc = fitz.open(file_path)
    pdf_page = pdf_doc.load_page(page_number - 1)  # PyMuPDF pages are zero-based
    # Keep only the metadata of elements on the requested page
    segments = [doc.metadata for doc in doc_list if doc.metadata.get('page_number') == page_number]
    # Render the page to a bitmap and wrap it as a PIL image
    pix = pdf_page.get_pixmap()
    pil_image = Image.frombytes('RGB', [pix.width, pix.height], pix.samples)
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(pil_image)
    category_to_color = {'Title': 'orchid', 'Image': 'forestgreen', 'Table': 'tomato'}
    categories = set()
    for segment in segments:
        points = segment['coordinates']['points']
        layout_width = segment['coordinates']['layout_width']
        layout_height = segment['coordinates']['layout_height']
        category = segment.get('category', 'Other')
        color = category_to_color.get(category, 'gold')
        x0, y0 = points[0]  # top-left corner
        x1, y1 = points[2]  # bottom-right corner
        width = x1 - x0
        height = y1 - y0
        rect = patches.Rectangle((x0, y0), width, height, linewidth=2, edgecolor=color, facecolor='none')
        ax.add_patch(rect)
        ax.text(x0, y0 - 5, category, fontsize=10, color=color, weight='bold', backgroundcolor='white')
        categories.add(category)
    plt.axis('off')
    plt.show()
In the preceding code, the function render_pdf_page will open the given PDF file, render the specified page using PyMuPDF, and draw rectangles on regions specified by your element metadata. Each type of region (title, image, table, etc.) receives a different color for visual clarity. Make sure the coordinate information in your doc_list matches the coordinate system of the PDF rendering for accurate annotation.
The following fragment scales the layout coordinates to the rendered image and builds a matching legend. It belongs inside render_pdf_page, where it replaces the body of the for loop and the two final plt calls:
    for segment in segments:
        points = segment['coordinates']['points']
        layout_width = segment['coordinates']['layout_width']
        layout_height = segment['coordinates']['layout_height']
        # Scale from layout space to the pixel size of the rendered page
        scaled_points = [(x * pix.width / layout_width, y * pix.height / layout_height) for x, y in points]
        box_color = category_to_color.get(segment['category'], 'deepskyblue')
        categories.add(segment['category'])
        rect = patches.Polygon(scaled_points, linewidth=1, edgecolor=box_color, facecolor='none')
        ax.add_patch(rect)
    # Build the legend: uncategorized text is blue, other categories use their own colors
    legend_handles = [patches.Patch(color='deepskyblue', label='Text')]
    for category, color in category_to_color.items():
        if category in categories:
            legend_handles.append(patches.Patch(color=color, label=category))
    ax.axis('off')
    ax.legend(handles=legend_handles, loc='upper right')
    plt.tight_layout()
    plt.show()
Since the original paragraph coordinates are based on the layout ratio of the PDF page, they need to be scaled according to the actual pixel width and height of the page. After specifying the page number, the program will filter out the paragraphs that belong to that page from the document list and draw the annotation boxes for those paragraphs on the page.
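To make the scaling step concrete, here is a small worked sketch. It maps corner points from a 1700 x 2200 layout onto an image of 612 x 792 pixels; the image size is an assumed example value, not a figure taken from the run above, and scale_points is a hypothetical helper that performs the same arithmetic as the scaling expression in the fragment.

```python
def scale_points(points, layout_size, image_size):
    """Scale (x, y) points from layout space to the pixel size of a rendered image."""
    layout_width, layout_height = layout_size
    image_width, image_height = image_size
    return [
        (x * image_width / layout_width, y * image_height / layout_height)
        for x, y in points
    ]


# Corner points of an element in layout space (illustrative values)
header_box = [(827, 41), (827, 71), (1131, 71), (1131, 41)]
scaled = scale_points(header_box, (1700, 2200), (612, 792))
for x, y in scaled:
    print(f"({x:.2f}, {y:.2f})")
```

Here both axes shrink by the same factor (612 / 1700 = 792 / 2200 = 0.36), so the box keeps its shape; if the two factors differed, the overlay would be stretched accordingly.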
You can call the preceding function for displaying the layout using the following code:
render_pdf_page(file_path, docs, 1)
The annotated page is displayed:
Figure 1.21: A textbook page with photos, diagram, and text discussing the Yungang Grottoes in China
Alex: Oh, I see. According to the layout, we can organize similar types of information together. For example, we can pass all the elements in the green image layout group in the figure to the large model as a whole, in order to generate question and answer content related to the image (Yungang Grottoes). Without precise layout analysis, this would be quite difficult.
Lewis: Really smart.
Using UnstructuredLoader to parse tables in PDF pages
Next, let’s look at reading table information from PDF pages. The PDF document discussed previously did not contain any tables, so we will switch to a PDF document that includes tables. The 12th page of this file contains data about major cities in Shanxi Province (data sourced from Wikipedia).
First, we use the same method as before to visualize the layout elements of page 12 of the PDF document by calling render_pdf_page(file_path, docs, 12). The result is shown in the following figure.
Figure 1.22: A table displays urban populations of cities in Shanxi, China, for 2020 and 2010 plus city proper
The highlighted element in the table indicates that the entire table has been successfully parsed and that the element type is Table:
Figure 1.23: Table listing urban populations for cities in Shanxi, China, with 2020 and 2010 data
Next, we display the metadata of all the elements on page 12:
page_number = 12
page_docs = [doc for doc in docs if doc.metadata.get('page_number') == page_number]
for doc in page_docs:
    print('Metadata:')
    for key, value in doc.metadata.items():
        print(f'{key}: {value}')
A portion of the output is shown here:
Figure 1.24: Screenshot of metadata for three PDF elements showing file path IDs and categories
Although a lot of metadata information is output here, the key point is that we can see the category in the metadata contains Table, and this Table element has a parent_id. The parent_id links to the table’s title. This is very important because a table cannot exist independently from its associated title. The table element may contain only numbers, while the table title might indicate the meaning of these numbers.
For example, when comparing the GDP of two groups of cities in Shanxi Province, if one table is titled “2024 GDP of Each City” and another table is titled “2025 GDP of Each City”, you must link the elements within the table to those of their corresponding titles. This process is a necessary step in your RAG system. Otherwise, simply possessing the figures without knowing their corresponding years will lead to retrieval results that lack accuracy.
Integrating content under the same title using ParentID
If you need to integrate a table with the title text above it, you can achieve this by following these steps:
Filter by page_number: Filter out all elements on a specific page (such as page 12)
Classify by category: Identify elements of types Table and Title, and determine whether the Title is above the Table (by comparing their y coordinate values)
Integrate tables and titles: Combine the table with its nearest title into one structure and output the integrated information
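The three steps above can be sketched as follows. The code filters elements by page, finds the Title elements that start above each Table by comparing y coordinates (assuming y grows downward from the top of the page), and pairs each table with the nearest one. The elements are plain dictionaries imitating the metadata described earlier; the helper names, the "text" key, and all sample values are assumptions for illustration.

```python
def box(x0, y0, x1, y1):
    """Build a coordinates dict with four corner points (helper for the sample data)."""
    return {"points": [(x0, y0), (x0, y1), (x1, y1), (x1, y0)]}


def top_y(element):
    """Return the y coordinate of the element's top edge."""
    return min(y for _, y in element["coordinates"]["points"])


def nearest_title_above(table, elements):
    """Return the Title element that starts closest above the table, or None."""
    candidates = [
        element for element in elements
        if element["category"] == "Title" and top_y(element) < top_y(table)
    ]
    return max(candidates, key=top_y, default=None)


def pair_tables_with_titles(elements, page_number):
    """Step 1: filter by page. Steps 2 and 3: match each Table with its nearest Title."""
    page = [element for element in elements if element["page_number"] == page_number]
    pairs = []
    for element in page:
        if element["category"] == "Table":
            title = nearest_title_above(element, page)
            pairs.append({"title": title["text"] if title else None, "table": element["text"]})
    return pairs


# Made-up sample elements
elements = [
    {"category": "Title", "text": "Earlier heading", "page_number": 12, "coordinates": box(100, 200, 600, 240)},
    {"category": "Title", "text": "Heading directly above the table", "page_number": 12, "coordinates": box(100, 900, 900, 940)},
    {"category": "Table", "text": "Cities 2020 Urban area ...", "page_number": 12, "coordinates": box(100, 1000, 1600, 1800)},
    {"category": "Title", "text": "Heading on another page", "page_number": 13, "coordinates": box(100, 100, 600, 140)},
]
pairs = pair_tables_with_titles(elements, 12)
print(pairs)
```

This purely geometric approach can pick the wrong heading in multi-column layouts, where the closest heading above may belong to a different column.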
Alex: However, this logic seems a bit complicated to implement.
Lewis: Indeed. Since the Unstructured tool automatically saves parent-child relationships, a more direct approach is to locate elements with the category “Table”. For each table, find its parent element corresponding to the parent_id, and then output the combined content of the table and its parent element.
Next, use the following function to locate each table element together with its parent element and output them as a whole:
def find_tables_and_titles(docs):
    results = []
    for doc in docs:
        if doc.metadata.get('category') == 'Table':
            table = doc
            parent_id = doc.metadata.get('parent_id')
            # Look up the element whose element_id matches the table's parent_id
            title = next((d for d in docs if d.metadata.get('element_id') == parent_id), None)
            if title:
                results.append({'table': table.page_content, 'title': title.page_content})
    return results


results = find_tables_and_titles(page_docs)
if results:
    for result in results:
        print('Found table and title:')
        print(f"Title: {result['title']}")
        print(f"Table: {result['table']}")
else:
    print('No tables and titles found')
The output gives a list of urban areas:
|  | Cities | 2020 Urban area | 2010 Urban area | 2020 City proper |
|---|---|---|---|---|
| 1 | Taiyuan | 4,071,075 | 3,154,157 | 5,304,061 |
| 2 | Datong | 1,792,696 | 1,362,314 | 3,105,591 |
| 3 | Changzhi | 1,168,042 | 653,125 | 3,180,884 |
| 4 | Jinzhong | 900,569 | 444,002 | 3,379,498 |
| 5 | Linfen | 696,393 | 571,237 | 3,976,481 |
| 6 | Yuncheng | 692,003 | 432,554 | 4,774,508 |
| 7 | Yangquan | 647,272 | 623,671 | 1,318,505 |
| 8 | Jincheng | 574,665 | 476,945 | 2,194,545 |
| 9 | Shuozhou | 420,829 | 381,566 | 1,593,444 |
| 10 | Xinzhou | 384,424 | 279,875 | 2,689,668 |
| 11 | Xiaoyi | 337,489 | 268,253 | see Lüliang |
| 12 | Lüliang | 335,285 | 250,080 | 3,398,431 |
| 13 | Jiexiu | 291,393 | 232,269 | see Jinzhong |
| 14 | Huairen | 247,612 |  | see Shuozhou |
| 15 | Gaoping | 243,544 | 213,460 | see Jincheng |
| 16 | Yuanping | 227,046 | 202,562 | see Xinzhou |
| 17 | Hejin | 225,809 | 175,824 | see Yuncheng |
| 18 | Fenyang | 207,473 | 149,222 | see Lüliang |
| 19 | Huozhou | 183,575 | 156,853 | see Linfen |
| 20 | Yongji | 182,248 | 179,028 | see Yuncheng |
| 21 | Houma | 175,373 | 137,020 | see Linfen |
| 22 | Gujiao | 159,593 | 146,161 | see Taiyuan |
This way, we have successfully associated the data in the table with its title. With the title information, we can retrieve relevant tables based on the user’s question.
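The table content above arrives as flat text. If your parser also provides an HTML rendering of the table (some Unstructured configurations attach one under a metadata key such as text_as_html; treat that key name as an assumption to verify against your version), the rows can be recovered with the standard library alone. The following sketch parses a small hand-written HTML table into records; the TableRows class and the sample HTML are illustrative.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell texts of every <tr> in an HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hand-written sample; with real data this string would come from the element metadata
html = (
    "<table>"
    "<tr><th>Cities</th><th>2020 Urban area</th></tr>"
    "<tr><td>Taiyuan</td><td>4,071,075</td></tr>"
    "<tr><td>Datong</td><td>1,792,696</td></tr>"
    "</table>"
)
parser = TableRows()
parser.feed(html)
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

Structured records like these can be combined with the table’s title to build a chunk that keeps both the figures and their meaning.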
Let’s look at another example of a parent-child relationship combination. In the following screenshot, you can see that there are a total of four child elements under the Title element External links. If you need to combine these four child elements with their parent element (i.e., the title) and output them as an integrated chunk, you can do so through the parent-child relationship.
The following code example demonstrates how to integrate these related pieces of information:
external_docs = []
parent_id = -1
for doc in docs:
    # Keep the "External links" title itself and remember its element_id
    if doc.metadata.get('category') == 'Title' and 'External links' in doc.page_content:
        parent_id = doc.metadata['element_id']
        external_docs.append(doc)
    # Add every element whose parent is that title
    if doc.metadata.get('parent_id') == parent_id:
        external_docs.append(doc)
for doc in external_docs:
    print(doc.page_content)
With that we complete our discussion on parsing, loading, and generating PDF files.