You're reading from Python for Secret Agents - Volume II - Second Edition

Product type: Book

Published in: Dec 2015

Reading level: Intermediate

ISBN-13: 9781785283406

Edition: 2nd Edition

Languages: Python

Concepts: Cybersecurity

Authors (2): Steven F. Lott

Chapter 4. Dredging up History

Parse PDF files to locate data that's otherwise nearly inaccessible. The web is full of PDF files, many of which contain valuable intelligence. The problem is extracting this intelligence in a form where we can analyze it. Some PDF text can be extracted with sophisticated parsers. At other times, we have to resort to Optical Character Recognition (OCR) because the PDF is actually an image created with a scanner. How can we leverage information that's buried in PDFs?

In some cases, we can use a save as text option to try to expose the PDF content. We then replace a PDF parsing problem with a plain-text parsing problem. While PDFs can seem dauntingly complex, the presence of exact page coordinates can actually simplify our efforts at gleaning information.

Also, some PDFs have fill-in-the-blanks features. If we have one of these, we'll be parsing the annotations within the PDF. This is similar to parsing the text of the PDF.

The most important consideration here...

Background briefing – Portable Document Format

The PDF file format dates from 1991. Here's a quote from Adobe's website about the format: "it can capture documents from any application, send electronic versions of these documents anywhere, and view and print these documents on any machines." The emphasis is clearly on view and print. What about analysis?

There's an ISO standard that applies to PDF documents, assuring us that no single vendor has a lock on the technology. The standard has a focus on specific technical design, user interface or implementation or operational details of rendering. The presence of a standard doesn't make the document file any more readable or useful as a long-term information archive.

What's the big problem?

The Wikipedia page summarizes three technologies that are part of a PDF document:

A subset of the PostScript page description programming language, for generating the page layout and graphics
Font management within the document
A document storage structure, including...

Extracting PDF content

In Chapter 1, New Missions – New Tools, we installed PDF Miner 3K to parse PDF files. It's time to see how this tool works. Here's the link to the documentation for this package: http://www.unixuser.org/~euske/python/pdfminer/index.html. This link is not obvious from the PyPI page, or from the Bitbucket site that contains the software. An agent who scans docs/index.html will see this reference.

In order to see how we use this package, visit http://www.unixuser.org/~euske/python/pdfminer/programming.html. This has an important diagram that shows how the various classes interact to represent the complex internal details of a PDF document. For some helpful insight, visit http://denis.papathanasiou.org/2010/08/04/extracting-text-images-from-pdf-files/.

A PDF document is a sequence of physical pages. Each page has boxes of text (in addition to images and line graphics). Each textbox contains lines of text and each line contains the individual characters. Each of these...
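
The page, textbox, line, and character nesting can be pictured with a small stand-in model. This is only a sketch: the Page, TextBox, and TextLine classes below are hypothetical and are not the PDF Miner classes. They simply mirror the hierarchy so that we can see what walking it looks like.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class TextLine:
    """Hypothetical stand-in: one line of text, held as a string of characters."""
    chars: str


@dataclass
class TextBox:
    """Hypothetical stand-in: a box that groups lines of text."""
    lines: List[TextLine] = field(default_factory=list)


@dataclass
class Page:
    """Hypothetical stand-in: a physical page that holds text boxes."""
    boxes: List[TextBox] = field(default_factory=list)


def iter_chars(pages: List[Page]) -> Iterator[str]:
    """Walk pages, then boxes, then lines, then characters, in document order."""
    for page in pages:
        for box in page.boxes:
            for line in box.lines:
                yield from line.chars


# Made-up sample data: two pages, one text box each.
document = [
    Page(boxes=[TextBox(lines=[TextLine("TOP"), TextLine("SECRET")])]),
    Page(boxes=[TextBox(lines=[TextLine("EYES ONLY")])]),
]
text = "".join(iter_chars(document))
```

Each level is just a container for the level below it, so a set of nested loops is enough to reach the individual characters.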

Getting text data from a document

We'll need to add some more features to our class definition so that we can extract meaningful, aggregated blocks of text. We'll need to add some layout rules and a text aggregator that uses the rules and the raw page to create aggregated blocks of text.

We'll override the init_device() method to create a more sophisticated device. Here's the next subclass, built on the foundation of the Miner_Page and Miner classes:

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
# Miner_Page is the base class this subclass builds on.
class Miner_Layout(Miner_Page):
    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
    def init_device(self, resource_manager, **params):
        """Return a PDFPageAggregator as a device."""
        # Collect the layout rules, then hand them to the aggregator.
        self.layout_params = LAParams(**params)
        return PDFPageAggregator(resource_manager, laparams=self.layout_params)
    def page_iter(self):
        """Yield an LTPage object for each page in the document."""
        for page in super...

Understanding tables and complex layouts

In order to work successfully with PDF documents, we need to process some parts of the page geometry. For some kinds of running text, we don't need to worry about where the text appears on the page. But for tabular layouts, we're forced to understand the gridded nature of the display. We're also forced to grapple with the amazing subtlety of how the human eye can take a jumble of letters on a page and resolve them into meaningful rows and columns.

It doesn't matter now, but as we move forward it will become necessary to understand two pieces of PDF trivia. First, coordinates are in points, which are about 1/72 of an inch. Second, the origin, (0,0), is the lower-left corner of the page. As we read down the page, the y coordinate decreases toward zero.
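
These two pieces of trivia can be made concrete with a little arithmetic. The following is a minimal sketch; the function names and the US Letter page size are assumptions chosen for illustration, not something taken from the PDF Miner package.

```python
POINTS_PER_INCH = 72  # PDF coordinates are in points, about 1/72 of an inch


def points_to_inches(points: float) -> float:
    """Convert a PDF measurement in points to inches."""
    return points / POINTS_PER_INCH


def distance_from_top(y: float, page_height: float) -> float:
    """Convert a y coordinate (origin at lower-left) to a distance below the top edge."""
    return page_height - y


# Assumed page size for illustration: US Letter, 8.5 x 11 inches = 612 x 792 points.
PAGE_WIDTH, PAGE_HEIGHT = 612, 792

# A point one inch in from the left edge and one inch down from the top edge.
x, y = 72, 720
inches_down = points_to_inches(distance_from_top(y, PAGE_HEIGHT))
```

Because y shrinks as we read down the page, text near the top of the page has the largest y values. Any top-to-bottom ordering has to account for this.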

A PDF page will be a sequence of various types of layout objects. We're only interested in the various subclasses of LTText.

The first thing we'll need is a kind of filter that will step through an iterable...
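
One way to picture such a filter is a generator function that yields only the items that are instances of a wanted class. The sketch below uses hypothetical stand-in classes (FakeText, FakeTextBox, and FakeImage) in place of the PDF Miner layout classes, so every name and attribute here is an assumption.

```python
from typing import Iterable, Iterator


class FakeText:
    """Hypothetical stand-in for a text-bearing layout object."""

    def __init__(self, text: str) -> None:
        self.text = text


class FakeTextBox(FakeText):
    """Hypothetical subclass, to show that subclasses pass the filter too."""


class FakeImage:
    """Hypothetical stand-in for a non-text layout object."""


def text_only(layout: Iterable[object]) -> Iterator[FakeText]:
    """Yield only the items that are FakeText or one of its subclasses."""
    for item in layout:
        if isinstance(item, FakeText):
            yield item


# A made-up page that mixes text and non-text objects.
page = [FakeText("Agent"), FakeImage(), FakeTextBox("Mission"), FakeImage()]
found = [item.text for item in text_only(page)]

# The same filter written as a generator expression.
found_expr = [item.text for item in (i for i in page if isinstance(i, FakeText))]
```

The isinstance() test accepts subclasses as well as the named class, which is what makes a single check sufficient for a whole family of text objects.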

Summary

We saw how we can tease meaningful information out of a PDF document. We assembled a core set of tools to extract outlines from documents, summarize the pages of a document, and pull the text from each page. We also discussed how we can analyze a table or other complex layout to reassemble meaningful information from it.

We used a very clever Python design pattern called wrap-sort-unwrap to decorate text blocks with coordinate information, and then sort them into the useful top-to-bottom and left-to-right positions. Once we had the text properly organized, we could unwrap the meaningful data and produce useful output.
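
As a reminder of the shape of wrap-sort-unwrap, here is a minimal sketch. The sample blocks and their coordinates are made up for illustration, and the tuple layout is an assumption rather than the data structure used in the chapter.

```python
from typing import List, Tuple

# Hypothetical text blocks as (x, y, text); y grows upward from the page bottom.
blocks: List[Tuple[float, float, str]] = [
    (300.0, 700.0, "Code name"),
    (72.0, 650.0, "Bond"),
    (72.0, 700.0, "Agent"),
    (300.0, 650.0, "007"),
]

# Wrap: pair each block with a sort key. Negating y puts higher text first.
wrapped = [((-y, x), (x, y, text)) for x, y, text in blocks]

# Sort: keys compare as tuples, so rows go top to bottom and ties go left to right.
wrapped.sort(key=lambda pair: pair[0])

# Unwrap: discard the keys and keep only the text.
ordered = [block[2] for _key, block in wrapped]
```

The key is built once per block and thrown away at the end, so the data we care about never has to know how it was sorted.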

We also discussed two other important Python design patterns: the context manager and the filter. We used object-oriented design techniques to create a hierarchy of context managers that simplify our scripts to extract data from files. The filter concept has three separate implementations: as a generator expression, as a generator function, and using the...
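
A hierarchy of context managers can be sketched in a few lines. The Source and LineSource classes below are hypothetical and are not the chapter's classes. An in-memory io.StringIO stands in for a real file so that the example is self-contained.

```python
import io


class Source:
    """Hypothetical base context manager: opens a text source and closes it on exit."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.file = None

    def __enter__(self) -> "Source":
        # io.StringIO stands in for opening a real file.
        self.file = io.StringIO(self.text)
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        self.file.close()
        return False  # Do not suppress exceptions raised inside the with block.


class LineSource(Source):
    """Hypothetical subclass: reuses the open and close logic, adds line iteration."""

    def lines(self):
        for line in self.file:
            yield line.rstrip("\n")


with LineSource("alpha\nbravo\n") as source:
    result = list(source.lines())

# The with statement has already closed the underlying file at this point.
closed_after = source.file.closed
```

The subclass inherits __enter__() and __exit__() unchanged, so each new layer only adds the feature it is responsible for.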

Authors (2)

Steven F. Lott

Steven Lott has been programming since computers were large, expensive, and rare. Working for decades in high tech has given him exposure to a lot of ideas and techniques, some bad, but most of them helpful to others. Since the 1990s, Steven has been engaged with Python, crafting an array of indispensable tools and applications. His profound expertise has led him to contribute significantly to Packt Publishing, penning notable titles like "Mastering Object-Oriented Python," "The Modern Python Cookbook," and "Functional Python Programming." A self-proclaimed technomad, Steven's unconventional lifestyle sees him residing on a boat, often anchored along the vibrant east coast of the US. He tries to live by the words “Don't come home until you have a story.”