Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Python for Secret Agents - Volume II - Second Edition

You're reading from  Python for Secret Agents - Volume II - Second Edition

Product type Book
Published in Dec 2015
Publisher
ISBN-13 9781785283406
Pages 180 pages
Edition 2nd Edition
Languages
Authors (2):
Steven F. Lott Steven F. Lott
Profile icon Steven F. Lott
Steven F. Lott Steven F. Lott
Profile icon Steven F. Lott
View More author details

Chapter 4. Dredging up History

Parse PDF files to locate data that's otherwise nearly inaccessible. The web is full of PDF files, many of which contain valuable intelligence. The problem is extracting this intelligence in a form where we can analyze it. Some PDF text can be extracted with sophisticated parsers. At other times, we have to resort to Optical Character Recognition (OCR) because the PDF is actually an image created with a scanner. How can we leverage information that's buried in PDFs?

In some cases, we can use a save as text option to try and expose the PDF content. We then replace a PDF parsing problem with a plain-text parsing problem. While PDFs can seem dauntingly complex, the presence of exact page coordinates can actually simplify our efforts at gleaning information.

Also, some PDFs have fill-in-the-blanks features. If we have one of these, we'll be parsing the annotations within the PDF. This is similar to parsing the text of the PDF.

The most important consideration here...

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €14.99/month. Cancel anytime}