You're reading from Natural Language Processing with Java Techniques for building machine learning and neural network models for NLP

Product type Paperback

Published in Jul 2018

ISBN-13 9781788993494

Length 318 pages

Edition 2nd Edition

Languages

Java

Tools

Processing

Concepts

Machine Learning

Author (1):

Richard M. Reese

Using PDFBox to extract text from PDF documents

The Apache PDFBox (http://pdfbox.apache.org/) project is an API for processing PDF documents. It supports the extraction of text and other tasks, such as document merging, form filling, and PDF creation. We will only illustrate the text extraction process.

To demonstrate the use of PDFBox, we will use a file called TestDocument.pdf. This file was created by saving the TestDocument.docx file, used in the Using POI to extract text from Word documents section, as a PDF document. The process is straightforward. A File object is created for the PDF document. The PDDocument class represents the document, and the PDFTextStripper class performs the actual text extraction using its getText method, as shown here:

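import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper; // package name as in PDFBox 2.x

// getResourcePath() is assumed to return the path to TestDocument.pdf
File file = new File(getResourcePath());
try (PDDocument pd = PDDocument.load(file)) { // closes the document when done
    String text = new PDFTextStripper().getText(pd);
    System.out.println(text);
}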

The output is as follows:

Jump to navigation Jump to search...
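
When only part of a document is needed, the same PDDocument and PDFTextStripper classes can be limited to a range of pages. The following is a minimal sketch rather than code from the book: the class name PdfPageText and the method extractPages are hypothetical names chosen for illustration, and it assumes the PDFBox 2.x API, where PDDocument.load(File) is available (PDFBox 3.x loads documents through Loader.loadPDF instead).

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

/** Hypothetical helper, not from the book: extracts text from a page range. */
final class PdfPageText {

    private PdfPageText() { }

    /**
     * Returns the text of pages firstPage..lastPage (1-based, inclusive).
     * Assumes the PDFBox 2.x API, where PDDocument.load(File) exists.
     */
    static String extractPages(File pdf, int firstPage, int lastPage)
            throws IOException {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(pdf)) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(firstPage); // first page to include
            stripper.setEndPage(lastPage);    // last page to include
            return stripper.getText(document);
        }
    }
}
```

For example, PdfPageText.extractPages(new File("TestDocument.pdf"), 1, 1) would return the text of the first page only, assuming that file is in the working directory.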
