Using PDFBox to extract text from PDF documents
The Apache PDFBox (http://pdfbox.apache.org/) project provides an API for processing PDF documents. It supports text extraction as well as other tasks, such as document merging, form filling, and PDF creation. We will only illustrate the text extraction process.
To demonstrate the use of PDFBox, we will use a file called TestDocument.pdf. This file was created by saving the TestDocument.docx file as a PDF document, as shown in the Using POI to extract text from Word documents section.
The process is straightforward. A File object is created for the PDF document. The PDDocument class represents the document, which is opened with its static load method, and the PDFTextStripper class performs the actual text extraction using the getText method, as shown here:
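import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
File file = new File(getResourcePath());
try (PDDocument pd = PDDocument.load(file)) { // closes the document when done
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(pd);
    System.out.println(text);
}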
The output is as follows:
Jump to navigation Jump to search...
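The listing above calls getResourcePath(), which is not defined in this section. The following is a minimal sketch of what such a helper might look like, assuming that TestDocument.pdf sits in the current working directory. The PdfPaths class, the requirePdf method, and the checks they perform are hypothetical names introduced for illustration; they are not part of PDFBox, and the actual helper may differ.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper class; not part of PDFBox or of the original listing.
final class PdfPaths {
    private PdfPaths() {}

    // Assumption: the test document lives in the current working directory.
    static String getResourcePath() {
        return Paths.get("TestDocument.pdf").toAbsolutePath().toString();
    }

    // Returns a File for the path, failing early with a clear message
    // instead of letting the PDF loader fail later on a bad path.
    static File requirePdf(String path) {
        if (!path.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            throw new IllegalArgumentException("Not a .pdf path: " + path);
        }
        Path p = Paths.get(path);
        if (!Files.isRegularFile(p) || !Files.isReadable(p)) {
            throw new IllegalArgumentException("Missing or unreadable file: " + path);
        }
        return p.toFile();
    }
}
```

With a helper along these lines, the File passed to PDDocument.load could be obtained with PdfPaths.requirePdf(PdfPaths.getResourcePath()), so that a wrong path is reported before any PDF parsing starts.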