XpdfText

The XpdfText® library/component extracts plain text from PDF files. The PDF file can be on disk or in memory, and likewise, the text can be extracted to memory or directly to disk.

XpdfText can be used in different ways:

Convert entire PDF files or individual pages to plain text:
- maintaining layout, or
- converting to "reading order"

Extract text from a specified rectangle on a page:
- useful for extracting text from forms

Convert pages into word lists. For each word, you can retrieve:
- font name and font size
- text color
- word position on the page
- character offset (for highlight files)
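
To give a feel for what a word list makes possible, here is a minimal Python sketch. It does not use the XpdfText API: the `Word` record, its field names, the `words_in_rect` helper, and the sample coordinates are all hypothetical stand-ins for the per-word attributes listed above. It shows how word positions can be used to pull the text inside a rectangle, as you might for a form field.

```python
from dataclasses import dataclass


# Hypothetical record: field names are illustrative, not XpdfText's API.
@dataclass
class Word:
    text: str
    font_name: str
    font_size: float
    color: tuple        # (r, g, b), each component in 0..1
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    char_offset: int


def words_in_rect(words, x_min, y_min, x_max, y_max):
    """Return the words whose bounding box lies entirely inside the rectangle."""
    return [
        w for w in words
        if w.x_min >= x_min and w.x_max <= x_max
        and w.y_min >= y_min and w.y_max <= y_max
    ]


# Made-up sample data standing in for one page's word list.
sample_words = [
    Word("Name:", "Helvetica", 10.0, (0, 0, 0), 72, 700, 104, 712, 0),
    Word("Alice", "Courier", 10.0, (0, 0, 1), 110, 700, 140, 712, 6),
    Word("Footer", "Helvetica", 8.0, (0, 0, 0), 72, 40, 100, 50, 12),
]

# Collect the text that falls inside a hypothetical form-field rectangle.
field_words = words_in_rect(sample_words, 100, 690, 300, 720)
field_text = " ".join(w.text for w in field_words)
```

Requiring the whole bounding box to fall inside the rectangle is one possible design choice; a looser test (any overlap) may suit forms whose text spills past the field border.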

The extracted text can be converted to any of a wide range of standard encodings, including UTF-8 Unicode, ISO-8859-1 (Latin-1), 7-bit ASCII, and various language-specific encodings.
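
As a rough illustration of what the choice of output encoding means for extracted text, the Python sketch below encodes the same string three ways using only the standard library. It does not call XpdfText. The sample string and the `encode_text` helper are assumptions made for illustration, as is the choice to substitute "?" for characters the target encoding cannot represent.

```python
def encode_text(text, encoding):
    """Encode text, substituting '?' for characters the encoding cannot represent."""
    return text.encode(encoding, errors="replace")


sample = "café"

utf8_bytes = encode_text(sample, "utf-8")          # 'é' takes two bytes
latin1_bytes = encode_text(sample, "iso-8859-1")   # 'é' takes one byte
ascii_bytes = encode_text(sample, "ascii")         # 'é' has no 7-bit ASCII form
```

The same four characters occupy five bytes in UTF-8 and four in Latin-1, while the 7-bit ASCII version loses the accented letter. That loss is why UTF-8 is usually the safer default when the content of a PDF is not known in advance.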

The XpdfText library also includes all of the functionality of XpdfInfo.

XpdfText is easy to use:

    pdf = new XpdfText.XpdfText
    pdf.loadFile("input.pdf")
    ' convert to a text file on disk...
    pdf.convertToTextFile(1, 5, "output.txt")
    ' ... or convert in memory
    s = pdf.convertToTextString(1, 5)

Supported platforms:

Windows: DLL
Windows: COM component - usable from .NET, Visual Basic, Delphi, etc.
Mac OS X: shared library
Linux: shared library
32-bit and 64-bit versions available for all platforms
Other platforms: portable C++ source code for the library is available

See also: For content extraction to XML (instead of plain text), try our PDFdeconstruct tool.

Contact Glyph & Cog for more information, including pricing, documentation, and evaluation copies.
