Using PDFdeconstruct

Synopsis

Basic usage looks like:

pdfdeconstruct [options] PDF-file output-dir

For example:

pdfdeconstruct test.pdf testout

creates a directory called "testout" containing a "doc.xml" file, along with any extracted fonts and images.

The options described below can be used to modify the output.

The output format is described in detail in PDFdeconstruct Output.
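The synopsis above can also be driven from a script. The following is a minimal sketch in Python, not part of PDFdeconstruct itself: the helper names build_command and run are hypothetical, and the sketch assumes the pdfdeconstruct executable is on the PATH. Only the command name and the argument order are taken from the synopsis.

```python
import subprocess


def build_command(pdf_file, output_dir, options=()):
    """Return the argument list for: pdfdeconstruct [options] PDF-file output-dir."""
    return ["pdfdeconstruct", *options, pdf_file, output_dir]


def run(pdf_file, output_dir, options=()):
    """Run the command; assumes pdfdeconstruct is installed and on the PATH."""
    # check=True makes subprocess raise CalledProcessError on a non-zero exit status.
    subprocess.run(build_command(pdf_file, output_dir, options), check=True)
```

For example, build_command("test.pdf", "testout") yields the same arguments as the command line shown above.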

Options

The following command line options are available:

-output type,type,...

Select the types of output to be included in the generated XML. The argument is a comma-separated list (e.g., "-output outline,textext,vector"), including any of the following:

outline: include the document outline (aka bookmarks)
textext: include the extracted text (column/paragraph/line/word)
chars: include per-character information, in addition to per-word information, in the extracted text (ignored if textext is not included)
form: include the formfield elements
text: include the text drawing operations (textop elements)
image: include the image drawing operations (image and imagemask elements)
vector: include the vector drawing operations (fill and stroke elements)
struct: include the structure tree

The -output option overrides some older options: -outline, -chars, and -textops.
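A script that assembles the -output argument may want to check the type names first. The sketch below is one way to do that in Python; the helper name output_option is hypothetical. Note one deliberate difference from the tool: the documentation says "chars" without "textext" is ignored, whereas this sketch chooses to reject that combination so the mistake is visible.

```python
# Type names as listed for the -output option.
OUTPUT_TYPES = ("outline", "textext", "chars", "form", "text", "image", "vector", "struct")


def output_option(types):
    """Return ["-output", "a,b,..."] after checking each requested type name."""
    unknown = [t for t in types if t not in OUTPUT_TYPES]
    if unknown:
        raise ValueError("unknown output type(s): " + ", ".join(unknown))
    # The tool itself ignores this combination; raising is this sketch's choice.
    if "chars" in types and "textext" not in types:
        raise ValueError("chars has no effect unless textext is also requested")
    return ["-output", ",".join(types)]
```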

-outline

Include the document outline (aka bookmarks) in the XML output. (-output overrides this option.)

-chars

Output per-character information, in addition to per-word information. The default (without the -chars option) is to generate per-word information only. See PDFdeconstruct Output for details. (-output overrides this option.)

-textops

Output text drawing operations (textop elements), in addition to extracted text (column elements, etc.). See Drawing Operators for details. (-output overrides this option.)

-sepres

Output the info and resources elements in one XML file (doc.xml), and page elements in a second XML file (pages.xml). The default is to output everything (page, info, and resources elements) in doc.xml. See PDFdeconstruct Output for details.

-seppages

Output the info and resources elements in one XML file (doc.xml), and each page element in a separate XML file (pageNNNNNN.xml).
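A consumer of -seppages output needs to find the per-page files. The Python sketch below collects them from an output directory. It assumes that "NNNNNN" in the file name stands for exactly six decimal digits giving the page number; that reading of the pattern, and the helper name page_files, are assumptions rather than documented behavior.

```python
import re
from pathlib import Path

# Assumed reading of "pageNNNNNN.xml": six decimal digits.
PAGE_FILE = re.compile(r"page(\d{6})\.xml")


def page_files(output_dir):
    """Return (page_number, path) pairs for per-page XML files, sorted by page number."""
    found = []
    for path in Path(output_dir).iterdir():
        match = PAGE_FILE.fullmatch(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found)
```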

-unit unit:places

Set the unit and number of decimal places for position and size output. The unit can be any of:

"pt" (PostScript point = 1/72 inch)
"inch"
"mil" (mil = 0.001 inch)
"mm"

For example, "-unit inch:3" generates position output in the form "1.234" with a unit of inches. The default setting is "-unit pt:2".
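The relationship between the units can be made concrete with a short Python sketch that converts a length in points and formats it the way a "unit:places" setting describes. The function names are hypothetical, and the tool's own rounding of the last digit may differ from Python's.

```python
# Conversion factors from PostScript points (1 pt = 1/72 inch).
UNITS_PER_POINT = {
    "pt": 1.0,
    "inch": 1.0 / 72.0,
    "mil": 1000.0 / 72.0,  # 1 mil = 0.001 inch
    "mm": 25.4 / 72.0,     # 1 inch = 25.4 mm
}


def parse_unit_option(value):
    """Split a "unit:places" argument such as "inch:3" into ("inch", 3)."""
    unit, places = value.split(":")
    if unit not in UNITS_PER_POINT:
        raise ValueError("unknown unit: " + unit)
    return unit, int(places)


def format_length(points, value="pt:2"):
    """Format a length given in points according to a "unit:places" setting."""
    unit, places = parse_unit_option(value)
    return f"{points * UNITS_PER_POINT[unit]:.{places}f}"
```

For example, format_length(88.848, "inch:3") gives "1.234", since 88.848 pt / 72 = 1.234 inch.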

-imagefmt format

Set the image file format to one of: "PNG", "TIFF", or "JPEG". All images will be converted to the specified format. The default is PNG.

-keepjpeg

Output JPEG image streams as JPEG files. "DCTDecode" images in a PDF file are in standard JPEG format. With this flag, all DCTDecode images will be copied directly to JPEG files in the output directory without decoding and re-encoding, regardless of the -imagefmt setting. (There is one exception: CMYK DCTDecode streams are always re-encoded, because many JPEG readers don't properly handle CMYK JPEG files.)
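The rule described for -keepjpeg can be summarized as a small predicate. This Python sketch only restates the documented conditions; the function and parameter names are hypothetical, and it says nothing about how the tool detects the filter or color space internally.

```python
def copied_as_jpeg(stream_filter, is_cmyk, keepjpeg):
    """Return True if an image stream would be copied to a JPEG file without re-encoding."""
    # Only DCTDecode streams qualify, only when -keepjpeg is given,
    # and CMYK DCTDecode streams are always re-encoded.
    return keepjpeg and stream_filter == "DCTDecode" and not is_cmyk
```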

-nofields

Do not include form field values in the extracted text. Field values will still be included in formfield elements. This is useful if the XML consumer will be drawing form fields based on the formfield elements.

-nopatterns

With this option, tiling patterns will be rendered as a single fill or stroke operation, with <color type="pattern"/>. Without this option, i.e., by default, tiling patterns are reduced to the tile content, repeated for each cell.

-cleantt

Rewrite TrueType fonts to clean up certain errors. This will correct some known problems occasionally found in embedded TrueType fonts in PDF files.

-table

Extract the text in table mode, instead of reading order mode. This will generally split the text into smaller pieces.

-discardInvisible

Discard all "invisible" text, i.e., text drawn in the invisible rendering mode (which is most often used for OCR text).

-discardClipped

Discard all clipped text, i.e., text that is drawn outside of the clipping region and is therefore not visible.

-f page-number

Set the first page to convert. The -f and -l options can be used to select a range of pages smaller than the whole PDF file. The default is to convert the whole file ("-f 1 -l N").

-l page-number

Set the last page to convert. See the description of -f above.
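When the page range comes from user input, a script can build the -f and -l arguments together. The Python sketch below is one possible helper; its name is hypothetical, and the range checks are this sketch's own, since the tool's behavior for out-of-range values is not described here.

```python
def page_range_options(first=None, last=None):
    """Return the -f/-l arguments for a page range; omit a bound to keep its default."""
    if first is not None and first < 1:
        raise ValueError("first page must be 1 or greater")
    if first is not None and last is not None and last < first:
        raise ValueError("last page must not precede first page")
    args = []
    if first is not None:
        args += ["-f", str(first)]
    if last is not None:
        args += ["-l", str(last)]
    return args
```

With no arguments the helper returns an empty list, which leaves the default of converting the whole file.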

-pw password

Set the password for an encrypted PDF file. This can be either the owner password or the user password. If the input PDF file is encrypted ("protected") with copying of text and graphics disabled, PDFdeconstruct will not run without the owner password.