Using PDFdeconstruct

Synopsis

Basic usage looks like:
pdfdeconstruct [options] PDF-file output-dir
For example:
pdfdeconstruct test.pdf testout
will create a directory called "testout" containing a "doc.xml" file, along with any extracted fonts and images.

The options described below can be used to modify the output.

The output format is described in detail in PDFdeconstruct Output.
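
When the tool is driven from a script, the synopsis maps directly onto an argument list. The following is a minimal Python sketch, assuming the pdfdeconstruct executable is on PATH; the helper names build_command and run_pdfdeconstruct are hypothetical and not part of PDFdeconstruct.

```python
import subprocess
from pathlib import Path


def build_command(pdf_file, output_dir, options=()):
    """Return the argv list: pdfdeconstruct [options] PDF-file output-dir."""
    return ["pdfdeconstruct", *options, str(pdf_file), str(output_dir)]


def run_pdfdeconstruct(pdf_file, output_dir, options=()):
    """Run the tool and return the expected path of doc.xml.

    Assumption: the pdfdeconstruct executable is on PATH. check=True makes
    Python raise CalledProcessError if the process exits with a non-zero
    status.
    """
    subprocess.run(build_command(pdf_file, output_dir, options), check=True)
    return Path(output_dir) / "doc.xml"
```

For instance, build_command("test.pdf", "testout") yields the same command line as the example above, and options such as ["-chars"] are placed before the PDF file name.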

Options

The following command line options are available:
-output type,type,...
Select the types of output to be included in the generated XML. The argument is a comma-separated list (e.g., "-output outline,textext,vector"), including any of the following:
  • outline: include the document outline (aka bookmarks)
  • textext: include the extracted text (column/paragraph/line/word)
  • chars: include per-character information, in addition to per-word information, in the extracted text (ignored if textext is not included)
  • form: include the formfield elements
  • text: include the text drawing operations (textop elements)
  • image: include the image drawing operations (image and imagemask elements)
  • vector: include the vector drawing operations (fill and stroke elements)
  • struct: include the structure tree
The -output option overrides some older options: -outline, -chars, and -textops.
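
A wrapper script may want to check the type list before passing it to -output. The sketch below is one possible approach in Python; the function name output_option is hypothetical, and rejecting "chars" without "textext" is a choice made by this sketch (the tool itself simply ignores it).

```python
# Output types listed in the -output description above.
OUTPUT_TYPES = (
    "outline", "textext", "chars", "form",
    "text", "image", "vector", "struct",
)


def output_option(types):
    """Return ["-output", "type,type,..."] for the requested types."""
    unknown = [t for t in types if t not in OUTPUT_TYPES]
    if unknown:
        raise ValueError("unknown output type(s): " + ", ".join(unknown))
    # The tool ignores chars when textext is absent; this sketch reports
    # the mismatch instead of passing it through silently.
    if "chars" in types and "textext" not in types:
        raise ValueError("chars has no effect unless textext is also selected")
    return ["-output", ",".join(types)]
```

Calling output_option(["outline", "textext", "vector"]) produces the two arguments used in the example above.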
-outline
Include the document outline (aka bookmarks) in the XML output. (-output overrides this option.)
-chars
Output per-character information, in addition to per-word information. The default (without the -chars option) is to generate per-word information only. See PDFdeconstruct Output for details. (-output overrides this option.)
-textops
Output text drawing operations (textop elements), in addition to extracted text (column elements, etc.). See Drawing Operators for details. (-output overrides this option.)
-sepres
Output the info and resources elements in one XML file (doc.xml), and page elements in a second XML file (pages.xml). The default is to output everything (page, info, and resources elements) in doc.xml. See PDFdeconstruct Output for details.
-seppages
Output the info and resources elements in one XML file (doc.xml), and each page element in a separate XML file (pageNNNNNN.xml).
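
A consumer of -seppages output needs to collect the per-page files in page order. The following Python sketch assumes that NNNNNN in pageNNNNNN.xml stands for the decimal page number; the function name page_files is hypothetical.

```python
import re
from pathlib import Path

# Assumption: NNNNNN is a run of decimal digits giving the page number.
PAGE_FILE = re.compile(r"page(\d+)\.xml")


def page_files(output_dir):
    """Return the per-page XML files in output_dir, ordered by page number."""
    found = []
    for path in Path(output_dir).iterdir():
        match = PAGE_FILE.fullmatch(path.name)
        if match:
            found.append((int(match.group(1)), path))
    # Sort numerically rather than by name, so the order does not depend
    # on how the page number is padded.
    return [path for _, path in sorted(found)]
```

Other files in the directory, such as doc.xml and any extracted fonts or images, do not match the pattern and are skipped.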
-unit unit:places
Set the unit and number of decimal places for position and size output. The unit can be any of:
  • "pt" (PostScript point = 1/72 inch)
  • "inch"
  • "mil" (mil = 0.001 inch)
  • "mm"
For example, "-unit inch:3" generates position output in the form "1.234" with a unit of inches. The default setting is "-unit pt:2".
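
To illustrate how a unit:places setting relates to PDF points, the Python sketch below converts a length in points to a formatted string. The function name format_length is hypothetical, and the rounding is whatever Python's string formatting does, which may not match PDFdeconstruct digit for digit.

```python
# Points per unit: 1 inch = 72 pt, 1 mil = 0.001 inch, 1 inch = 25.4 mm.
POINTS_PER_UNIT = {
    "pt": 1.0,
    "inch": 72.0,
    "mil": 0.072,
    "mm": 72.0 / 25.4,
}


def format_length(points, unit_spec="pt:2"):
    """Format a length given in points according to a "unit:places" spec."""
    unit, places = unit_spec.split(":")
    value = points / POINTS_PER_UNIT[unit]
    return f"{value:.{int(places)}f}"
```

With this sketch, a length of 72 points formatted under "inch:3" becomes "1.000", and the default "pt:2" keeps the value in points with two decimal places.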
-imagefmt format
Set the image file format to one of: "PNG", "TIFF", or "JPEG". All images will be converted to the specified format. The default is PNG.
-keepjpeg
Output JPEG image streams as JPEG files. "DCTDecode" images in a PDF file are in standard JPEG format. With this flag, all DCTDecode images will be copied directly to JPEG files in the output directory without decoding and re-encoding, regardless of the -imagefmt setting. (There is one exception: CMYK DCTDecode streams are always re-encoded, because many JPEG readers don't properly handle CMYK JPEG files.)
-nofields
Do not include form field values in the extracted text. Field values will still be included in formfield elements. This is useful if the XML consumer will be drawing form fields based on the formfield elements.
-nopatterns
With this option, tiling patterns will be rendered as a single fill or stroke operation, with <color type="pattern"/>. Without this option, i.e., by default, tiling patterns are reduced to the tile content, repeated for each cell.
-cleantt
Rewrite TrueType fonts to clean up certain errors. This will correct some known problems occasionally found in embedded TrueType fonts in PDF files.
-table
Extract the text in table mode, instead of reading order mode. This will generally split the text into smaller pieces.
-discardInvisible
Discard all "invisible" text, i.e., text drawn in the invisible rendering mode (which is most often used for OCR text).
-discardClipped
Discard all clipped text, i.e., text which is drawn outside of the clipping region and is therefore not visible.
-f page-number
Set the first page to convert. The -f and -l options can be used to select a range of pages smaller than the whole PDF file. The default is to convert the whole file ("-f 1 -l N", where N is the number of pages in the file).
-l page-number
Set the last page to convert. See the description of -f above.
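
A script that converts a sub-range of pages can assemble the -f and -l options as follows. This is a minimal Python sketch; the function name page_range_options is hypothetical, and the range checks are a convenience of the sketch, not documented behavior of the tool.

```python
def page_range_options(first=None, last=None):
    """Return the -f/-l arguments for a page range (page numbers start at 1).

    Omitting both values returns an empty list, which leaves the default
    in effect (convert the whole file).
    """
    if first is not None and first < 1:
        raise ValueError("first page must be 1 or greater")
    if last is not None and last < 1:
        raise ValueError("last page must be 1 or greater")
    if first is not None and last is not None and last < first:
        raise ValueError("last page must not precede first page")
    options = []
    if first is not None:
        options += ["-f", str(first)]
    if last is not None:
        options += ["-l", str(last)]
    return options
```

For example, page_range_options(3, 5) gives ["-f", "3", "-l", "5"], which selects pages 3 through 5.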
-pw password
Set the password for an encrypted PDF file. This can be either the owner password or the user password. If the input PDF file is encrypted ("protected") with copying of text and graphics disabled, PDFdeconstruct will not run without the owner password.