Text

Structure

All text on the page is written first, broken down into columns, paragraphs, lines, and words. If there is any text on the page, the first children of the page element will be one or more column elements.
<column>
  <paragraph>
    <line>
      <word ...>...</word>
      <word ...>...</word>
      ...
    </line>
    <line> ... </line>
    ...
  </paragraph>
  <paragraph> ... </paragraph>
  ...
</column>
<column> ... </column>
...
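As a rough illustration of the nesting described here, the following Python sketch walks a page element and counts its columns, paragraphs, lines, and words. The sample document is hypothetical and heavily trimmed (real word elements carry many more attributes), and the helper name count_structure is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample; real output has more attributes per word.
SAMPLE = """
<page>
  <column>
    <paragraph>
      <line>
        <word spaceAfter="true"><text>Hello</text></word>
        <word spaceAfter="false"><text>world</text></word>
      </line>
      <line>
        <word spaceAfter="false"><text>again</text></word>
      </line>
    </paragraph>
  </column>
</page>
"""


def count_structure(page):
    """Return (columns, paragraphs, lines, words) counts for one page element."""
    columns = page.findall("column")
    paragraphs = [p for c in columns for p in c.findall("paragraph")]
    lines = [ln for p in paragraphs for ln in p.findall("line")]
    words = [w for ln in lines for w in ln.findall("word")]
    return len(columns), len(paragraphs), len(lines), len(words)


page = ET.fromstring(SAMPLE)
print(count_structure(page))
```

Using findall with a plain tag name only matches direct children, which mirrors the strict column, paragraph, line, word hierarchy.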

Words

word elements have a number of attributes and child elements:
<word llx="lower-left-x" lly="lower-left-y"
      urx="upper-right-x" ury="upper-right-y"
      rot="rotation-angle" charpos="character-position"
      font="font-tag" fontSize="font-size"
      underlined="underlined" pos="position"
      spaceAfter="space-after" link="url">
  <color type="rgb" r="red" g="green" b="blue"/>
  <text>word-text</text>
</word>
The bounding box (llx, lly, urx, ury) uses the units specified with the -unit option.

The rotation angle is 0, 90, 180, or 270, indicating the text direction, in degrees counterclockwise from horizontal.

The character position is the index of the first character in the word, in content stream order. Character positions are used in highlight files generated by PDF search tools.

The font tag refers to a font in the resources element. The font size is in the units specified with the -unit option.

The underlined value will be either "true" or "false".

If the pos attribute is present, the position value will be one of:

For normal text, the pos attribute will not be present.

The space-after value, which is either "true" or "false", indicates whether there is a space after this word, before the next word on the same line. This is usually true, but it will be false in certain cases: on the last word in a line, between a word and a following subscript or superscript, between words where there is a change in font (PDFdeconstruct creates separate word elements for this), etc.
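One possible use of the spaceAfter attribute is rebuilding the text of a line. The sketch below is a minimal example under assumptions: the sample line is invented, the helper name line_text is hypothetical, and a missing spaceAfter attribute is treated as "false".

```python
import xml.etree.ElementTree as ET

# Hypothetical sample line: no space between "x" and "2", a space after "2".
SAMPLE_LINE = """
<line>
  <word spaceAfter="false"><text>x</text></word>
  <word spaceAfter="true"><text>2</text></word>
  <word spaceAfter="false"><text>grows</text></word>
</line>
"""


def line_text(line):
    """Join the words of a line, adding a space only where spaceAfter is "true"."""
    parts = []
    for word in line.findall("word"):
        parts.append(word.findtext("text", default=""))
        if word.get("spaceAfter") == "true":
            parts.append(" ")
    # Defensive: drop a trailing space if the last word happens to carry one.
    return "".join(parts).rstrip()


print(line_text(ET.fromstring(SAMPLE_LINE)))
```

Joining on spaceAfter rather than on a fixed separator keeps words that were split only because of a font change or a subscript/superscript from being pulled apart.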

If this word is (part of) a URL-type link, the link attribute value will be set to the link target.

All colors are converted to RGB. The RGB components use the PDF/PostScript convention of real numbers between 0 and 1. For example, 50% gray ("#808080" in HTML terminology) would be:

<color type="rgb" r="0.5" g="0.5" b="0.5"/>
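To relate the 0-to-1 components to the HTML notation mentioned above, a small Python sketch can scale each component to 0..255 and format it as hex. The function name color_to_hex is hypothetical, and the clamping and round-half-up behaviour are choices made for this example, not something the output format prescribes.

```python
import xml.etree.ElementTree as ET


def color_to_hex(color):
    """Convert a <color type="rgb" .../> element to an HTML-style "#rrggbb" string."""

    def channel(name):
        value = float(color.get(name))
        value = min(1.0, max(0.0, value))  # clamp to the expected 0..1 range
        return int(value * 255 + 0.5)      # scale to 0..255, rounding half up

    return "#{:02x}{:02x}{:02x}".format(channel("r"), channel("g"), channel("b"))


gray = ET.fromstring('<color type="rgb" r="0.5" g="0.5" b="0.5"/>')
print(color_to_hex(gray))  # prints "#808080"
```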
The text element contains the text of the word (converted to Unicode).

Characters

If the -chars option is used, word elements will have additional char children. There will be one char element for each character in the word:
<word ...>
  <color .../>
  <text>abc</text>
  <char llx="lower-left-x" lly="lower-left-y" urx="upper-right-x" ury="upper-right-y">a</char>
  <char llx="lower-left-x" lly="lower-left-y" urx="upper-right-x" ury="upper-right-y">b</char>
  <char llx="lower-left-x" lly="lower-left-y" urx="upper-right-x" ury="upper-right-y">c</char>
</word>
The char elements provide the individual bounding boxes for each character in the word.
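The following Python sketch shows one way the per-character boxes might be read. The coordinates in the sample are invented, the helper names char_boxes and union_box are hypothetical, and the code assumes every char element carries all four bounding-box attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical word with made-up coordinates, as produced with -chars.
SAMPLE_WORD = """
<word llx="10" lly="20" urx="28" ury="30">
  <text>abc</text>
  <char llx="10" lly="20" urx="16" ury="30">a</char>
  <char llx="16" lly="20" urx="22" ury="30">b</char>
  <char llx="22" lly="20" urx="28" ury="30">c</char>
</word>
"""


def char_boxes(word):
    """Return a list of (character, (llx, lly, urx, ury)) for each char child."""
    boxes = []
    for ch in word.findall("char"):
        box = tuple(float(ch.get(key)) for key in ("llx", "lly", "urx", "ury"))
        boxes.append((ch.text, box))
    return boxes


def union_box(boxes):
    """Smallest box enclosing all given (llx, lly, urx, ury) boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


word = ET.fromstring(SAMPLE_WORD)
for character, box in char_boxes(word):
    print(character, box)
print(union_box([box for _, box in char_boxes(word)]))
```

Per-character boxes are handy when a highlight or selection has to cover only part of a word.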