Structure

All text on the page is written first, broken down into columns, paragraphs, lines, and words. If there is any text on the page, the first children of the
page element will be one or more
word elements have a number of attributes and child elements:
ury) uses the units specified with the
The rotation angle is 0, 90, 180, or 270, indicating the text direction, in degrees counterclockwise from horizontal.
The character position is the index of the first character in the word, in content stream order. Character positions are used in highlight files generated by PDF search tools.
The font tag refers to a font in
resources element. The font size is in
the units specified with the
The underlined value will be either "
pos attribute is present, the position value will
be one of:
"sub": the word is a subscript
"super": the word is a superscript
"dropcap": the word is a drop capital (at the start of a paragraph)
pos attribute will not be present.
The space-after value, which is either "true" or
"false", indicates whether there is a space after this
word, before the next word on the same line. This is usually true,
but it will be false in certain cases: on the last word in a line,
between a word and a following subscript or superscript, between words
where there is a change in font (PDFdeconstruct creates separate word
elements for this), etc.
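
As a rough illustration of how the space-after value can be used, the sketch below rebuilds the text of one line from its words. The `(text, space_after)` tuples and the `join_words` helper are hypothetical stand-ins for values already read from the word elements; they are not part of the PDFdeconstruct output format.

```python
from typing import Iterable, Tuple


def join_words(words: Iterable[Tuple[str, bool]]) -> str:
    """Concatenate word texts, adding a space only where space_after is true."""
    parts = []
    for text, space_after in words:
        parts.append(text)
        if space_after:
            parts.append(" ")
    return "".join(parts)


# Hypothetical line: "H", the subscript "2", and "O" are separate words
# with no space between them; the last word on the line has no space after it.
line = [("H", False), ("2", False), ("O", True), ("is", True), ("water", False)]
print(join_words(line))
```

Relying on the flag rather than inserting a space between every pair of words keeps subscripts, superscripts, and font changes attached to the neighboring text.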
If this word is (part of) a URL-type link, the
attribute value will be set to the link target.
All colors are converted to RGB. The RGB components use the
PDF/PostScript convention of real numbers between 0 and 1. For
example, 50% gray ("#808080" in HTML terminology) would
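
The relation between the two color notations can be sketched in a few lines. This is a minimal example, assuming the three components have already been read as floats; the `rgb_to_html` helper is hypothetical and not part of PDFdeconstruct.

```python
def rgb_to_html(r: float, g: float, b: float) -> str:
    """Convert RGB components in the range 0..1 to an HTML-style hex color."""

    def to_byte(c: float) -> int:
        # Clamp to the valid range, then scale to 0..255 and round to nearest.
        c = min(max(c, 0.0), 1.0)
        return int(c * 255 + 0.5)

    return "#{:02x}{:02x}{:02x}".format(to_byte(r), to_byte(g), to_byte(b))


# 50% gray: each component is 0.5.
print(rgb_to_html(0.5, 0.5, 0.5))
```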
text element contains the text of the word (converted to Unicode).
-chars option is used,
word elements will have additional
char children. There will be one
char element for each character in the word:
char elements provide the individual bounding boxes for each character in the word.
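
One possible use of the per-character boxes is to combine them, for example to get the box covering a run of characters inside a word. The sketch below assumes each box has already been read into a tuple ordered as (llx, lly, urx, ury); that ordering and the `union_box` helper are assumptions made for illustration, not something defined by PDFdeconstruct.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # assumed order: (llx, lly, urx, ury)


def union_box(boxes: Iterable[Box]) -> Box:
    """Return the smallest box that encloses every box in the input."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("at least one box is required")
    return (
        min(b[0] for b in boxes),  # leftmost lower-left x
        min(b[1] for b in boxes),  # lowest lower-left y
        max(b[2] for b in boxes),  # rightmost upper-right x
        max(b[3] for b in boxes),  # highest upper-right y
    )


# Hypothetical boxes for three adjacent characters.
char_boxes = [
    (10.0, 20.0, 15.0, 30.0),
    (15.0, 18.0, 21.0, 30.0),
    (21.0, 20.0, 26.0, 32.0),
]
print(union_box(char_boxes))
```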