Monsters from the wild
Some PDF writers produce PDFs which are not correct according to the specification. The term monster refers to Lakatosian monsters, a term coined by Imre Lakatos for counterexamples to a theory.
Software that reads real PDF files cannot just throw an error when something is wrong. Instead, it should deal with wrong structures and try to use as much information as possible from the file.
Generally, situations like this raise a specific, proceedable error. The reading software can therefore handle the error, or ignore it if it is not important.
This page describes some of the problems encountered in real PDFs from the wild and discusses ways to deal with such situations.
Missing object
An attribute of an object holds a reference that points to a free entry in the cross reference table.
Example
The /Outlines entry of the catalog references indirect object (2 0):
1 0 obj << /Type /Catalog /Outlines 2 0 R /Pages 3 0 R >> endobj
The cross reference section
xref
0 7
0000000000 65535 f
0000000009 00000 n
0000000000 65535 f
0000000131 00000 n
... % 4 more
The reference to object (2 0) is the 3rd entry in the xref table, which is a free entry (0000000000 65535 f). Therefore, object (2 0) cannot be accessed through the xref table. It does not matter whether the object is actually stored in the file or not.
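The lookup described above can be sketched in Python. This is only an illustration, not the library's code: `parse_xref_subsection` and `is_missing` are hypothetical names, and the subsection header is shortened to `0 4` because only the four entries shown in the example are included.

```python
def parse_xref_subsection(lines):
    """Parse one classic xref subsection given as a list of text lines.

    Returns {object_number: (offset, generation, in_use)}.
    """
    first, count = (int(n) for n in lines[0].split())
    entries = {}
    for index, line in enumerate(lines[1:1 + count]):
        offset, generation, kind = line.split()
        entries[first + index] = (int(offset), int(generation), kind == "n")
    return entries


def is_missing(entries, number, generation):
    """True if the reference `number generation R` cannot be resolved."""
    entry = entries.get(number)
    if entry is None:
        return True  # no entry at all for this object number
    _offset, entry_generation, in_use = entry
    # A free entry ("f") or a generation mismatch means there is no usable object.
    return not in_use or entry_generation != generation


# The four entries from the example (assumed header: first object 0, 4 entries).
xref_lines = [
    "0 4",
    "0000000000 65535 f",
    "0000000009 00000 n",
    "0000000000 65535 f",
    "0000000131 00000 n",
]
entries = parse_xref_subsection(xref_lines)
outlines_missing = is_missing(entries, 2, 0)  # the /Outlines reference 2 0 R
```

The check never looks at the file body, which mirrors the point made above: whether the bytes of object (2 0) exist somewhere in the file is irrelevant once its xref entry is free.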
Handling
A proceedable MissingObjectError is raised. The reference points to a MissingObject containing the expected type.
When the reference is written out to a new PDF, the string (The original object is missing) is written as the object instead (if type information is available, it is added to the message). This preserves the reference, which may be used in several places. Subsequently, reading that PDF will result in a type mismatch (unless the original object was a string as well, which is unlikely).
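A rough Python sketch of this behaviour follows. Python has no proceedable (resumable) exceptions, so the sketch assumes an `on_error` callback instead: if the callback returns, reading proceeds with a placeholder. The class names are taken from the text, but the classes themselves, `resolve`, `to_pdf` and the callback protocol are illustrative assumptions, not the library's API. How type information is added to the message is not specified here, so the sketch only stores the expected type.

```python
class MissingObjectError(Exception):
    """Stand-in for the proceedable error described above."""

    def __init__(self, number, generation):
        super().__init__(f"object ({number} {generation}) is missing")
        self.number = number
        self.generation = generation


class MissingObject:
    """Placeholder that a dangling reference points to."""

    def __init__(self, expected_type=None):
        self.expected_type = expected_type

    def to_pdf(self):
        # Written as a PDF string so that the reference stays valid.
        return "(The original object is missing)"


def resolve(objects, number, generation, expected_type=None, on_error=None):
    """Return the referenced object, or a MissingObject if the handler proceeds."""
    key = (number, generation)
    if key in objects:
        return objects[key]
    error = MissingObjectError(number, generation)
    if on_error is None:
        raise error  # nobody wants to proceed
    on_error(error)  # a handler that returns normally means "proceed"
    return MissingObject(expected_type)


objects = {(1, 0): "<< /Type /Catalog /Outlines 2 0 R /Pages 3 0 R >>"}
seen = []
placeholder = resolve(objects, 2, 0, expected_type="Outlines", on_error=seen.append)
```

Collecting the errors in a list, as done with `seen.append`, lets the caller decide after reading whether any of the missing objects actually matter.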
Reference
Seen in /Info/Producer: Microsoft® Excel® for Microsoft 365 and Bluebeam PDF Library 18
Incorrect stream length
The /Length of a stream is different from the size of the content. The content consists of the bytes between the token stream (followed by one LF) and the token endstream, with optional extra whitespace before the endstream token.
The following cases are possible:
/Length is smaller than the content. The endstream token lies ahead.
/Length is larger than the content. The endstream token, or parts of it, has been read as part of the stream already.
The particular monster where I encountered this always had one byte too many in the content. Therefore, the general problem was not handled, only the simple case where the content is exactly 1 byte larger than the number of bytes given by the /Length attribute.
Example
42 0 obj
<</Length 9>>
stream
abcdefghij
endstream
endobj
In the example, the stream content in the file is abcdefghij, a 10 byte string, but the /Length attribute states 9 bytes. Therefore, the j is extra. If the error is resumed, a stream with abcdefghi will be created.
Handling
The library handles one specific instance of this error: when there is exactly one byte too many between stream and endstream (trailing whitespace is ignored). Then, a proceedable error (ExtraCharacterInStreamError) is raised with the extra character as parameter. When proceeding, a stream is created with the number of content bytes given by /Length, and the extra byte is discarded.
If there is more than one extra byte, a ReadError is raised and no stream object is created. The error may be proceeded, but if the stream is used later, another error will occur.
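The one-byte case can be sketched in Python as follows. This is an illustration under assumptions, not the library's implementation: the error class names come from the text, but `read_stream_content`, the helper `_at_endstream` and the `on_error` callback (which stands in for proceeding, since Python exceptions cannot be resumed) are hypothetical.

```python
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


class ReadError(Exception):
    """Stand-in for the general read error."""


class ExtraCharacterInStreamError(Exception):
    """Stand-in for the proceedable error; carries the extra byte."""

    def __init__(self, character):
        super().__init__(f"extra character {character!r} in stream")
        self.character = character


def _at_endstream(data, position):
    """True if only whitespace separates `position` from an endstream token."""
    return data[position:].lstrip(PDF_WHITESPACE).startswith(b"endstream")


def read_stream_content(data, start, length, on_error=None):
    """Return `length` content bytes starting at `start`, tolerating one extra byte."""
    end = start + length
    if _at_endstream(data, end):
        return data[start:end]  # /Length is correct
    if _at_endstream(data, end + 1):
        error = ExtraCharacterInStreamError(data[end:end + 1])
        if on_error is None:
            raise error
        on_error(error)  # handler returned: proceed and drop the extra byte
        return data[start:end]
    raise ReadError("stream /Length does not match the content")


# The example object: content "abcdefghij", but /Length says 9.
data = b"abcdefghij\nendstream\nendobj"
errors = []
content = read_stream_content(data, 0, 9, on_error=errors.append)
```

Because whitespace before endstream is skipped first, an extra byte that is itself whitespace is silently accepted by the first check and never reported.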
Known problem
The general problem has not been addressed. One idea is to find the end of the stream content of the current object. With this information, it is possible to determine whether the /Length entry is too small or too big and what to do about it.
The end of the stream would be just before the endstream and endobj tokens that precede the start of the next object. The cross reference table has the offsets of all objects. Sorted by offset, the next object after the current one is given by the next larger offset in the table. If the current object is the last object in the PDF, the end of the object is before the xref token starting the cross reference table.
Object streams need not be considered, because they cannot contain streams.
This should be easy for the simple case of only one xref table. But handling several xrefs from different updates was deemed too complex at the time (that's why I write this here as a reminder for the next time I need to deal with this problem).
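For the simple case of a single xref table, the idea might look like the following Python sketch. Nothing here is implemented in the library: `object_end_limit` and `actual_stream_length` are hypothetical names, the offsets are assumed to be those of in-use entries only, and stripping a single end-of-line marker before endstream is an assumption about where the content ends.

```python
import bisect


def object_end_limit(object_offset, in_use_offsets, xref_offset):
    """Offset before which the object starting at `object_offset` must end."""
    ordered = sorted(in_use_offsets)  # xref order is by object number, not by offset
    index = bisect.bisect_right(ordered, object_offset)
    # Last object in the file: it ends before the xref token.
    return ordered[index] if index < len(ordered) else xref_offset


def actual_stream_length(data, content_start, limit):
    """Measure the stream content using the last endstream token before `limit`."""
    end = data.rfind(b"endstream", content_start, limit)
    if end < 0:
        raise ValueError("no endstream token before the next object")
    content = data[content_start:end]
    # Assumption: one end-of-line marker before endstream is not content.
    if content.endswith(b"\r\n"):
        return len(content) - 2
    if content.endswith((b"\n", b"\r")):
        return len(content) - 1
    return len(content)


data = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<</Length 9>>\nstream\nabcdefghij\nendstream\nendobj\n"
    b"2 0 obj\n(x)\nendobj\n"
    b"xref\n"
)
offsets = [data.index(b"1 0 obj"), data.index(b"2 0 obj")]
xref_offset = data.index(b"xref")
content_start = data.index(b"stream\n") + len(b"stream\n")

limit = object_end_limit(offsets[0], offsets, xref_offset)
measured = actual_stream_length(data, content_start, limit)
```

Comparing `measured` with the declared /Length then tells whether /Length is too small or too big. Searching backwards from the limit keeps an `endstream` byte sequence inside the stream data from being mistaken for the token.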
Reference
Seen in /Info/Producer: Bluebeam PDF Library 18