commit - f31903c8acf3479533aa88556b5632ef54159fe3
commit + 6c9bc1ec4f8b08250abae2dcdd866229b28a8727
blob - eae25e8a2040dec85d1ccbea20d0597990b0970a
blob + cf50497b7e1996078704db487bda10174226587e
--- TODO
+++ TODO
- - move main routine(s) into separate source file.
- - move filter implementation(s) into separate source file.
+ - move main routine(s) and filter implementation(s) into separate source
+ files. e.g.:
+ - main.c: main function and helpers; starting from its include block
+ - pdf.c: parser proper; grammar and basic semantic actions
+ - filter.c: filters
+ - maybe another file just for xref or stream stuff?
- - investigate memory use on big documents (millions of objects).
+ - refactor / clean up the (ascii) filter implementations.
+ - rework VIOL to produce a "violation" token in the AST (via h_action). then,
+ a validation (h_attr_bool) should let the parse fail if applicable (severity
+ vs. strictness). non-fatal violations should be extracted and printed to
+ stderr after the parse.
+
+ - (maybe?) change stream parsing to just stop at "endstream endobj" when
+ /Length is indirect and the filter or postordinate parser doesn't delimit
+ itself. this is not strictly to-spec, but probably an OK restriction to make
+   in practice. a consistency check can be made against the length after all
+ objects have been parsed.
+
+ note: the current design aims to follow the spec to the letter in that the
+ /Length entry of a stream determines its length, and nothing else. from this
+ it follows that we must find and parse these lengths in "island style".
+ thus, the current code is a hybrid of linear and island parsing. if the
+ reliance on /Length can be broken, the island-based resolver can go and we
+ can have a proper split between two separate parsers - one pure linear and
+ one pure island.
+
+ - parse and print content streams.
+
+ - implement random-access ("island") parser (walking objects from /Root).
+ i'm not sure how much we need to know about the "DOM" for this. maybe
+ nothing? since everything is built out of basic objects and we can just
+ blindly follow references?
+ - check linear and random-access parses for consistency.
+
+
- replace disparate parsing routines (applied to different pieces of input)
with one big HParser that uses h_seek() to move around. this will enable
packrat to cache, for instance, the xref tables instead of us parsing them
- parse stream objects without reference to their /Length entry by simply
trying all possible ways and consistency-checking them against the xref
table in the end, via h_attr_bool().
+ XXX is this actually possible (without unreasonable complications)?
- - include position information, at least for objects, in the (JSON) output.
- - format warnings/errors (stderr) as JSON, too.
+ - investigate memory use on big documents (millions of objects).
- - make custom token types for all appropriate parts of the parse result.
-
- - parse content streams.
-
- - implement random-access parser (walking objects from /Root).
- - check linear and random-access parses for consistency.
-
- - handle garbage before %PDF- and after %%EOF
- - handle garbage at other points in the input?
-
- - add ASCII filter types.
- - add LZW filter.
+ - make custom token types for all appropriate parts of the parse result so
+ that they can be properly distinguished in the output.
+ - include position information, at least for objects, in the (JSON) output.