commit 6c9bc1ec4f8b08250abae2dcdd866229b28a8727
from: Sven M. Hallberg
date: Fri May 08 12:55:25 2020 UTC

update todo list

commit - f31903c8acf3479533aa88556b5632ef54159fe3
commit + 6c9bc1ec4f8b08250abae2dcdd866229b28a8727
blob - eae25e8a2040dec85d1ccbea20d0597990b0970a
blob + cf50497b7e1996078704db487bda10174226587e
--- TODO
+++ TODO
@@ -1,8 +1,40 @@
- - move main routine(s) into separate source file.
- - move filter implementation(s) into separate source file.
+ - move main routine(s) and filter implementation(s) into separate source
+   files. e.g.:
+   - main.c: main function and helpers; starting from its include block
+   - pdf.c: parser proper; grammar and basic semantic actions
+   - filter.c: filters
+   - maybe another file just for xref or stream stuff?

- - investigate memory use on big documents (millions of objects).
+ - refactor / clean up the (ascii) filter implementations.

+ - rework VIOL to produce a "violation" token in the AST (via h_action). then,
+   a validation (h_attr_bool) should let the parse fail if applicable (severity
+   vs. strictness). non-fatal violations should be extracted and printed to
+   stderr after the parse.
+
+ - (maybe?) change stream parsing to just stop at "endstream endobj" when
+   /Length is indirect and the filter or postordinate parser doesn't delimit
+   itself. this is not strictly to-spec, but probably an OK restriction to make
+   in practice. a consistency checks can be made against the length after all
+   objects have been parsed.
+
+   note: the current design aims to follow the spec to the letter in that the
+   /Length entry of a stream determines its length, and nothing else. from this
+   it follows that we must find and parse these lengths in "island style".
+   thus, the current code is a hybrid of linear and island parsing. if the
+   reliance on /Length can be broken, the island-based resolver can go and we
+   can have a proper split between two separate parsers - one pure linear and
+   one pure island.
+
+ - parse and print content streams.
+
+ - implement random-access ("island") parser (walking objects from /Root).
+   i'm not sure how much we need to know about the "DOM" for this. maybe
+   nothing? since everything is built out of basic objects and we can just
+   blindly follow references?
+ - check linear and random-access parses for consistency.
+
+
  - replace disparate parsing routines (applied to different pieces of input)
    with one big HParser that uses h_seek() to move around. this will enable
    packrat to cache, for instance, the xref tables instead of us parsing them
@@ -11,19 +43,10 @@
  - parse stream objects without reference to their /Length entry by simply
    trying all possible ways and consistency-checking them against the xref table
    in the end, via h_attr_bool().
+   XXX is this actually possible (without unreasonable complications)?

- - include position information, at least for objects, in the (JSON) output.
- - format warnings/errors (stderr) as JSON, too.
+ - investigate memory use on big documents (millions of objects).

- - make custom token types for all appropriate parts of the parse result.
-
- - parse content streams.
-
- - implement random-access parser (walking objects from /Root).
- - check linear and random-access parses for consistency.
-
- - handle garbage before %PDF- and after %%EOF
- - handle garbage at other points in the input?
-
- - add ASCII filter types.
- - add LZW filter.
+ - make custom token types for all appropriate parts of the parse result so
+   that they can be properly distinguished in the output.
+ - include position information, at least for objects, in the (JSON) output.
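
note: the "(maybe?)" item above proposes ending a stream at "endstream endobj"
when /Length cannot be resolved directly, and checking the declared length
afterwards. the following is a rough sketch of that idea in plain C, not code
from this repository and not using the hammer API. the names is_pdf_ws,
find_stream_end and length_consistent are hypothetical, and the tolerance for
an optional end-of-line before "endstream" is an assumption about what the
consistency check should accept.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE. */
static int is_pdf_ws(unsigned char c)
{
    return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' ||
           c == ' ';
}

/*
 * Hypothetical helper: scan buf[0..n) for the first "endstream" that is
 * followed by at least one whitespace character and then "endobj".
 * Returns the offset of "endstream", or -1 if no such position exists.
 */
static long find_stream_end(const unsigned char *buf, size_t n)
{
    static const char es[] = "endstream";
    static const char eo[] = "endobj";
    const size_t nes = sizeof es - 1, neo = sizeof eo - 1;
    size_t i, j;

    assert(buf != NULL);
    for (i = 0; i + nes <= n; i++) {
        if (memcmp(buf + i, es, nes) != 0)
            continue;
        j = i + nes;
        while (j < n && is_pdf_ws(buf[j]))
            j++;
        if (j > i + nes && j + neo <= n && memcmp(buf + j, eo, neo) == 0)
            return (long)i;
    }
    return -1;
}

/*
 * Hypothetical after-the-fact check: does the declared /Length agree with
 * the position found above? Assumption: an optional EOL (CR, LF or CRLF)
 * may sit between the stream data and "endstream".
 */
static int length_consistent(const unsigned char *buf, long end,
                             size_t declared)
{
    size_t e, gap;

    if (end < 0 || declared > (size_t)end)
        return 0;
    e = (size_t)end;
    gap = e - declared;
    if (gap == 0)
        return 1;
    if (gap == 1)
        return buf[declared] == '\n' || buf[declared] == '\r';
    if (gap == 2)
        return buf[declared] == '\r' && buf[declared + 1] == '\n';
    return 0;
}
```

here buf is assumed to start at the first byte of stream data. the first match
wins, so stream data that itself contains "endstream" followed by "endobj"
would be cut short; that is the "not strictly to-spec" restriction the item
mentions, and the length check is what would flag such a case once all objects
have been parsed.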