Commit Diff

Commit: 70053d655f13f7ecbea5789cf4ea505bfce21f75
From: Pompolic <pompolic@special-circumstanc.es>
Date: Wed May 20 15:18:43 2020 UTC
Message: Merge branch 'master' of gitlab.special-circumstanc.es:pesco/pdf
commit - 162e65e85fe6ac42d796fa3ce4fc19c237777069
commit + 70053d655f13f7ecbea5789cf4ea505bfce21f75
blob - eae25e8a2040dec85d1ccbea20d0597990b0970a
blob + 70321e71d3a5794d35fdb53f91189589073b3d02
--- TODO
+++ TODO
@@ -1,8 +1,91 @@
- - move main routine(s) into separate source file.
- - move filter implementation(s) into separate source file.
+ - fix the object stream parser to split input at logical boundaries, as
+   provided by the object index ("N pairs of integers") at the beginning of the
+   stream data.
 
- - investigate memory use on big documents (millions of objects).
+   this follows discussion with peter wyatt where he initially said that the
+   objects should be delimited by normal PDF token rules, but PDFA then came
+   to the conclusion that, in fact, this was a mistake and the logical
+   begin/end info should delimit things. i.e. if your index says that an object
+   begins at offset 0 and ends at offset 3, followed by one that ends at 6, and
+   the input is "123456", this parses as two numbers, 123 and 456.
 
+   currently the code follows the incorrect former approach, (re-)using the
+   "elemr" parser that is otherwise used with arrays. the above example would
+   parse as one element, the number 123456, in contradiction to the index
+   (which we parse but ignore).
+
+   we have to explicitly walk the index, run our "obj" parser on each
+   respective snippet of input, and wrap the results up in a parse result. we
+   should also validate conditions on the index beforehand. these are
+   thankfully sane (monotonic offsets etc.) and mentioned in the spec.
+
+ - move main routine(s) and filter implementation(s) into separate source
+   files. e.g.:
+   - main.c: main function and helpers; starting from its include block
+   - pdf.c: parser proper; grammar and basic semantic actions
+   - filter.c: filters
+   - maybe another file just for xref or stream stuff?
+
+ - refactor / clean up the (ascii) filter implementations.
+
+ - rework VIOL to produce a "violation" token in the AST (via h_action). then,
+   a validation (h_attr_bool) should let the parse fail if applicable (severity
+   vs. strictness). non-fatal violations should be extracted and printed to
+   stderr after the parse.
+ - somehow rid VIOL() of the internal parser for getting at the severity
+   parameter. this is, i guess, an artefact of h_action() taking a single void
+   pointer of context, so it was not trivial to pass two arguments (message and
+   severity) to the action.
+
+ - (maybe?) change stream parsing to just stop at "endstream endobj" when
+   /Length is indirect and the filter or postordinate parser doesn't delimit
+   itself. this is not strictly to-spec, but probably an OK restriction to make
+   in practice. a consistency check can be made against the length after all
+   objects have been parsed.
+
+   note: the current design aims to follow the spec to the letter in that the
+   /Length entry of a stream determines its length, and nothing else. from this
+   it follows that we must find and parse these lengths in "island style".
+   thus, the current code is a hybrid of linear and island parsing. if the
+   reliance on /Length can be broken, the island-based resolver can go and we
+   can have a proper split between two separate parsers - one pure linear and
+   one pure island.
+
+ - parse and print content streams.
+ - parse/validate additional stream types/filters (images...).
+
+ - consider reviving the effort to get "obj" to parse with LALR. the messy
+   grammar for arrays with "elemd", "elemr", etc. still stems from that effort, as
+   does the explicit handling of whitespace -- note that TOK() is only used in
+   KW() and that no instances of KW() remain under "obj".
+
+   alternatively, consider fully reverting the grammar to its clearer PEG form.
+   i would probably keep the explicit whitespace, though.
+
+   what stopped me before was the difficulty of resolving some things without
+   precedence rules; specifically line endings in string literals.
+   is <CR><LF> a "crlf" or a "cr" followed by an "lf"? LALR cannot decide
+   unless you encode that anything following a "cr" doesn't start with <LF>.
+   string literals are currently defined differently. the best way to do it,
+   AFAICS, would be to match (in string literals) all subsequent line endings
+   in one nonterminal and to encode there that a plain "cr" is never followed
+   by "lf".
+
+   FWIW, the motivation for LALR parsing of "obj" was the prospect of parsing
+   an object stream incrementally, as chunks come in from the decompressor
+   (or an arbitrary filter chain).
+
+   NB: the reason why we must distinguish "crlf" from "cr" "lf" at all is of
+   course that in a string literal, the former means "\n" and the latter means
+   "\n\n".
+
+ - implement random-access ("island") parser (walking objects from /Root).
+   i'm not sure how much we need to know about the "DOM" for this. maybe
+   nothing? since everything is built out of basic objects and we can just
+   blindly follow references?
+ - check linear and random-access parses for consistency.
+
+
  - replace disparate parsing routines (applied to different pieces of input)
    with one big HParser that uses h_seek() to move around. this will enable
    packrat to cache, for instance, the xref tables instead of us parsing them
@@ -11,19 +94,10 @@
  - parse stream objects without reference to their /Length entry by simply
    trying all possible ways and consistency-checking them against the xref
    table in the end, via h_attr_bool().
+   XXX is this actually possible (without unreasonable complications)?
 
+ - investigate memory use on big documents (millions of objects).
+
+ - make custom token types for all appropriate parts of the parse result so
+   that they can be properly distinguished in the output.
  - include position information, at least for objects, in the (JSON) output.
- - format warnings/errors (stderr) as JSON, too.
-
- - make custom token types for all appropriate parts of the parse result.
-
- - parse content streams.
-
- - implement random-access parser (walking objects from /Root).
- - check linear and random-access parses for consistency.
-
- - handle garbage before %PDF- and after %%EOF
- - handle garbage at other points in the input?
-
- - add ASCII filter types.
- - add LZW filter.
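
The first TODO item in the diff above (splitting object stream data at the boundaries given by the index) can be sketched roughly as follows. This is a minimal illustration and not code from the repository: `struct slice`, `split_by_index` and `parse_int_slice` are hypothetical names, the offsets are assumed to be already extracted from the index and relative to the start of the object data, and `strtol` merely stands in for running the real "obj" parser on each snippet.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* one logical snippet of object stream data: [start, start + len) */
struct slice {
	const char *start;
	size_t len;
};

/*
 * hypothetical helper, not part of the actual code base.
 *
 * offs[i] is the byte offset of object i, relative to the start of data.
 * object i ends where object i+1 begins; the last one ends at the end of
 * the data. returns 0 on success, -1 if the index is not sane.
 */
int
split_by_index(const char *data, size_t size, const size_t *offs, size_t n,
    struct slice *out)
{
	size_t i, end;

	for (i = 0; i < n; i++) {
		end = (i + 1 < n) ? offs[i + 1] : size;

		/* offsets must be in bounds and must not decrease */
		if (offs[i] > size || end > size || end < offs[i])
			return -1;

		out[i].start = data + offs[i];
		out[i].len = end - offs[i];
	}
	return 0;
}

/* stand-in for the "obj" parser: read a snippet as a decimal integer */
long
parse_int_slice(struct slice s)
{
	char buf[32];
	size_t n = s.len < sizeof buf - 1 ? s.len : sizeof buf - 1;

	memcpy(buf, s.start, n);	/* snippets are not NUL-terminated */
	buf[n] = '\0';
	return strtol(buf, NULL, 10);
}
```

With the input "123456" and offsets 0 and 3, `split_by_index` yields two snippets of three bytes each, which then parse separately as 123 and 456 instead of the single number 123456. Validating the whole index before parsing any snippet, as the TODO suggests, would be a straightforward variation.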