commit f0a0c5e07a66bb204971cbb7a4bcebe8e1960519
from: Sven M. Hallberg
date: Fri May 08 16:19:22 2020 UTC

more TODO

commit - 6c9bc1ec4f8b08250abae2dcdd866229b28a8727
commit + f0a0c5e07a66bb204971cbb7a4bcebe8e1960519
blob - cf50497b7e1996078704db487bda10174226587e
blob + 70321e71d3a5794d35fdb53f91189589073b3d02
--- TODO
+++ TODO
@@ -1,3 +1,24 @@
+ - fix the object stream parser to split input at logical boundaries, as
+   provided by the object index ("N pairs of integers") at the beginning of the
+   stream data.
+
+   this follows discussion with peter wyatt where he initially said that the
+   objects should be delimited by normal PDF token rules, but PDFA then came
+   to the conclusion that, in fact, this was a mistake and the logical
+   begin/end info should delimit things. i.e. if your index says that an object
+   begins at offset 0 and ends at offset 3, followed by one that ends at 6, and
+   the input is "123456", this parses as two numbers, 123 and 456.
+
+   currently the code follows the incorrect former approach, (re-) using the
+   "elemr" parser that is otherwise used with arrays. the above example would
+   parse as one element, the number 123456, in contradiction to the index
+   (which we parse but ignore).
+
+   we have to explicitly walk the index, run our "obj" parser on each
+   respective snippet of input, and wrap the results up in a parse result. we
+   should also validate conditions on the index beforehand. these are
+   thankfully sane (monotonic offsets etc.) and mentioned in the spec.
+
  - move main routine(s) and filter implementation(s) into separate source
    files. e.g.:
    - main.c: main function and helpers; starting from its include block
@@ -11,6 +32,10 @@
    a validation (h_attr_bool) should let the parse fail if applicable
    (severity vs. strictness). non-fatal violations should be extracted and
    printed to stderr after the parse.
+ - somehow rid VIOL() of the internal parser for getting at the severity
+   parameter.
+   this is, i guess, an artefact of h_action() taking a single void
+   pointer of context, so it was not trivial to pass two arguments (message and
+   severity) to the action.
  - (maybe?) change stream parsing to just stop at "endstream endobj" when
    /Length is indirect and the filter or postordinate parser doesn't delimit
@@ -27,7 +52,33 @@
    one pure island.
  - parse and print content streams.
+ - parse/validate additional stream types/filters (images...).
+ - consider reviving the effort to get "obj" to parse with LALR. the messy
+   grammar for arrays with "elemd", "elemr", etc. still stems from project, as
+   does the explicit handling of whitespace -- note that TOK() is only used in
+   KW() and that no instances of KW() remain under "obj".
+
+   alternatively, consider fully reverting the grammar to its clearer PEG form.
+   i would probably keep the explicit whitespace, though.
+
+   what stopped me before was the difficulty to resolve some things without
+   precedence rules; specifically line endings in string literals.
+   is a "crlf" or a "cr" followed by an "lf"? LALR cannot decide
+   unless you encode that anything following a "cr" doesn't start with .
+   string literals are currently defined differently. the best way to do it,
+   AFAICS, would be to match (in string literals) all subsequent line endings
+   in one nonterminal and to encode there that a plain "cr" is never followed
+   by "lf".
+
+   FWIW, the motivation for LALR parsing of "obj" was the prospect of parsing
+   an object stream incrementally, as chunks come in from the decompressor
+   (or an arbitrary filter chain).
+
+   NB: the reason why we must distinguish "crlf" from "cr" "lf" at all is of
+   course that in a string literal, the former means "\n" and the latter means
+   "\n\n".
+
  - implement random-access ("island") parser (walking objects from /Root).
    i'm not sure how much we need to know about the "DOM" for this. maybe
    nothing? since everything is built out of basic objects and we can just