[antlr-interest] syntactic predicates vs. backtrack=true

Steve Bennett stevagewp at gmail.com
Sun Feb 10 17:35:07 PST 2008


On 2/11/08, Jim Idle <jimi at temporal-wave.com> wrote:
> So, when you say "define all the rules rigidly", you can mean that you
> don't want to change the markup language at all, just fix it as it is
> now, in time, forever, or you could mean that you want to create an
> actual formal spec and weed out those constructs that are 'obviously'
> incorrect. The latter implies that pages that used to render correctly
> would no longer do so of course. I like wikis but I do think that the

Fortunately, the notion of "render correctly" is not well defined, and
is hence subject to negotiation. We have reached a compromise along
the lines of "correct is whatever the current parser does, unless
people think that's incorrect". Or similarly, "correct is whatever
makes the most sense out of the existing corpus of wikitext".

IMHO, the default behaviour of rendering literally is really an
artefact of the regexp-based parser currently in use. There is strong
agreement that a "syntax error" should not prevent a page saving, but
I think people are open to the idea that a clear error could render
specially (eg, in red), and a list of errors/warnings/suggestions be
generated.

> markup languages could have been a bit more formal without placing undue
> pressure on users to learn too much. The problem is that after the first

Yeah. And some of the constructs are just painfully bad. '''foo''' is
bold. ''foo'' is italics. ''''foo''' is apostrophe + bold. Except that
''''foo''' blah ''''' turns them into something completely different.
And for the ultimate (ok, not quite) in non-LALR:

{|
|this is a table cell. except that a table cell can be preceded by a
CSS "style" property. Which is what this text will turn out to be if
ever we hit a single pipe. Which could be a very, very long time. |
Oops, there it is. Turns out it *was* a style property after all! ||
Now here's another cell. The same story applies. || Ah, turns out that
one didn't have a style property...
|}
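The table case above is exactly the shape a targeted syntactic predicate handles: you can only commit to the "style attributes" reading once a later single pipe proves there was one. A minimal ANTLR v3 sketch (rule names are invented, not from any real MediaWiki grammar):

```antlr
// Hypothetical rules -- the predicate trial-parses ahead, looking for
// the terminating '|' before committing to the attributes alternative.
table_cell
    : (cell_attrs '|')=> cell_attrs '|' cell_content
    | cell_content
    ;
```

If the predicate fails, the parser falls back to the plain-cell alternative instead of erroring out.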

> week, when it seems easy, you start to run into things that ought to
> render but just print out literally. Because there is no rigid syntax,
> the wiki renderer cannot do anything sensible such as say "Did you leave
> out a ']' here --^ ?".

Hopefully with The New Parser (still a fair way off...for starters we
need a PHP language target) this will become possible.

> So, because invalid looking syntax just renders without error, you
> create a situation where all you can really do is try every possibility
> and just revert to a catch all "spit it out" rule if nothing matches.
> This is of course what backtracking will do for you. In the case of a
> wiki page, I would think that the overhead is fairly insignificant
> unless your higher level rules cause the entire page to be traversed 30
> times (which is what memoizing helps with of course).

It's funny, I never even considered switching backtracking on. I may
have tried it once, and the grammar didn't work, so I turned it off
again. In any case, I don't think I would have really understood how
ANTLR worked if I'd had it on from the start. I also suspect that for
reasons mentioned in another thread, better performance may arise from
targeted synpreds. Performance *is* quite important.
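For comparison, the two approaches look roughly like this in an ANTLR v3 grammar (a hedged sketch; the rule names are hypothetical):

```antlr
// Global: every ambiguous decision is resolved by trial parsing.
options { backtrack=true; memoize=true; }

// Targeted: only this one decision pays the backtracking cost.
quoted_text
    : ("'''")=> bold_text
    | italic_text
    ;
```

The global option is the quick fix; a targeted synpred keeps the cost confined to the one decision that actually needs the lookahead.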

> Yes, such things are reasonable. There is no need to be a purist about
> such things. Personally I am much more practical about this sort of
> thing. If you end up with a grammar that no one would say is pretty but
> is eminently maintainable, then you have improved your lot providing it
> is fast enough.

Grasshopper still can't distinguish "pretty" from "ugly" and
"maintainable" from "horrible mess".

> Yes, but backtracking with memoizing will help a lot here as once the
> parser hits the same input point with the same rule and has found that
> does not match, it will find it memorized and just skip a second and
> subsequent attempt. So, with wiki markups, I suggest that having global
> level backtracking and memorization is in fact what you want.

Yeah, I only just discovered memoizing. Perhaps it should be the default?
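For anyone who hasn't met it: memoizing caches the result of each (rule, position) pair, so when two backtracking alternatives start with the same rule, the second attempt is a table lookup instead of a re-parse. A toy sketch in Python (illustrative only, not ANTLR's actual machinery):

```python
def make_parser(text):
    memo = {}
    stats = {"calls": 0}

    def letters(pos):
        # Rule: one or more 'a' characters starting at pos; returns the
        # end position or None.  Memoized on the (rule, position) pair.
        key = ("letters", pos)
        if key in memo:
            return memo[key]
        stats["calls"] += 1
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        result = end if end > pos else None
        memo[key] = result
        return result

    def cell(pos):
        # Two backtracking alternatives that both begin with `letters`,
        # like the wiki-table case: try "attrs | content", then fall
        # back to plain "content".  The fallback hits the memo instead
        # of re-scanning the same input.
        end = letters(pos)
        if end is not None and end < len(text) and text[end] == "|":
            return ("attrs+content", end + 1)
        end = letters(pos)  # backtrack: same rule, same position
        if end is not None:
            return ("content", end)
        return None

    return cell, stats

cell, stats = make_parser("aaaa")
print(cell(0), stats["calls"])  # rule body ran once; retry was a memo hit
```

In ANTLR 3 the equivalent is the memoize=true option; the counter here just makes the saving visible.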

> No, not really. A series of regular expressions (such as {f}lex for
> instance) still traverses stuff and then rejects until it finds something
> that works. That is probably faster because it is just whipping through
> characters, but is more difficult to maintain and less obvious that it
> lets through things you didn't mean to (which then become de facto
> standards ;-).

Plus, if you look at the current parser, it ends up doing a lot of
work twice, as it has to keep applying the same constructs. "Replace
every X with Y...except that there could be links, so first replace
those with something else, then put them back afterwards".
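That strip-and-restore dance looks roughly like this (a hedged Python sketch of the pattern, not MediaWiki's actual code; the function and token names are invented):

```python
import re

def render_bold(text):
    # Protect [[links]] so the bold pass can't touch their contents.
    links = []
    def stash(m):
        links.append(m.group(0))
        return f"\x07{len(links) - 1}\x07"   # opaque placeholder token
    text = re.sub(r"\[\[.*?\]\]", stash, text)

    # Now apply the bold rule to what's left.
    text = re.sub(r"'''(.*?)'''", r"<b>\1</b>", text)

    # Put the links back afterwards.
    return re.sub(r"\x07(\d+)\x07", lambda m: links[int(m.group(1))], text)

print(render_bold("'''x''' and [[a'''b]]"))  # <b>x</b> and [[a'''b]]
```

Every pass that could collide with links has to repeat that protect/restore cycle, which is where the duplicated work comes from.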

> grappling hook, then just go ahead and beat it in to shape. If you are
> the person that defines what the language is then you might take the
> opportunity to cull some things from it. Having to escape things that
> render literally if you don't want the renderer to say "Did you leave
> out a ']' here?" probably only affects a few people?

Right. There is agreement that the language should change as little as
possible with the implementation of the new parser (mostly for social
reasons), but I'm looking forward to bending the language around a bit
after that ;)

Steve

