[antlr-interest] syntactic predicates vs. backtrack=true

Jim Idle jimi at temporal-wave.com
Sun Feb 10 10:08:12 PST 2008



> -----Original Message-----
> From: Steve Bennett [mailto:stevagewp at gmail.com]
> Sent: Sunday, February 10, 2008 5:21 AM
> To: antlr-interest
> Subject: Re: [antlr-interest] syntactic predicates vs. backtrack=true
> 
> On 2/6/08, Jim Idle <jimi at temporal-wave.com> wrote:
> > There are some languages that are not necessarily good targets for an
> > ANTLR grammar and there isn't a good way to parse them without
> > backtracking or lots of predicates. Wiki markup languages are like that
> > for instance because they are a bit haphazard and tend to have lots of
> > context. If you are not trying to parse such a language, then it is
> > usually possible to create a syntactic parser that has very few
> > ambiguities. Experience and effort is required. Experience lets you spot
> > the things that are truly an ambiguity in the language spec vs something
> > that you could specify in a better way.
> 
> Hey Jim, would you mind expanding on why parsing Wiki markup with
> ANTLR is a bad idea? I ask because, uh, I'm trying to write a parser
> for wikitext with ANTLR :)

I thought you might be :-)

> The two problems you raise are "a bit
> haphazard" and "lots of context". For me, the first is the entire
> reason for the ANTLR grammar: to define all the rules rigidly (for the
> first time) so that end users of the wikitext actually have a fixed
> understanding of what the grammar is. 

Ah, well, here you touch on two issues. I find a reasonably well-thought-out 
treatment of this in the VB.Net language spec, which deals with what will 
happen if: the spec doesn't specify something and a new version of the 
compiler does something different from a previous version; the spec 
specifies something incorrectly; the compiler implements something 
incorrectly; and so on. 

So, when you say "define all the rules rigidly", you could mean that you 
don't want to change the markup language at all, just fix it as it stands 
now, in time, forever; or you could mean that you want to create an 
actual formal spec and weed out those constructs that are 'obviously' 
incorrect. The latter implies, of course, that some pages that used to 
render correctly would no longer do so. I like wikis, but I do think that 
the markup languages could have been a bit more formal without placing 
undue pressure on users to learn too much. The problem is that after the 
first week, when it all seems easy, you start to run into things that 
ought to render but just print out literally. Because there is no rigid 
syntax, the wiki renderer cannot do anything sensible such as say "Did 
you leave out a ']' here --^ ?".

So, because invalid-looking syntax just renders without error, you 
create a situation where all you can really do is try every possibility 
and revert to a catch-all "spit it out" rule if nothing matches. This is 
of course exactly what backtracking will do for you. In the case of a 
wiki page, I would think that the overhead is fairly insignificant, 
unless your higher-level rules cause the entire page to be traversed 30 
times (which is what memoizing helps with, of course). 
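
To make that concrete, here is the sort of shape I mean, as a rough 
ANTLR v3 sketch (the grammar, rule and token names are all invented for 
illustration, not taken from your grammar):

    // Sketch only: invented names; the point is the ordering of alternatives
    grammar WikiSketch;

    options {
        backtrack = true;   // try each alternative until one matches
        memoize   = true;   // remember failed (rule, input position) pairs
    }

    page : item* EOF ;

    item
        : link                                 // structured constructs first
        | ( TEXT | LBRACK | RBRACK | PIPE )    // catch-all: emit it literally
        ;

    link : LBRACK LBRACK TEXT ( PIPE TEXT )* RBRACK RBRACK ;

    LBRACK : '[' ;
    RBRACK : ']' ;
    PIPE   : '|' ;
    TEXT   : ( ~( '[' | ']' | '|' ) )+ ;

With that layout, "[[foo|bar]]" comes out as a link, whereas the 
malformed "[[foo|bar]" fails the link alternative, backtracks, and falls 
through to the catch-all, i.e. it is spat out literally, token by token.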

> And the second I haven't found
> too bad: a small number of flags like "prohibit_literal_pipe" has
> really solved that problem: rather than using "context" like "I'm
> deeply nested inside a table, therefore I can't start a line with a
> pipe", I push the restrictions down: "This table cell is normal text,
> except no line can start with a pipe".

Yes, such things are reasonable; there is no need to be a purist about 
them. Personally, I am much more practical about this sort of thing: if 
you end up with a grammar that no one would call pretty but that is 
eminently maintainable, then you have improved your lot, provided it is 
fast enough. 
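
For what it is worth, that "push the restriction down" style maps quite 
naturally onto rule parameters plus a gated semantic predicate in ANTLR 
v3. A rough sketch of the idea (the rule names and the noPipe flag are 
invented for illustration, and it assumes the Java target):

    // Sketch only: invented names; the caller decides whether '|' is literal text
    grammar CellSketch;

    cell      : content[true]  ;    // table cell: a literal '|' is not allowed
    paragraph : content[false] ;    // ordinary text: '|' is just a character

    content[boolean noPipe]
        : ( TEXT
            // gated predicate: the PIPE alternative is only viable when
            // the caller says a literal pipe is acceptable here
          | { !noPipe }?=> PIPE
          )+
        ;

    PIPE : '|' ;
    TEXT : ( ~'|' )+ ;

The nice thing about carrying the restriction in a parameter is that it 
travels with the call, so a deeply nested rule never has to ask "am I 
inside a table right now?".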

> 
> The biggest difficulty has been that *everything* is valid syntax: if
> it doesn't match some particular language construction, it just
> renders literally:
> 
> [[image:foo.jpg|thumbnail|this is a foo!]] <- that renders an image
> 
> [[image:foo.jpg|thumbnail|this is a foo!] <- that renders literally
> 
> Which means that potentially everything has to be parsed several times
> for different constructs, and if all that fails, it has to be parsed
> once more as literally. Is this what you mean by "allowing different
> syntax paths"?

Yes, but backtracking with memoizing will help a lot here: once the 
parser hits the same input point with the same rule and has already 
found that it does not match, it will find the failure memoized and skip 
the second and subsequent attempts. So, with wiki markups, I suggest 
that global backtracking and memoization is in fact what you want.
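
The alternative, which is what the subject line of this thread is really 
about, is to keep the decisions explicit and mark just the troublesome 
ones with hand-written syntactic predicates, memoizing the rules they 
guard. Roughly (invented names again, just to show the shape):

    // Sketch only: invented names; explicit predicates instead of backtrack=true
    grammar PredSketch;

    options { memoize = true; }      // still worth it for the repeated attempts

    page : item* EOF ;

    item
        : ( image )=> image    // only commit to 'image' if it matches completely
        | TEXT                 // otherwise the tokens just come out as literals
        | LBRACK
        | RBRACK
        ;

    image : LBRACK LBRACK TEXT RBRACK RBRACK ;

    LBRACK : '[' ;
    RBRACK : ']' ;
    TEXT   : ( ~( '[' | ']' ) )+ ;

For a wiki markup, though, where nearly every construct has a 
literal-text fallback, you would end up sprinkling predicates 
everywhere, which is why I would start with the global options and only 
tighten things up later if profiling says you must.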

> 
> Anyway, I'd like to know if there are other reasons why ANTLR is a bad
> choice for a wikitext parser.

No, not really. A series of regular expressions (such as {f}lex, for 
instance) still traverses the input and rejects until it finds something 
that works. That is probably faster, because it is just whipping through 
characters, but it is more difficult to maintain, and it is less obvious 
when it lets through things you didn't mean to allow (which then become 
de facto standards ;-).

Basically, if you like using ANTLR, then with enough ingenuity you will 
get that hammer to work just fine, so long as you keep in mind that you 
are dealing with something it is not exactly intended for; don't beat up 
on the tool because it doesn't have a grappling hook, just go ahead and 
beat it into shape. If you are the person that defines what the language 
is, then you might take the opportunity to cull some things from it. 
Having to escape things that currently render literally, as the price of 
a renderer that can say "Did you leave out a ']' here?", probably only 
affects a few people?

> 
> Many thanks,
> Steve



