[antlr-interest] syntactic predicates vs. backtrack=true
Jim Idle
jimi at temporal-wave.com
Sun Feb 10 10:08:12 PST 2008
> -----Original Message-----
> From: Steve Bennett [mailto:stevagewp at gmail.com]
> Sent: Sunday, February 10, 2008 5:21 AM
> To: antlr-interest
> Subject: Re: [antlr-interest] syntactic predicates vs. backtrack=true
>
> On 2/6/08, Jim Idle <jimi at temporal-wave.com> wrote:
> > There are some languages that are not necessarily good targets for an
> > ANTLR grammar and there isn't a good way to parse them without
> > backtracking or lots of predicates. Wiki markup languages are like that
> > for instance because they are a bit haphazard and tend to have lots of
> > context. If you are not trying to parse such a language, then it is
> > usually possible to create a syntactic parser that has very few
> > ambiguities. Experience and effort is required. Experience lets you spot
> > the things that are truly an ambiguity in the language spec vs something
> > that you could specify in a better way.
>
> Hey Jim, would you mind expanding on why parsing Wiki markup with
> ANTLR is a bad idea? I ask because, uh, I'm trying to write a parser
> for wikitext with ANTLR :)
I thought you might be :-)
> The two problems you raise are "a bit
> haphazard" and "lots of context". For me, the first is the entire
> reason for the ANTLR grammar: to define all the rules rigidly (for the
> first time) so that end users of the wikitext actually have a fixed
> understanding of what the grammar is.
Ah, well, here you touch on two issues. There is a reasonably well-thought-out
treatment of this in the VB.Net language spec, which deals with what happens
if: the spec doesn't specify something and a new version of the compiler does
something different from a previous version; the spec specifies something
incorrectly; the compiler implements something incorrectly; and so on.
So, when you say "define all the rules rigidly", you could mean that you
don't want to change the markup language at all, just freeze it as it is
now, forever; or you could mean that you want to create an actual formal
spec and weed out those constructs that are 'obviously' incorrect. The
latter implies, of course, that some pages that used to render correctly
would no longer do so. I like wikis, but I do think that the markup
languages could have been a bit more formal without placing undue pressure
on users to learn too much. The problem is that after the first week, when
it seems easy, you start to run into things that ought to render but just
print out literally. Because there is no rigid syntax, the wiki renderer
cannot do anything sensible such as say "Did you leave out a ']' here --^ ?".
So, because invalid-looking syntax just renders without error, you create a
situation where all you can really do is try every possibility and revert to
a catch-all "spit it out" rule if nothing matches. This is of course exactly
what backtracking will do for you. In the case of a wiki page, I would think
that the overhead is fairly insignificant unless your higher-level rules
cause the entire page to be traversed 30 times (which is what memoizing
helps with, of course).
> And the second I haven't found
> too bad: a small number of flags like "prohibit_literal_pipe" has
> really solved that problem: rather than using "context" like "I'm
> deeply nested inside a table, therefore I can't start a line with a
> pipe", I push the restrictions down: "This table cell is normal text,
> except no line can start with a pipe".
Yes, such things are reasonable; there is no need to be a purist about them.
Personally I am much more practical about this sort of thing: if you end up
with a grammar that no one would call pretty but that is eminently
maintainable, then you have improved your lot, provided it is fast enough.
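For what it's worth, one way to express that kind of pushed-down flag in an
ANTLR v3 grammar is a rule parameter checked by a gated semantic predicate.
The rule and token names below are hypothetical and assume a Java target, so
again this is just a sketch of the idea rather than a drop-in rule:

    tableCell
        : cellText[true]    // inside a table cell: literal pipes are prohibited
        ;

    cellText[boolean prohibit_literal_pipe]
        : ( {!prohibit_literal_pipe}?=> PIPE  // alternative switched off when the flag is set
          | TEXT
          )+
        ;

The caller decides the restriction, and the rule itself stays free of any
knowledge about how deeply nested it is.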
>
> The biggest difficulty has been that *everything* is valid syntax: if
> it doesn't match some particular language construction, it just
> renders literally:
>
> [[image:foo.jpg|thumbnail|this is a foo!]] <- that renders an image
>
> [[image:foo.jpg|thumbnail|this is a foo!] <- that renders literally
>
> Which means that potentially everything has to be parsed several times
> for different constructs, and if all that fails, it has to be parsed
> once more as literally. Is this what you mean by "allowing different
> syntax paths"?
Yes, but backtracking with memoizing will help a lot here, as once the
parser hits the same input point with the same rule and has already found
that it does not match, it will find the attempt memoized and skip the
second and subsequent tries. So, with wiki markups, I suggest that global
backtracking and memoization is in fact what you want.
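To tie that back to the subject line: writing the syntactic predicate by
hand on the ambiguous alternative looks roughly like this (rule names
hypothetical, chosen to match your image example):

    inline
        : ( imageLink )=> imageLink   // syntactic predicate: try it, rewind silently if it fails
        | rawText                     // fallback: emit the text literally
        ;

Setting backtrack=true at the grammar level effectively has ANTLR insert
that kind of predicate on every decision that needs one, and memoize=true
records each failed rule/position pair so the same attempt is never made
twice on the same page.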
>
> Anyway, I'd like to know if there are other reasons why ANTLR is a bad
> choice for a wikitext parser.
No, not really. A series of regular expressions (as with {f}lex, for
instance) still traverses the input and rejects until it finds something
that works. That is probably faster because it is just whipping through
characters, but it is more difficult to maintain, and it is less obvious
when it lets through things you didn't mean to (which then become de facto
standards ;-).
Basically, if you like using ANTLR, then with enough ingenuity you will get
that hammer to work just fine. So long as you keep in mind that you are
trying to deal with something it is not exactly intended for, and you don't
beat up on the tool because it doesn't have a grappling hook, then just go
ahead and beat the grammar into shape. If you are the person who defines
what the language is, then you might take the opportunity to cull some
things from it. Having to escape things that you want rendered literally,
if you don't want the renderer to say "Did you leave out a ']' here?",
probably only affects a few people.
>
> Many thanks,
> Steve