[antlr-interest] DMQL Grammar - ANTLR Eats Characters

Fri Mar 20 14:56:27 PDT 2009

Hey Jim,

You are right. I have to embrace the philosophy of the tool. It's my fault I
didn't learn enough about it in due time, even though I did buy the book.
It's just that we sometimes try to learn just enough, however much we feel
that is, and continue.

DMQL proves a great example of a nice, albeit ambiguous, grammar, but one
that poses quite a few challenges to the less well-versed ANTLR coder. The
grammar is aimed at describing a simple language, and as a result dates,
strings, and numbers are neither enclosed in quotes nor introduced by special
keywords. Yes, there's ambiguity, as with the tokens that have special
meaning, but that's something we have to live with at the parsing stage and
can resolve later (based on field metadata). In my implementation, TODAY
will be interpreted as a string if at all possible. Only if it's in a range
does it conjure up a date.
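
A minimal sketch of that late interpretation, written in Python rather than
as ANTLR actions to keep it short. The name interpret_value and the in_range
flag are hypothetical, and the field-metadata lookup that the real decision
would also consult is left out.

```python
from datetime import date

def interpret_value(text, in_range, today=None):
    """Treat TODAY as a date only inside a range; otherwise keep the raw string.

    `in_range` is a hypothetical flag the parser would set while it is
    inside a range expression.
    """
    if in_range and text == "TODAY":
        return today or date.today()
    return text
```

Outside a range, interpret_value("TODAY", False) simply hands back the
string "TODAY"; the caller decides what to do with it.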

I had started with longer tokens; I have the feeling that single-character
tokens can cause unnecessary backtracking during the parsing stage. But I
kept having issues with ranges, numbers, or ISO date parsing. It would
either parse "123-456" as a string followed by a negative number, which
would then fail, or choke on "123--456", or choke on the time zone specifier
-12:34, thinking there's a negative number in there, etc. Surely those
predicates introduced by => would have helped (they're called syntactic
predicates, I think), but it seemed like a lot to give just to get there;
that's how I ended up with the simplest of tokens, like A and D, and moved
much of the alphanumerics into parser rules.
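
To make the "123-456" problem concrete, here is a rough Python sketch (not
ANTLR). The regular expressions and function names are made up for the
illustration, and it assumes a range is two integers separated by '-', as in
the examples above. A token that greedily includes the sign swallows the
range separator; deciding at the structural level what each '-' means
avoids that.

```python
import re

# A "longer" token: an integer with an optional leading sign.
SIGNED_NUMBER = re.compile(r"-?\d+")

def lex_signed(text):
    """Tokenize with the signed-number token only."""
    return SIGNED_NUMBER.findall(text)

def parse_range(text):
    """Read LOW-HIGH, where each bound may itself carry a sign."""
    m = re.fullmatch(r"(-?\d+)-(-?\d+)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None

print(lex_signed("123-456"))    # ['123', '-456']  (the separator is lost)
print(parse_range("123-456"))   # (123, 456)
print(parse_range("123--456"))  # (123, -456)
```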

To be truthful, I wanted to capture as much validation as I could in the
grammar. I could have worked with NumberOrStringOrTodayTokenOrAndToken kinds
of tokens and had them figured out in code, but I felt it could be done
without. (And it can, but not necessarily in an elegant manner: short of
some missing prefix enumerations, the grammar works; my test coverage is
pretty solid.)

I didn't manage to keep the grammar that clean. As you can see by looking at
it, there are already some less-than-optimal results due to the massaging
that took place here and there. For example, why does fieldValue accept only
non-integer numbers? Well, it doesn't; it's just that integers are matched
by other rules. And so on. Sometimes I almost feel that a nice, albeit
ambiguous, representation of the grammar should be possible, and that it
should be fixable with additional rules. That could make for a more readable
grammar.

Anyway, my grammar had long been fixed with prefix enumerations when I
noticed Indhu's initial reply today, but some notes on possible enhancements
to the tool still remain. As posted in my previous e-mail, the grammar
validator could point out instances where the lexer might have weaknesses.
And the automatic recovery feature seems a bit dangerous if enabled by
default; at least, it did let an error sneak by my unit tests for several
months.

Let's not hesitate to point out, however, that we do have a great and
powerful tool on our hands here. :)

Mihai

2009/3/20 Jim Idle <jimi at temporal-wave.com>

>  Mihai Danila wrote:
>
>
>
> A question still remains. If the lexer cannot create a valid token without
> dropping characters, shouldn't it fall back and try to produce smaller
> tokens (which my grammar allows for, the smaller tokens being D and A) to
> give a chance to the parser? Apparently, the lexer is prematurely moving
> into an error state without noticing that a different token arrangement
> would keep it in the green.
>
>  Remember that this is not {f}lex. The lexer does not try each possible
> match in turn then go on to the next when one fails. ANTLR lexers are more
> programmatic in nature; they are both more flexible (no pun intended) and
> more prone to getting things wrong without more explicit instructions
> (though some things are earmarked for improvement).
>
> If you do this type of thing though:
>
> fragment TODAY : ;
> fragment TOMORROW : ;
> fragment D : ;
>
> Alpha :
>       ('TODAY')=>'TODAY' { $type = TODAY; }
>     | ('TOMORROW')=>'TOMORROW' { $type = TOMORROW; }
>     | {canBeD}?=> 'D' { $type = D; }
>     | ('a'..'z'|'A'..'Z')
>     ;
>
> ErrChar : . { /* record illegal char error here */ skip(); } ;
>
>
> BTW - We all need to get past assuming that the longest match thing is
> causing problems. In most cases this isn't the issue per se, but lack of
> enough guidance to the rules to tell ANTLR what you want the outcomes to be.
> There are various opinions on what the analysis should do by default of
> course and some of that I believe Ter has already said he is going to try to
> make more 'intuitive'.
>
> But for now, remember that ANTLR will just look for enough 'stuff' to
> decide that this is the only rule that can work from here. It does not come
> back after going down that path and say "Oh, that was wrong, let me try a
> simpler/the next rule" unless you tell it that that is what you want, a
> little more explicitly than you would in flex.
>
> In the end though, if you don't cover all the bases, you will just get a
> mismatched character, and you should not get such a thing from a lexer.
> Also, it is always better to accept any old rubbish in a token when you can
> and then verify the char sequence later. So, if you can't have '\u0567' in
> an ID, accept it anyway (assuming it isn't some other valid separator etc),
> then when you see the ID token in the parser, call isValidId() and print
> "Error - the identifier ... cannot contain char '\u0567' - line 45, offset
> 17". This makes much more sense than the "Illegal char:" from the default lexer
> error.
>
> Similarly, do not try to match 'D' on its own. Just look for ID in the
> parser then have a check to see that it is a D, or separate the parsing of
> that type of construct into an island grammar or external grammar. Divide
> and conquer etc.
>
>
> Jim
>
>
>
>
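
The accept-then-verify advice in the quoted reply can be sketched roughly as
follows, in Python rather than in ANTLR actions. Everything here is
hypothetical: check_id stands in for the isValidId() check mentioned above,
the allowed character set is a guess, and the token's line and column are
assumed to be supplied by the lexer.

```python
import string

# Assumed set of characters an identifier may contain (a guess for the sketch).
ALLOWED = set(string.ascii_letters + string.digits + "_")

def check_id(text, line, column):
    """Return an error message for the first disallowed character, else None.

    `column` is the offset of the token's first character on its line.
    """
    for i, ch in enumerate(text):
        if ch not in ALLOWED:
            return (f"Error - the identifier {text!r} cannot contain char "
                    f"{ch!r} - line {line}, offset {column + i}")
    return None

def is_d(text):
    """Parser-side check: is this generic ID token really the D keyword?"""
    return text == "D"
```

The lexer accepts the whole identifier, odd characters included, and the
check runs once the parser sees the token; the same pattern covers D, which
is matched as an ordinary ID and then tested with is_d().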