[antlr-interest] Resolving ambiguities in Lexer rules
Joseph Areeda
newsreply at areeda.com
Sat Aug 15 17:34:40 PDT 2009
Achint,
I am a newbie here too, so I don't mean to talk from a position of authority.
The real question is whether your language has reserved keywords. Can you say
in your language that Integer is a "type", or is it merely a string that has
a meaning only in certain contexts?
Yes, it is restrictive to have reserved words, but it makes the parsing
much easier.
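
If the words only have a meaning in certain contexts, one way to express that is to drop the keyword literals from the grammar, so that the lexer only ever produces SPECIAL_STRING, and let a semantic predicate check the text where a keyword is expected. This is only a rough sketch: it assumes ANTLR 3 with the Java target (the predicates are Java code), it reuses the rule names from your sample, and I have not run it through the tool.

```antlr
grammar sample_parser;

requestline : ver EQUAL SPECIAL_STRING ;

// No keyword literals appear in the parser, so the lexer has no keyword
// tokens and the text VERSION is always lexed as SPECIAL_STRING.
// Each predicate is Java-target code that looks at the text of the next token.
ver
    : {input.LT(1).getText().equals("VERSION") || input.LT(1).getText().equals("V")}?
      SPECIAL_STRING
    | {input.LT(1).getText().equals("COUNT") || input.LT(1).getText().equals("C")}?
      SPECIAL_STRING
    ;

SPECIAL_STRING : CHAR+ ;
WHITESPACE     : ' ' ;
NEWLINE        : '\r'? '\n' ;
EQUAL          : '=' ;

fragment
CHAR : 'a'..'z' | 'A'..'Z' ;
```

The cost is that the allowed words now live in predicate code rather than in the token vocabulary, and the grammar is tied to one target language.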
Joe
Achint Mehta wrote:
> Hi Joe,
>
> Thanks for your response.
>
> You have proposed two solutions:
> 1. Replace ver with SPECIAL_STRING and check in the target code for
> allowed values. This means that if I intend to collect a generic
> unquoted string in an ANTLR parser, then I cannot use any tokens in the
> whole parser. In a big parser this seems to be a limitation: the
> target-language program has to validate every string where a
> token should have been placed in the parser.
>
> 2. The second option is that all the tokens have to be given as alternate
> rules/tokens with SPECIAL_STRING. Again, in a big/complicated parser,
> all the tokens in the whole parser have to be repeated wherever I
> intend to use SPECIAL_STRING. This can be simplified if I give the
> tokens in the definition of SPECIAL_STRING itself. But still, in a
> parser which could use tens or hundreds of tokens, it would seem to be
> impractical to repeat all the tokens in the SPECIAL_STRING rule and other
> similar rules (intended for collecting the generic string).
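
On option 2, the repetition can at least be kept in one place. Here is a minimal sketch, assuming ANTLR 3 and assuming the keywords become named lexer rules listed before SPECIAL_STRING, so that the keyword rule is preferred when both match the same text. The names anyString, VERSION and COUNT are mine, not from your grammar, and I have not run this through the tool.

```antlr
grammar sample_parser;

requestline : ver EQUAL anyString ;

ver : VERSION | COUNT ;

// Defined once: a plain string or any keyword token.
// Other rules that collect a generic string refer to anyString.
anyString : SPECIAL_STRING | VERSION | COUNT ;

// Keyword rules (hypothetical names) come before SPECIAL_STRING.
VERSION        : 'VERSION' | 'V' ;
COUNT          : 'COUNT' | 'C' ;
SPECIAL_STRING : CHAR+ ;
WHITESPACE     : ' ' ;
NEWLINE        : '\r'? '\n' ;
EQUAL          : '=' ;

fragment
CHAR : 'a'..'z' | 'A'..'Z' ;
```

Every new keyword still has to be added to anyString, but only there, not at each place where a generic string is collected.
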
>
> The parser that I have put in the e-mail is a simplified version of
> the issue I am facing. I am writing a SIP protocol message parser. The
> very first line of a SIP message starts as (I am compressing the rules
> for clarity):
>
> Method SPACE Request-URI ... (other rules follow)
> Method: "INVITE" | "ACK" | "OPTIONS" | "BYE" | "CANCEL" | "REGISTER"
> Request-URI boils down to : "sip:" [userinfo "@"] hostport
> url-parameters [headers]
> and userinfo is an unquoted alpha-numeric string.
>
> If the SIP message starts as REGISTER SIP:REGISTER@...
> the parsing would fail if I write the rules as I did in my
> sample program earlier.
> The SIP protocol is filled with rules such as userinfo, where unquoted
> alpha-numeric strings have to be collected, and there are tens of
> tokens in its grammar. This is a typical scenario for any protocol
> grammar. I am not sure that repeating all tokens in rules or treating
> everything as a generic string would be a neat solution.
>
> I admit that I am a noob when it comes to familiarity with other
> lexers/parsers, and the rest of them might require some other workaround
> as well. But the situation seems common enough to have a
> straightforward solution (though I might be wrong).
>
> Thanks.
>
> Regards,
> Achint
>
>
>
> I don't see this as an ambiguity issue but rather a decision of whether
> your grammar uses reserved words or not.
> I'm not an expert by any means, but that doesn't mean I don't have an
> opinion, just that you should take it with a grain of salt.
>
> You can either handle this with a symbol table later in the process or
> rewrite the requestline to something like
> requestline : ver EQUAL (SPECIAL_STRING | ver);
>
> Joe
>
>
> Achint Mehta wrote:
> > Hi All,
> >
> > The section "Ambiguities and Nondeterminisms" of the book "The
> > Definitive ANTLR Reference" talks about the ambiguities in lexer rules,
> > but I am not sure how to resolve them.
> >
> > Consider the following grammar, which assigns a value to an ID. The ID
> > can be either VERSION or COUNT, while its value can be anything:
> > -----------------------------------------------
> > grammar sample_parser;
> >
> > requestline : ver EQUAL SPECIAL_STRING ;
> >
> > /* Tokens */
> > ver:('VERSION'| 'V') {}
> > | ('COUNT' | 'C') {} ;
> >
> >
> > SPECIAL_STRING:(CHAR)+ ;
> > WHITESPACE: ' ';
> > NEWLINE: ('\r')? '\n';
> > EQUAL: '=';
> >
> > fragment
> > CHAR: (('a'..'z')|('A'..'Z'));
> > -----------------------------------------------
> >
> > If the input is given as
> > VERSION=FIRST
> > then it works, but if the following input is given
> > VERSION=VERSION
> > then I get an error (MissingTokenException after the "=").
> >
> > How can this ambiguity be resolved?
> >
> > Thanks in advance.
> >
> > Regards,
> > Achint
> >
> > ------------------------------------------------------------------------
> >
> >
> > List: http://www.antlr.org/mailman/listinfo/antlr-interest
> > Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
> >