[antlr-interest] Resolving ambiguities in Lexer rules

Joseph Areeda newsreply at areeda.com
Sat Aug 15 17:34:40 PDT 2009


Achint,

I am a newbie here too.  I don't mean to talk from a position of authority.

The real question is whether your language has reserved keywords.  Can
you say in your language that Integer is always a "type", or is it merely
a string that has a meaning only in certain contexts?

Yes it is restrictive to have reserved words but it makes the parsing 
much easier.
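
To make the trade-off concrete, here is a minimal sketch in plain Python
rather than ANTLR. The names KEYWORDS, tokenize, and parse_requestline are
made up for illustration and are not part of any ANTLR API; the shape of the
input (name, '=', value) is borrowed from the sample grammar quoted further
down. With reserved=True the lexer turns VERSION into a keyword everywhere;
with reserved=False every word is a plain string and the parser checks the
name only where a name is expected.

```python
import re

# Hypothetical set of names allowed on the left of '=' (from the sample grammar).
KEYWORDS = {"VERSION", "V", "COUNT", "C"}
TOKEN_RE = re.compile(r"(?P<WORD>[A-Za-z]+)|(?P<EQUAL>=)")


def tokenize(text, reserved):
    """Split text into (type, value) pairs.

    reserved=True:  any word in KEYWORDS is a KEYWORD token everywhere.
    reserved=False: every word is a STRING; meaning is decided by the parser.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup == "EQUAL":
            tokens.append(("EQUAL", "="))
        elif reserved and m.group() in KEYWORDS:
            tokens.append(("KEYWORD", m.group()))
        else:
            tokens.append(("STRING", m.group()))
        pos = m.end()
    return tokens


def parse_requestline(text, reserved):
    """Parse NAME=VALUE and return (name, value), or raise ValueError."""
    tokens = tokenize(text, reserved)
    if len(tokens) != 3 or tokens[1][0] != "EQUAL":
        raise ValueError("expected NAME=VALUE")
    (name_type, name), _, (value_type, value) = tokens
    if reserved:
        # The lexer already decided: names must be keywords, values must not be.
        if name_type != "KEYWORD":
            raise ValueError(f"{name!r} is not an allowed name")
        if value_type != "STRING":
            raise ValueError(f"{value!r} is reserved and cannot be a value")
    else:
        # The check happens in context: only the name position is restricted.
        if name not in KEYWORDS:
            raise ValueError(f"{name!r} is not an allowed name")
    return name, value
```

Under these assumptions, parse_requestline("VERSION=VERSION", reserved=False)
returns the pair ("VERSION", "VERSION"), while the same call with
reserved=True raises ValueError. VERSION=FIRST is accepted either way.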

Joe

Achint Mehta wrote:
> Hi Joe,
>
> Thanks for your response.
>
> You have proposed two solutions:
> 1. Replace ver with SPECIAL_STRING and check in the target code for
> allowed values. This means that if I intend to collect a generic
> unquoted string in an ANTLR parser, then I cannot use any tokens in the
> whole parser. In a big parser, this seems to be a limitation, since
> the target language program has to validate every string where a
> token should have been placed in the parser.
>
> 2. The second option is that all the tokens have to be given as
> alternatives to SPECIAL_STRING. Again, in a big or complicated parser,
> all the tokens in the whole parser have to be repeated wherever I
> intend to use SPECIAL_STRING. This can be simplified if I give the
> tokens in the definition of SPECIAL_STRING itself. But still, in a
> parser which could use tens or hundreds of tokens, it would seem to be
> impractical to repeat all the tokens in the SPECIAL_STRING rule and
> other similar rules (intended for collecting the generic string).
>
> The parser that I have put in the e-mail is a simplified version of 
> the issue I am facing. I am writing a SIP protocol message parser. The 
> very first line of a SIP message starts as (I am compressing the rules 
> for clarity):
>
> Method SPACE Request-URI ... (other rules follow)
> Method: "INVITE" | "ACK" | "OPTIONS" | "BYE" | "CANCEL" | "REGISTER"
> Request-URI boils down to : "sip:" [userinfo "@"] hostport 
> url-parameters [headers]
> and userinfo is an unquoted alpha-numeric string.
>
> If the SIP message starts as REGISTER SIP:REGISTER@...
> The parsing would fail if I write the rules as I mentioned in my 
> sample program earlier.
> The SIP protocol is filled with rules such as userinfo where unquoted
> alphanumeric strings have to be collected, and there are tens of
> tokens in its grammar. This is a typical scenario for any protocol
> grammar. I am not sure that repeating all tokens in rules or treating
> everything as a generic string would be a neat solution.
>
> I admit that I am a noob when it comes to familiarity with other
> lexers/parsers, and the rest of them might require some other
> work-around as well. But the situation seems common enough to have a
> straightforward solution (though I might be wrong).
>
> Thanks.
>
> Regards,
> Achint 
>
>
>
>     I don't see this as an ambiguity issue but rather a decision of
>     whether your grammar uses reserved words or not.
>     I'm not an expert by any means, but that doesn't mean I don't have an
>     opinion, just that you should take it with a grain of salt.
>
>     You can either handle this with a symbol table later in the process or
>     rewrite the requestline to something like
>     requestline : ver EQUAL (SPECIAL_STRING | ver);
>
>     Joe
>
>
>     Achint Mehta wrote:
>     > Hi All,
>     >
>     > The section "Ambiguities and Non determinisms" of the book "The
>     > Definitive ANTLR Reference" talks about ambiguities in lexer rules,
>     > but I am not sure how to resolve them.
>     >
>     > Consider the following grammar, which assigns a value to an ID. The
>     > ID can be either VERSION or COUNT, while its value can be anything:
>     > -----------------------------------------------
>     > grammar sample_parser;
>     >
>     > requestline : ver EQUAL SPECIAL_STRING ;
>     >
>     > /* Tokens */
>     > ver:('VERSION'| 'V') {}
>     >       | ('COUNT' | 'C') {} ;
>     >
>     >
>     > SPECIAL_STRING:(CHAR)+ ;
>     > WHITESPACE: ' ';
>     > NEWLINE: ('\r')? '\n';
>     > EQUAL: '=';
>     >
>     > fragment
>     > CHAR: (('a'..'z')|('A'..'Z'));
>     > -----------------------------------------------
>     >
>     > If the input is
>     > VERSION=FIRST
>     > then it works, but if the input is
>     > VERSION=VERSION
>     > then I get an error (MissingTokenException after the "=").
>     >
>     > How can this ambiguity be resolved?
>     >
>     > Thanks in advance.
>     >
>     > Regards,
>     > Achint
>     >
>     ------------------------------------------------------------------------
>     >
>     >
>     > List: http://www.antlr.org/mailman/listinfo/antlr-interest
>     > Unsubscribe:
>     http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>     >
>
>