[antlr-interest] Resolving ambiguities in Lexer rules

Achint Mehta achintmehta at gmail.com
Sat Aug 15 19:19:45 PDT 2009


Hi Joe,

I appreciate you helping me.
I am learning ANTLR, and from time to time need some ideas for
brainstorming.

The SIP parser I am working on parses SIP protocol messages. The SIP protocol
is text-based, like HTTP.

There are places where only a select set of keywords can appear, but the same
words can also appear elsewhere as ordinary strings (with or without quotes).
So the keywords are not reserved in the true sense.

Also, integer is not a strict data type. An input such as 12 (without any
quotes) can be treated as an integer or as an ordinary string, depending on
where it occurs.
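
To make that concrete, here is a rough sketch in Python (not ANTLR) of what I
mean by position deciding the type. The names parse_request_start and
URI_START, the regular expression, and example.com are all made up for
illustration, and it covers only the start of the request line, not the real
SIP grammar:

```python
import re

# Hypothetical sketch: names and regex are illustrative, not the SIP ABNF.
METHODS = {"INVITE", "ACK", "OPTIONS", "BYE", "CANCEL", "REGISTER"}

# "sip:" [userinfo "@"] host [":" port]; the rest of the URI is ignored here.
URI_START = re.compile(
    r"sip:(?:([A-Za-z0-9]+)@)?([A-Za-z0-9.\-]+)(?::([0-9]+))?",
    re.IGNORECASE,
)

def parse_request_start(line):
    """Classify each piece of 'Method SP Request-URI' by where it occurs."""
    method, _, rest = line.partition(" ")
    # First position: only the fixed method keywords are accepted.
    if method not in METHODS:
        raise ValueError("unknown method: %r" % method)
    m = URI_START.match(rest)
    if m is None:
        raise ValueError("bad Request-URI: %r" % rest)
    userinfo, host, port = m.groups()
    return {
        "method": method,      # keyword
        "userinfo": userinfo,  # always a plain string, even "REGISTER" or "12"
        "host": host,
        "port": int(port) if port is not None else None,  # integer context
    }
```

With this, "REGISTER sip:REGISTER@example.com:5060" yields REGISTER once as the
method keyword and once as plain userinfo, and the 12 in
"INVITE sip:12@example.com" stays a string, while a port is converted to an
integer.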

I wish I could control the syntax/semantics of the language, but SIP is a
well-established protocol.

Thanks again.

Regards,
Achint

On Sat, Aug 15, 2009 at 8:34 PM, Joseph Areeda <newsreply at areeda.com> wrote:

>  Achint,
>
> I am a newbie here too.  I don't mean to talk from a position of authority.
>
> The trick is whether you have keywords that are reserved.  Can you say in
> your language that Integer is a "type", or is it merely a string that has a
> meaning only in certain contexts?
>
> Yes, it is restrictive to have reserved words, but it makes the parsing much
> easier.
>
> Joe
>
> Achint Mehta wrote:
>
> Hi Joe,
>
> Thanks for your response.
>
> You have proposed two solutions:
> 1. Replace ver with SPECIAL_STRING and check in the target code for allowed
> values. This means that if I intend to collect a generic unquoted string in
> an ANTLR parser, then I cannot use any keyword tokens anywhere in the parser.
> In a big parser this seems to be a limitation: the target-language program
> has to validate every string where a token should have been used in the
> grammar.
>
> 2. The second option is to list all the tokens as alternatives alongside
> SPECIAL_STRING. Again, in a big or complicated parser, all the tokens in the
> whole parser have to be repeated wherever I intend to use SPECIAL_STRING.
> This can be simplified if I put the tokens in the definition of
> SPECIAL_STRING itself. But in a parser that could use tens or hundreds of
> tokens, it still seems impractical to repeat all of them in the
> SPECIAL_STRING rule and in other similar rules (intended for collecting
> generic strings).
>
> The parser that I have put in the e-mail is a simplified version of the
> issue I am facing. I am writing a SIP protocol message parser. The very
> first line of a SIP message starts as (I am compressing the rules for
> clarity):
>
> Method SPACE Request-URI ... (other rules follow)
> Method: "INVITE" | "ACK" | "OPTIONS" | "BYE" | "CANCEL" | "REGISTER"
> Request-URI boils down to : "sip:" [userinfo "@"] hostport url-parameters
> [headers]
> and userinfo is an unquoted alpha-numeric string.
>
> If the SIP message starts as REGISTER SIP:REGISTER@...
> the parsing would fail if I write the rules as in my earlier sample
> program.
> The SIP protocol is full of rules such as userinfo, where unquoted
> alphanumeric strings have to be collected, and there are tens of tokens in
> its grammar. This is a typical scenario for any protocol grammar. I am not
> sure that repeating all the tokens in rules, or treating everything as a
> generic string, would be a neat solution.
>
> I admit that I am a noob when it comes to familiarity with other
> lexers/parsers, and the rest of them might require some other workaround as
> well. But the situation seems common enough to have a straightforward
> solution (though I might be wrong).
>
> Thanks.
>
> Regards,
> Achint
>
>
>>
>> I don't see this as an ambiguity issue but rather a decision of whether
>> your grammar uses reserved words or not.
>> I'm not an expert by any means, but that doesn't mean I don't have an
>> opinion, just that you should take it with a grain of salt.
>>
>> You can either handle this with a symbol table later in the process or
>> rewrite the requestline to something like
>> requestline : ver EQUAL (SPECIAL_STRING | ver);
>>
>> Joe
>>
>>
>> Achint Mehta wrote:
>> > Hi All,
>> >
>> > The section "Ambiguities and Nondeterminisms" of the book "The
>> > Definitive ANTLR Reference" talks about ambiguities in lexer rules,
>> > but I am not sure how to resolve them.
>> >
>> > Consider the following grammar, which assigns a value to an ID. The ID
>> > can be either VERSION or COUNT, while its value can be anything:
>> > -----------------------------------------------
>> > grammar sample_parser;
>> >
>> > requestline : ver EQUAL SPECIAL_STRING ;
>> >
>> > /* Tokens */
>> > ver:('VERSION'| 'V') {}
>> >       | ('COUNT' | 'C') {} ;
>> >
>> >
>> > SPECIAL_STRING:(CHAR)+ ;
>> > WHITESPACE: ' ';
>> > NEWLINE: ('\r')? '\n';
>> > EQUAL: '=';
>> >
>> > fragment
>> > CHAR: (('a'..'z')|('A'..'Z'));
>> > -----------------------------------------------
>> >
>> > If the input is given as
>> > VERSION=FIRST
>> > then it works, but if the following input is given
>> > VERSION=VERSION
>> > then I get an error (MissingTokenException after the "=").
>> >
>> > How can this ambiguity be resolved?
>> >
>> > Thanks in advance.
>> >
>> > Regards,
>> > Achint
>> > ------------------------------------------------------------------------
>> >
>> >
>> > List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> > Unsubscribe:
>> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>> >
>>
>