[antlr-interest] Lexical nondeterminism

Fri Jan 13 02:51:51 PST 2006

Dear John,

What you suggested worked just fine, apart from
"WS_ : (' ' | '\t') { $setType(SKIP); } ;" where, when generating a C++
parser, SKIP needs to be preceded by its namespace.

Thank you for your help!

Kind regards,
Gabriel

On 11/01/06, John B. Brodie <jbb at acm.org> wrote:
>
> Gabriel Radu asked:
> >I am trying to write an ANTLR grammar and I am getting the following result:
> >
> >ANTLR Parser Generator   Version 2.7.5 (20050128)   1989-2005 jGuru.com
> >ServiceCompiler.g: warning:lexical nondeterminism between rules
> >INT_or_FLOAT_or_MACADR_or_VERSIONSTRING and DEFAULT upon
> >AuvitranServiceCompiler.g:     k==1:'D','d'
> >AuvitranServiceCompiler.g:     k==2:'E','e'
> >AuvitranServiceCompiler.g:     k==3:'F','f'
> >AuvitranServiceCompiler.g:     k==4:'A','a'
> >AuvitranServiceCompiler.g:     k==5:'U','u'
> >AuvitranServiceCompiler.g:     k==6:'L','l'
> >AuvitranServiceCompiler.g:     k==7:'T','t'
> >AuvitranServiceCompiler.g:     k==8:<end-of-token>
> >AuvitranServiceCompiler.g:     k==9:<end-of-token>
> >AuvitranServiceCompiler.g:     k==10:<end-of-token>
> >
> >The interesting parts of the lexer are:
> >
> >...lots of informative stuff snipped...
>
> You have:
>
> >protected INT
> >  :    (HEXDIG)+
> >;
>
> and
>
> >protected VERSIONSTRING_L
> >  : ( DIGIT )+ DOT ( DIGIT )+ DOT ( DIGIT )+ ('A'..'Z'|'a'..'z')?
> >;
> >
> >protected VERSIONSTRING_S
> >  : ( DIGIT )+ DOT ( DIGIT )+ ('A'..'Z'|'a'..'z')
> >;
> >
> >protected VERSIONSTRING : ;
> >
> >INT_or_FLOAT_or_MACADR_or_VERSIONSTRING
> >
> >   : ( DIGIT (DIGIT)? DOT DIGIT ( DIGIT (DIGIT)? )? DOT )
> >          => VERSIONSTRING_L { $setType( VERSIONSTRING ); }
> >
> >   | ( DIGIT (DIGIT)? DOT DIGIT ( DIGIT (DIGIT)? )? ('A'..'Z'|'a'..'z') )
> >          => VERSIONSTRING_S { $setType( VERSIONSTRING ); }
> >
> >   | ( ( DIGIT )+ DOT ) => FLOAT { $setType( FLOAT ); }
> >
> >   | ( HEXDIG HEXDIG MACADRSEPARATOR ) => MACADR { $setType( MACADR ); }
> >
> >   | ( ( DIGIT )+ ) => INT { $setType( INT ); }
> >
> >;
>
> and
>
> >DEFAULT:
> >    ('D' | 'd')
> >    ('E' | 'e')
> >    ('F' | 'f')
> >    ('A' | 'a')
> >    ('U' | 'u')
> >    ('L' | 'l')
> >    ('T' | 't')
> >;
>
> i believe that your ambiguity arises from INT being a sequence of
> HEXDIG (despite the predicate in the INT_or_FLOAT_...whatever rule).
>
> thus the input string `default` could be a DEFAULT or an INT followed
> by NONTOCLITs (`d`, `e`, `f`, `a` are all HEXDIGs; `u`, `l`, `t` are
> NONTOCLITs).
>
> while your k=10 lookahead would seem to be plenty to disambiguate this
> (just need to look at the first 5 symbols); it has been my
> experience that lookahead is not considered when one of the items being
> considered is expressed as a loop (e.g. either ()+ or ()*). that is, Antlr
> will not try to do the 5 symbol lookahead before entering the INT loop.
>
> so if an INT really is a sequence of HEXDIG then you will need to add
> another predicated alternative to your INT_or_...whatever rule.
>
> on the other hand if an INT is really a sequence of DIGIT then just
> fix the protected INT rule and set k=3 (I think, not tested)
> and you will have fixed this ambiguity.
>
>
> on another issue, which you did not (yet) ask about: you should be
> really careful with your syntax predicates. consider the input string
> "11.22.33.44.55.66". it would seem that this should scan as a MACADR,
> yet your predicate for VERSIONSTRING_L will match this string and you
> will end up scanning it as a VERSIONSTRING ("11.22.33") followed by DOT
> followed by another VERSIONSTRING (i think).
>
> attached is a version of your scanner that addresses this issue.
>
> hope this helps...
>
> //--------------------------begin attachment--------------------------
>
> //----------------------------------------------------------------------
> // Lexer
> //----------------------------------------------------------------------
>
> class ServiceLexer extends Lexer;
>
> //----------------------------------------------------------------------
> // White space:
>
> WS_ : (' ' | '\t') { $setType(SKIP); } ;
>
> NEWLINE
>     : '\n' ( '\r' )?
>     | '\r' ( '\n' )?
> ;
>
>
> //----------------------------------------------------------------------
> // Chars:
>
> NONTOCLIT
>     :   'g'..'u' | 'x'..'z'
>     |   'G'..'U' | 'X'..'Z'
> ;
>
> protected LETTER : 'A'..'Z' | 'a'..'z' ;
>
>
>
> //----------------------------------------------------------------------
> // Numbers:
>
> protected DIGIT
>         :       '0'..'9'
> ;
>
> protected HEXLIT
>   : 'a'..'f' | 'A'..'F'
> ;
>
> protected HEXDIG
>   : ( DIGIT | HEXLIT )
> ;
>
> protected INT
>   :     ( HEXDIG )+
> ;
>
> protected FLOAT
>   : ( DIGIT )+ DOT ( DIGIT )+
> ;
>
> protected MACADRSEPARATOR
>   : DOT
> ;
>
> protected MACADR
>   :
>     HEXDIG HEXDIG MACADRSEPARATOR
>     HEXDIG HEXDIG MACADRSEPARATOR
>     HEXDIG HEXDIG MACADRSEPARATOR
>     HEXDIG HEXDIG MACADRSEPARATOR
>     HEXDIG HEXDIG MACADRSEPARATOR
>     HEXDIG HEXDIG
> ;
>
> protected VERSIONSTRING
>   : ( DIGIT )+ DOT ( DIGIT )+ ( ( DOT ( DIGIT )+ ( LETTER )? ) | LETTER )
> ;
>
> INT_or_FLOAT_or_MACADR_or_VERSIONSTRING_or_DEFAULT
>     : ( DEFAULT ) => ( DEFAULT { $setType( DEFAULT ); } )
>     | ( MACADR ) => ( MACADR { $setType( MACADR ); } )
>     | ( VERSIONSTRING ) => ( VERSIONSTRING { $setType( VERSIONSTRING ); } )
>     | ( FLOAT ) => ( FLOAT { $setType( FLOAT ); } )
>     | ( INT ) => ( INT { $setType( INT ); } )
> ;
>
>
>
> //----------------------------------------------------------------------
> // Punctuation:
>
> DOT:    '.' ;
>
> COMMA:  ',' ;
>
> COLON:  ':' ;
>
> SCOLON: ';' ;
>
>
>
> //[ some more text]
>
>
>
> //----------------------------------------------------------------------
> protected DEFAULT:
>     ('D' | 'd')
>     ('E' | 'e')
>     ('F' | 'f')
>     ('A' | 'a')
>     ('U' | 'u')
>     ('L' | 'l')
>     ('T' | 't')
> ;
>
>
> //---------------------------end attachment---------------------------
>
>
>
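
The point about predicate order in John's reply can be illustrated
outside ANTLR. What follows is a minimal Python sketch, not ANTLR code:
it approximates each syntactic predicate and each rule body with a
regular expression and tries the alternatives top to bottom, taking the
first one whose predicate matches. The names ORIGINAL, REVISED and scan
are made up for this illustration, the DOT entry is added only so that
the simulation can consume a stray '.', and Python's backtracking regex
matching is only an approximation of how ANTLR 2 evaluates predicates.

```python
import re

# Regex approximations of the rule bodies (assumed equivalents, not ANTLR).
DEFAULT = r"[Dd][Ee][Ff][Aa][Uu][Ll][Tt]"
MACADR = r"[0-9a-fA-F]{2}(?:\.[0-9a-fA-F]{2}){5}"
VERSION = r"\d+\.\d+(?:\.\d+[A-Za-z]?|[A-Za-z])"
FLOAT = r"\d+\.\d+"
INT = r"[0-9a-fA-F]+"
DOT = r"\."

# Each entry: (token type, predicate regex, body regex), tried in order.
# ORIGINAL mirrors the short hand-written predicates from the question.
ORIGINAL = [
    ("VERSIONSTRING", r"\d\d?\.\d(?:\d\d?)?\.", r"\d+\.\d+\.\d+[A-Za-z]?"),
    ("VERSIONSTRING", r"\d\d?\.\d(?:\d\d?)?[A-Za-z]", r"\d+\.\d+[A-Za-z]"),
    ("FLOAT", r"\d+\.", FLOAT),
    ("MACADR", r"[0-9a-fA-F]{2}\.", MACADR),
    ("INT", r"\d+", INT),
    ("DOT", DOT, DOT),
]

# REVISED mirrors the attachment: each predicate is the whole rule, and
# the most specific alternatives (DEFAULT, MACADR) are tried first.
REVISED = [(name, rx, rx) for name, rx in [
    ("DEFAULT", DEFAULT), ("MACADR", MACADR), ("VERSIONSTRING", VERSION),
    ("FLOAT", FLOAT), ("INT", INT), ("DOT", DOT),
]]


def scan(text, alternatives):
    """Split text into (type, lexeme) pairs using ordered predicates."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, predicate, body in alternatives:
            if re.compile(predicate).match(text, pos) is None:
                continue  # predicate failed, try the next alternative
            matched = re.compile(body).match(text, pos)
            if matched is None:
                # The predicate committed us to a rule that cannot match.
                raise ValueError(f"{name} predicate passed but body failed at {pos}")
            tokens.append((name, matched.group()))
            pos = matched.end()
            break
        else:
            raise ValueError(f"no alternative matches at offset {pos}")
    return tokens
```

Under these assumptions, scan("11.22.33.44.55.66", ORIGINAL) gives
VERSIONSTRING "11.22.33", DOT, VERSIONSTRING "44.55.66", because the
short VERSIONSTRING_L predicate already succeeds on "11.22.", while the
same call with REVISED gives a single MACADR token.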