[antlr-interest] Re: lexer question from newbie

Thu Dec 22 18:23:42 PST 2005

"Terence Parr" <parrt at cs.usfca.edu> wrote in message
news:C1F9B0BE-4ED0-4FA3-B1ED-387FC2362B83 at cs.usfca.edu...
>
> On Dec 22, 2005, at 4:40 PM, Stuart wrote:
>
> > I am trying to use Antlr for the first time.  I have used
> > yacc/lex only a couple times, and this is my first time
> > using an LL parser so I am basically clueless... :-)
> >
> > Below is my attempt at a lexer for a simple LaTeX-like
> > language.  It was fine until I added the ECHR rule.
> > Now, I get a warning:
> >   latex.g: warning:lexical nondeterminism between rules CMD and TEXT upon
> >   latex.g:     k==1:'\\'
> >   latex.g:     k==2:'A'..'Z','a'..'z'
>
> I think this is a limitation of the linear approx lookahead.  It
> improperly combines sets so \A looks like TEXT can match it.  Since
> CMD is first, it will resolve properly, however.  I'm pretty sure you
> can ignore the warning.  Sorry for the ugliness.
>
> Ter

Thanks for the reply, but I'm not sure that's the problem.
test2.g is the lexer with the problem I posted (a copy is
below).  test1.g is the same lexer but with the ECHR rule
removed (which works ok).  The test2 grammar gives the
same (wrong) result (an exception) whether the CMD
rule is at the start of the rules or the end.

Here are results produced by both lexers (edited to add
token names after the token numbers):

C:> python test2.py
\cmd text \xcmd ytext
["\cmd",<6>CMD,line=1,col=1]
error: exception caught while lexing:  unexpected char: 'x'

C:> python test1.py
\cmd text \xcmd ytext
["\cmd",<6>CMD,line=1,col=1]
[" text ",<8>TEXT,line=1,col=5]
["\xcmd",<6>CMD,line=1,col=11]
^Z
[" ytext
",<8>TEXT,line=1,col=16]

Test2 fails the same way whether the CMD rule is at the top
or the bottom of the rule set.

Below is test2.g again (test1.g is the same except that ECHR
was removed).  As I said, when compiled, test2 generates the
warning:
  test2.g: warning:lexical nondeterminism between rules CMD and TEXT upon
  test2.g:     k==1:'\\'
  test2.g:     k==2:'A'..'Z','a'..'z'
But when ECHR is removed (test1.g) no warning is generated,
and the result is correct (my definition of correct :-).
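
To make the "combines sets" remark concrete, here is a rough Python sketch of how a per-depth (linear approximate) lookahead check could report this conflict even though no actual two-character prefix is shared by CMD and TEXT.  This is not ANTLR's analysis code: the helpers pairs(), approx() and overlap() and all the set names are made up for illustration, and the model is simplified (it assumes a conflict is reported only when the sets overlap at every depth, and it ignores one-character TEXT tokens).

```python
import string

LETTERS = set(string.ascii_letters)
ESCAPABLE = set(" &$%{}")                              # chars allowed after '\' in ECHR
TCHR = {chr(c) for c in range(128)} - set("\\{}[]\n")  # the TCHR rule over 7-bit ASCII
ANY_TEXT_START = TCHR | {"\\", "\n"}                   # first char of any TEXT alternative

def pairs(first, second):
    """Exact two-character prefixes: every (char1, char2) combination."""
    return {(a, b) for a in first for b in second}

def approx(pair_set):
    """Linear approximation: one set per depth, forgetting which pairs go together."""
    return {a for a, _ in pair_set}, {b for _, b in pair_set}

def overlap(p, q):
    """Per-depth intersection of the approximated lookahead sets."""
    (p1, p2), (q1, q2) = approx(p), approx(q)
    return p1 & q1, p2 & q2

# Two-character prefixes of each rule (simplified model of the grammar).
cmd = pairs({"\\"}, LETTERS)
text_with_echr = pairs(TCHR | {"\n"}, ANY_TEXT_START) | pairs({"\\"}, ESCAPABLE)
text_without_echr = pairs(TCHR | {"\n"}, TCHR | {"\n"})

exact = cmd & text_with_echr                       # real shared prefixes
k1, k2 = overlap(cmd, text_with_echr)              # test2.g: ECHR present
k1_no, k2_no = overlap(cmd, text_without_echr)     # test1.g: ECHR removed

print("exact shared prefixes:", len(exact))
print("with ECHR    k==1:", sorted(k1), " k==2:", "".join(sorted(k2)))
print("without ECHR k==1:", sorted(k1_no), " k==2:", "".join(sorted(k2_no)))
```

In this model the exact prefix sets never intersect, but with ECHR present the per-depth sets overlap on the backslash at depth 1 and on the letters at depth 2, which are the same sets named in the warning.  Without ECHR the depth-1 overlap is empty, which would be consistent with test1.g compiling without a warning.  The sketch says nothing about the runtime exception, only about where the warning could come from.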

test2.g:
---------
options {language="Python";}
class test2 extends Lexer;
options {
    k = 2;
    charVocabulary='\u0000'..'\u007F'; // ascii
    }
LCB : '{' ;
RCB : '}' ;
protected
ECHR : '\\' (' ' | '&' | '$' | '%' | '{' | '}') ;
protected
TCHR : (~( '\\' | '{' | '}' | '[' | ']' | '\n')) ;
TEXT : (TCHR | ECHR | '\n' {$nl})+ ;
CMD : '\\' ( 'a'..'z' | 'A'..'Z' )+ ('*')? ;
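
For reference, the token stream I expect from this grammar can be sketched with Python's re module.  This is only an illustration of the intended result, not a substitute for the ANTLR lexer: the regex engine backtracks where the generated lexer commits on lookahead, and the names TOKEN_RE and tokenize() are made up for this sketch.

```python
import re

# One named group per non-protected rule in test2.g.
# TEXT is (TCHR | ECHR | '\n')+ with TCHR and ECHR inlined.
TOKEN_RE = re.compile(
    r"(?P<CMD>\\[A-Za-z]+\*?)"
    r"|(?P<LCB>\{)"
    r"|(?P<RCB>\})"
    r"|(?P<TEXT>(?:[^\\{}\[\]\n]|\\[ &$%{}]|\n)+)"
)

def tokenize(s):
    """Return a list of (rule name, matched text) pairs for the input string."""
    pos = 0
    out = []
    while pos < len(s):
        m = TOKEN_RE.match(s, pos)
        if m is None:
            raise ValueError("unexpected char: %r at offset %d" % (s[pos], pos))
        out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out

tokens = tokenize("\\cmd text \\xcmd ytext\n")
for name, text in tokens:
    print(name, repr(text))

escaped = tokenize("a\\&b{")
```

On the test line this gives CMD, TEXT, CMD, TEXT, the same sequence test1 produces, while still accepting escapes such as \& inside TEXT, which is what I want from test2.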