[antlr-interest] lexer woes

Matt Benson gudnabrsam at yahoo.com
Tue Mar 4 14:55:09 PST 2008


--- Loring Craymer <lgcraymer at yahoo.com> wrote:

> 1.)  Yes--see calls to prefixWithSynPred() in
> antlr.g

Hmm.  The reason I asked is that I continue to get
NPEs whenever I turn on backtracking in my lexer
grammar and run Tool against it.

> 2.)  ANTLR 3 defaults to k=*; the best approach is
> to leave k alone.  For ANTLR 2, k was to find a
> minimum value that removed ambiguities; for ANTLR 3,
> a fixed k is the maximum value investigated for any
> decision and so weakens the analysis relative to
> k=*.

Again, if I don't set k=2 for my lexer grammar, it
disables rules that I don't want disabled.  As this
grammar is intended for OSS anyway, I've posted it at:

http://people.apache.org/~mbenson/sharedfiles/BantamLexer.g3

if anyone feels like playing with it.

-Matt
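
For readers following the thread, here is a rough sketch of the rearrangement described in the quoted message below (default k set back to 2, with SL_COMMENT and ML_COMMENT as ordinary rules placed before Token). It is not the posted BantamLexer.g3: the grammar name LooseSketch is hypothetical, the rule bodies are copied from the LineComment and Comment fragments in the original grammar further down, and the exact layout is an assumption.

```antlr
// Hypothetical sketch only; the grammar name is an assumption.
lexer grammar LooseSketch;
options {k=2;}

// Comments as real (non-fragment) rules, listed before Token so that
// the tool resolves the '//' and '/*' prefixes in their favour.
SL_COMMENT
    :    '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
    ;

ML_COMMENT
    :    '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
    ;

// Token and the remaining rules of the original grammar would follow
// here, without the LineComment and Comment alternatives.
```

With the comments defined as rules of their own, the SL_COMMENT and ML_COMMENT entries in the tokens section would presumably no longer be needed.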

> 
> --Loring
> 
> ----- Original Message ----
> > From: Matt Benson <gudnabrsam at yahoo.com>
> > To: Antlr List <antlr-interest at antlr.org>
> > Sent: Tuesday, March 4, 2008 2:05:01 PM
> > Subject: Re: [antlr-interest] lexer woes
> > 
> > Lest my other questions be lost in the noise, I am
> > still confused as to:
> > 
> > 1) Whether backtracking mode is supported for lexers,
> > and
> > 2) How to specify lexer options (particularly "global"
> > k) in a combined grammar.
> > 
> > -Matt
> > 
> > --- Matt Benson  wrote:
> > 
> > > 
> > > --- Loring Craymer  wrote:
> > > 
> > > > This one's easy--unfortunately.  Ter does not yet
> > > > use FOLLOW sets in the lexer, and that tends to
> > > > cause havoc with your nicely factored grammar.
> > > > Also, you have gone overboard on using fragment
> > > > rules where they are not particularly appropriate
> > > > (all of your comments, for example).
> > > > 
> > > > Can comments really be turned into tokens if
> > > > followed by odd characters?  This seems really
> > > > strange.
> > > > 
> > > 
> > > No, that wasn't my intention.  Ugh, I had my comment
> > > rules factored out properly but kept getting told they
> > > were unreachable, despite my awareness of
> > > order-of-rules issues, etc.  However, I just changed
> > > my default k back to 2, put SL_COMMENT and ML_COMMENT
> > > before Token, and now it seems the Tool wants to
> > > disable Token for // and /* as is proper.  Not sure
> > > why I couldn't get it working before but that problem
> > > appears to be solved.  That said I guess I should keep
> > > playing around for a while here...
> > > 
> > > > Anyway, I would suggest factoring out a comment rule
> > > > and either inlining most of the fragments or waiting
> > > > until Ter adds in FOLLOW set usage.
> > > > 
> > > 
> > > Is that in the plan?  I don't pretend to understand
> > > the whole follow set thing, but Google tells me it has
> > > lots of stuff for me to read and I'm still working my
> > > way through the Dragon book which I imagine probably
> > > contains some relevant info as well.
> > > 
> > > Thanks, Loring.
> > > 
> > > > --Loring
> > > > 
> > > > ----- Original Message ----
> > > > > From: Matt Benson 
> > > > > To: Antlr List 
> > > > > Sent: Monday, March 3, 2008 12:53:54 PM
> > > > > Subject: [antlr-interest] lexer woes
> > > > > 
> > > > > > I am working on a language with a fairly loose lexing
> > > > > > scheme.  I am running into all sorts of problems
> > > > > > specifying my lexer:  in particular I can't find any
> > > > > > evidence that backtracking works for lexer grammars.
> > > > > > I tend to get NPEs building the NFAs when combining
> > > > > > synpreds, lexer grammars, and backtracking=true,
> > > > > > whether I use ANTLR 3.0.1 or a fairly recent 3.1
> > > > > > build.  I have had to use a strategy whereby any
> > > > > > possibly confusing tokens are generated from a single
> > > > > > lexer rule.  I'll include my current lexer grammar
> > > > > > that passes Tool generation; if anyone has the
> > > > > > time/inclination/interest to offer ideas how I could
> > > > > > have done things more cleanly I'd be glad to hear
> > > > > > about it.
> > > > > 
> > > > > Thanks (or not),
> > > > > Matt
> > > > > 
> > > > > lexer grammar Loose;
> > > > > options {k=1;}
> > > > > > tokens { Identifier; SEMI; SL_COMMENT; ML_COMMENT;}
> > > > > 
> > > > > EQUALS    :    '=';
> > > > > 
> > > > > StringLiteral
> > > > >     :    '"' ( EscapeSequence | ~('\\'|'"')
> )*
> > > '"'
> > > > >     ;
> > > > > 
> > > > > fragment
> > > > > EscapeSequence
> > > > >     :    '\\'
> > > > >         (   
> > > ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
> > > > >         |    Unicode
> > > > >         |    Octal
> > > > >         )
> > > > >     ;
> > > > > 
> > > > > fragment
> > > > > Octal
> > > > > options {k=3;}
> > > > >     :   ('0'..'3') ('0'..'7') ('0'..'7')
> > > > >     |    ('0'..'7') ('0'..'7')?
> > > > >     ;
> > > > > 
> > > > > fragment
> > > > > Unicode
> > > > >     :    'u' HexDigit HexDigit HexDigit
> HexDigit
> > > > >     ;
> > > > > 
> > > > > fragment
> > > > > HexDigit
> > > > >     :    ('0'..'9'|'a'..'f'|'A'..'F')
> > > > >     ;
> > > > > 
> > > > > WS    :    (WsChar)+ {$channel=HIDDEN;}
> > > > >     ;
> > > > > 
> > > > > fragment
> > > > > WsChar
> > > > >     :    ' '|'\r'|'\t'|'\u000C'|'\n'
> > > > >     ;
> > > > > 
> > > > > Token
> > > > >     :    (';' WsChar)=>';' {$type=SEMI;}
> > > > >     |    ('//')=>LineComment
> {$type=SL_COMMENT;}
> > > > >     |    ('/*')=>Comment {$type=ML_COMMENT;}
> > > > >     |    (TokenMark)=>TokenTail
> {$type=Token;}
> > > > >     |    (    (Letter)=>Ident
> > > {$type=Identifier;}
> > > > >         |    IDDigit (Letter|IDDigit)*
> > > > >         )
> > > > >         //the presence of a token tail
> overrides
> > > > any
> > > > > previously assigned token type:
> > > > >         (TokenTail {$type=Token;})?
> > > > >     ;
> > > > > 
> > > > > fragment
> > > > > LineComment
> > > > >     :    '//' ~('\n'|'\r')* '\r'? '\n'
> > > > {$channel=HIDDEN;}
> > > > >     ;
> > > > > 
> > > > > fragment
> > > > > Comment
> > > > >     :    '/*' ( options {greedy=false;} : .
> )*
> > > > '*/'
> > > > > {$channel=HIDDEN;}
> > > > >     ;
> > > > > 
> > > > > fragment
> > > > > TokenTail
> > > > >     :    TokenMark+ ((Letter|IDDigit)+
> > > > TokenTail?)?
> > > > >     ;
> > > > > 
> > > > > fragment
> > > > > TokenMark
> > > > > options {k=2;}
> > > > >     :    EscapeSequence
> > > > >     |    (';' ~(WsChar))=>';'//do not accept
> > > > semicolon if
> > > > > followed by WS
> > > > >     |   
> > > > ~(Letter|IDDigit|WsChar|';'|'"'|EQUALS|'/')
> > > > >     |    ('/' ~('/'|'*'))=>'/'//do not
> accept
> > > '/'
> > > > if LA
> > > > > finds an upcoming SL/ML comment
> > > > >     ;
> > > > > 
> > > > > fragment
> > > > > Ident
> > > > >     :    Letter (Letter|IDDigit)*
> > > > >     ;
> > > > > 
> > > > > fragment
> > > > > Letter
> > > > >     :    '\u0024'
> > > > >     |    '\u0041'..'\u005a'
> > > > >     |    '\u005f'
> > > > >     |    '\u0061'..'\u007a'
> > > > >     |    '\u00c0'..'\u00d6'
> > > > >     |    '\u00d8'..'\u00f6'
> > > > >     |    '\u00f8'..'\u00ff'
> > > > >     |    '\u0100'..'\u1fff'
> > > > >     |    '\u3040'..'\u318f'
> > > > >     |    '\u3300'..'\u337f'
> > > > >     |    '\u3400'..'\u3d2d'
> > > > >     |    '\u4e00'..'\u9fff'
> > > > >     |    '\uf900'..'\ufaff'
> > > > >     ;
> > > > > 
> > > > > fragment
> > > > > IDDigit
> > > > >     :    '\u0030'..'\u0039'
> > > > >     |    '\u0660'..'\u0669'
> > > > >     |    '\u06f0'..'\u06f9'
> > > > >     |    '\u0966'..'\u096f'
> > > > >     |    '\u09e6'..'\u09ef'
> > > > >     |    '\u0a66'..'\u0a6f'
> > > > >     |    '\u0ae6'..'\u0aef'
> > > > >     |    '\u0b66'..'\u0b6f'
> > > > >     |    '\u0be7'..'\u0bef'
> > > > >     |    '\u0c66'..'\u0c6f'
> > > > >     |    '\u0ce6'..'\u0cef'
> > > > >     |    '\u0d66'..'\u0d6f'
> > > > >     |    '\u0e50'..'\u0e59'
> > > > >     |    '\u0ed0'..'\u0ed9'
> > > > >     |    '\u1040'..'\u1049'
> > > > >     ;
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > >       
> > > > >
> > > >
> > >
> >
>
____________________________________________________________________________________
> > > > > Looking for last minute shopping deals?  
> > > > > Find them fast with Yahoo! Search.  
> > > > >
> > > >
> > >
> >
>
http://tools.search.yahoo.com/newsearch/category.php?category=shopping
> > > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > >      
> > > >
> > >
> >
>
____________________________________________________________________________________
> > > > Be a better friend, newshound, and 
> > > > know-it-all with Yahoo! Mobile.  Try it now. 
> > > >
> > >
> >
>
http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> > > > 
> > > > 
> > > > 
> > > 
> > > 
> > > 
> > >      
> > >
> >
>
____________________________________________________________________________________
> > > Looking for last minute shopping deals?  
> > > Find them fast with Yahoo! Search. 
> > >
> >
>
http://tools.search.yahoo.com/newsearch/category.php?category=shopping
> > > 
> > 
> > 
> > 
> >       
> >
>
____________________________________________________________________________________
> > Never miss a thing.  Make Yahoo your home page. 
> > http://www.yahoo.com/r/hs
> > 
> 
> 
> 
> 
>      
>
____________________________________________________________________________________
> Looking for last minute shopping deals?  
> Find them fast with Yahoo! Search. 
>
http://tools.search.yahoo.com/newsearch/category.php?category=shopping
> 





