[antlr-interest] lexer woes

Matt Benson gudnabrsam at yahoo.com
Tue Mar 4 14:05:01 PST 2008


Lest my other questions be lost in the noise, I am
still confused as to:

1) Whether backtracking mode is supported for lexers,
and
2) How to specify lexer options (particularly "global"
k) in a combined grammar.

-Matt

--- Matt Benson <gudnabrsam at yahoo.com> wrote:

> 
> --- Loring Craymer <lgcraymer at yahoo.com> wrote:
> 
> > This one's easy--unfortunately.  Ter does not yet
> > use FOLLOW sets in the lexer, and that tends to
> > cause havoc with your nicely factored grammar. 
> > Also, you have gone overboard on using fragment
> > rules where they are not particularly appropriate
> > (all of your comments, for example).
> > 
> > Can comments really be turned into tokens if
> > followed by odd characters?  This seems really
> > strange.
> > 
> 
> No, that wasn't my intention.  Ugh, I had my comment
> rules factored out properly but kept getting told they
> were unreachable, despite my awareness of
> order-of-rules issues, etc.  However, I just changed
> my default k back to 2, put SL_COMMENT and ML_COMMENT
> before Token, and now it seems the Tool wants to
> disable Token for // and /* as is proper.  Not sure
> why I couldn't get it working before, but that problem
> appears to be solved.  That said, I guess I should keep
> playing around for a while here...
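
For reference, the arrangement described above might look roughly like the following. This is only a sketch under a few assumptions: SL_COMMENT and ML_COMMENT become ordinary (non-fragment) lexer rules declared ahead of Token, they are dropped from the tokens section, and the matching '//' and '/*' alternatives are removed from Token. The rule bodies are copied from the LineComment and Comment fragments in the grammar quoted further down; nothing here has been run through the Tool.

```antlr
lexer grammar Loose;
options {k=2;}
// Assumption: the comment token names are no longer listed here,
// since they are now defined by real lexer rules below.
tokens { Identifier; SEMI; }

// Declared before Token so that, where both could match,
// the comment rules are the ones preferred.
SL_COMMENT
    :    '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
    ;

ML_COMMENT
    :    '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
    ;

// Token and the remaining rules would follow here, minus the
// ('//')=> and ('/*')=> alternatives.
```

Whether this still behaves at k=1, or with backtracking enabled, is not something the sketch tries to answer.
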
> 
> > Anyway, I would suggest factoring out a comment rule
> > and either inlining most of the fragments or waiting
> > until Ter adds in FOLLOW set usage.
> > 
> 
> Is that in the plan?  I don't pretend to understand
> the whole FOLLOW set thing, but Google tells me it has
> lots of stuff for me to read, and I'm still working my
> way through the Dragon book, which I imagine
> contains some relevant info as well.
> 
> Thanks, Loring.
> 
> > --Loring
> > 
> > ----- Original Message ----
> > > From: Matt Benson <gudnabrsam at yahoo.com>
> > > To: Antlr List <antlr-interest at antlr.org>
> > > Sent: Monday, March 3, 2008 12:53:54 PM
> > > Subject: [antlr-interest] lexer woes
> > > 
> > > I am working on a language with a fairly loose lexing
> > > scheme.  I am running into all sorts of problems
> > > specifying my lexer:  in particular I can't find any
> > > evidence that backtracking works for lexer grammars.
> > > I tend to get NPEs building the NFAs when combining
> > > synpreds, lexer grammars, and backtracking=true,
> > > whether I use ANTLR 3.0.1 or a fairly recent 3.1
> > > build.  I have had to use a strategy whereby any
> > > possibly confusing tokens are generated from a single
> > > lexer rule.  I'll include my current lexer grammar
> > > that passes Tool generation; if anyone has the
> > > time/inclination/interest to offer ideas how I could
> > > have done things more cleanly I'd be glad to hear
> > > about it.
> > > 
> > > Thanks (or not),
> > > Matt
> > > 
> > > lexer grammar Loose;
> > > options {k=1;}
> > > tokens { Identifier; SEMI; SL_COMMENT; ML_COMMENT;}
> > > 
> > > EQUALS    :    '=';
> > > 
> > > StringLiteral
> > >     :    '"' ( EscapeSequence | ~('\\'|'"') )* '"'
> > >     ;
> > > 
> > > fragment
> > > EscapeSequence
> > >     :    '\\'
> > >         (    ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
> > >         |    Unicode
> > >         |    Octal
> > >         )
> > >     ;
> > > 
> > > fragment
> > > Octal
> > > options {k=3;}
> > >     :   ('0'..'3') ('0'..'7') ('0'..'7')
> > >     |    ('0'..'7') ('0'..'7')?
> > >     ;
> > > 
> > > fragment
> > > Unicode
> > >     :    'u' HexDigit HexDigit HexDigit HexDigit
> > >     ;
> > > 
> > > fragment
> > > HexDigit
> > >     :    ('0'..'9'|'a'..'f'|'A'..'F')
> > >     ;
> > > 
> > > WS    :    (WsChar)+ {$channel=HIDDEN;}
> > >     ;
> > > 
> > > fragment
> > > WsChar
> > >     :    ' '|'\r'|'\t'|'\u000C'|'\n'
> > >     ;
> > > 
> > > Token
> > >     :    (';' WsChar)=>';' {$type=SEMI;}
> > >     |    ('//')=>LineComment {$type=SL_COMMENT;}
> > >     |    ('/*')=>Comment {$type=ML_COMMENT;}
> > >     |    (TokenMark)=>TokenTail {$type=Token;}
> > >     |    (    (Letter)=>Ident {$type=Identifier;}
> > >         |    IDDigit (Letter|IDDigit)*
> > >         )
> > >         //the presence of a token tail overrides any previously assigned token type:
> > >         (TokenTail {$type=Token;})?
> > >     ;
> > > 
> > > fragment
> > > LineComment
> > >     :    '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
> > >     ;
> > > 
> > > fragment
> > > Comment
> > >     :    '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
> > >     ;
> > > 
> > > fragment
> > > TokenTail
> > >     :    TokenMark+ ((Letter|IDDigit)+ TokenTail?)?
> > >     ;
> > > 
> > > fragment
> > > TokenMark
> > > options {k=2;}
> > >     :    EscapeSequence
> > >     |    (';' ~(WsChar))=>';' //do not accept semicolon if followed by WS
> > >     |    ~(Letter|IDDigit|WsChar|';'|'"'|EQUALS|'/')
> > >     |    ('/' ~('/'|'*'))=>'/' //do not accept '/' if LA finds an upcoming SL/ML comment
> > >     ;
> > > 
> > > fragment
> > > Ident
> > >     :    Letter (Letter|IDDigit)*
> > >     ;
> > > 
> > > fragment
> > > Letter
> > >     :    '\u0024'
> > >     |    '\u0041'..'\u005a'
> > >     |    '\u005f'
> > >     |    '\u0061'..'\u007a'
> > >     |    '\u00c0'..'\u00d6'
> > >     |    '\u00d8'..'\u00f6'
> > >     |    '\u00f8'..'\u00ff'
> > >     |    '\u0100'..'\u1fff'
> > >     |    '\u3040'..'\u318f'
> > >     |    '\u3300'..'\u337f'
> > >     |    '\u3400'..'\u3d2d'
> > >     |    '\u4e00'..'\u9fff'
> > >     |    '\uf900'..'\ufaff'
> > >     ;
> > > 
> > > fragment
> > > IDDigit
> > >     :    '\u0030'..'\u0039'
> > >     |    '\u0660'..'\u0669'
> > >     |    '\u06f0'..'\u06f9'
> > >     |    '\u0966'..'\u096f'
> > >     |    '\u09e6'..'\u09ef'
> > >     |    '\u0a66'..'\u0a6f'
> > >     |    '\u0ae6'..'\u0aef'
> > >     |    '\u0b66'..'\u0b6f'
> > >     |    '\u0be7'..'\u0bef'
> > >     |    '\u0c66'..'\u0c6f'
> > >     |    '\u0ce6'..'\u0cef'
> > >     |    '\u0d66'..'\u0d6f'
> > >     |    '\u0e50'..'\u0e59'
> > >     |    '\u0ed0'..'\u0ed9'
> > >     |    '\u1040'..'\u1049'
> > >     ;
> > > 

