[antlr-interest] lexer woes

Mon Mar 3 13:57:20 PST 2008

This one's easy--unfortunately.  Ter does not yet use FOLLOW sets in the lexer, and that tends to wreak havoc with a nicely factored grammar like yours.  Also, you have gone overboard on fragment rules in places where they are not particularly appropriate (all of your comments, for example).

Can comments really be turned into tokens if followed by odd characters?  This seems really strange.

Anyway, I would suggest factoring out a comment rule and then either inlining most of the fragments or waiting until Ter adds FOLLOW set usage.
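
A rough, untested sketch of the comment factoring suggested above: the two rules below reuse the bodies of the LineComment and Comment fragments from the quoted grammar, promoted to ordinary token rules named after the existing SL_COMMENT and ML_COMMENT types.  It assumes that the ('//')=> and ('/*')=> alternatives are then removed from Token, that SL_COMMENT and ML_COMMENT no longer need to be listed in tokens { }, and that the grammar-level k=1 is relaxed, since telling '//' or '/*' apart from a lone '/' takes two characters of lookahead.

    SL_COMMENT
        :    '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
        ;

    ML_COMMENT
        :    '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
        ;

Whether the remaining '/' handling in TokenMark then resolves cleanly against these rules would need to be checked by running the Tool.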

--Loring

----- Original Message ----
> From: Matt Benson <gudnabrsam at yahoo.com>
> To: Antlr List <antlr-interest at antlr.org>
> Sent: Monday, March 3, 2008 12:53:54 PM
> Subject: [antlr-interest] lexer woes
> 
> I am working on a language with a fairly loose lexing
> scheme.  I am running into all sorts of problems
> specifying my lexer:  in particular I can't find any
> evidence that backtracking works for lexer grammars. 
> I tend to get NPEs building the NFAs when combining
> synpreds, lexer grammars, and backtracking=true,
> whether I use ANTLR 3.0.1 or a fairly recent 3.1
> build.  I have had to use a strategy whereby any
> possibly confusing tokens are generated from a single
> lexer rule.  I'll include my current lexer grammar
> that passes Tool generation; if anyone has the
> time/inclination/interest to offer ideas how I could
> have done things more cleanly I'd be glad to hear
> about it.
> 
> Thanks (or not),
> Matt
> 
> lexer grammar Loose;
> options {k=1;}
> tokens { Identifier; SEMI; SL_COMMENT; ML_COMMENT;}
> 
> EQUALS    :    '=';
> 
> StringLiteral
>     :    '"' ( EscapeSequence | ~('\\'|'"') )* '"'
>     ;
> 
> fragment
> EscapeSequence
>     :    '\\'
>         (    ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
>         |    Unicode
>         |    Octal
>         )
>     ;
> 
> fragment
> Octal
> options {k=3;}
>     :   ('0'..'3') ('0'..'7') ('0'..'7')
>     |    ('0'..'7') ('0'..'7')?
>     ;
> 
> fragment
> Unicode
>     :    'u' HexDigit HexDigit HexDigit HexDigit
>     ;
> 
> fragment
> HexDigit
>     :    ('0'..'9'|'a'..'f'|'A'..'F')
>     ;
> 
> WS    :    (WsChar)+ {$channel=HIDDEN;}
>     ;
> 
> fragment
> WsChar
>     :    ' '|'\r'|'\t'|'\u000C'|'\n'
>     ;
> 
> Token
>     :    (';' WsChar)=>';' {$type=SEMI;}
>     |    ('//')=>LineComment {$type=SL_COMMENT;}
>     |    ('/*')=>Comment {$type=ML_COMMENT;}
>     |    (TokenMark)=>TokenTail {$type=Token;}
>     |    (    (Letter)=>Ident {$type=Identifier;}
>         |    IDDigit (Letter|IDDigit)*
>         )
>         //the presence of a token tail overrides any previously assigned token type:
>         (TokenTail {$type=Token;})?
>     ;
> 
> fragment
> LineComment
>     :    '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>     ;
> 
> fragment
> Comment
>     :    '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
>     ;
> 
> fragment
> TokenTail
>     :    TokenMark+ ((Letter|IDDigit)+ TokenTail?)?
>     ;
> 
> fragment
> TokenMark
> options {k=2;}
>     :    EscapeSequence
>     |    (';' ~(WsChar))=>';' //do not accept semicolon if followed by WS
>     |    ~(Letter|IDDigit|WsChar|';'|'"'|EQUALS|'/')
>     |    ('/' ~('/'|'*'))=>'/' //do not accept '/' if LA finds an upcoming SL/ML comment
>     ;
> 
> fragment
> Ident
>     :    Letter (Letter|IDDigit)*
>     ;
> 
> fragment
> Letter
>     :    '\u0024'
>     |    '\u0041'..'\u005a'
>     |    '\u005f'
>     |    '\u0061'..'\u007a'
>     |    '\u00c0'..'\u00d6'
>     |    '\u00d8'..'\u00f6'
>     |    '\u00f8'..'\u00ff'
>     |    '\u0100'..'\u1fff'
>     |    '\u3040'..'\u318f'
>     |    '\u3300'..'\u337f'
>     |    '\u3400'..'\u3d2d'
>     |    '\u4e00'..'\u9fff'
>     |    '\uf900'..'\ufaff'
>     ;
> 
> fragment
> IDDigit
>     :    '\u0030'..'\u0039'
>     |    '\u0660'..'\u0669'
>     |    '\u06f0'..'\u06f9'
>     |    '\u0966'..'\u096f'
>     |    '\u09e6'..'\u09ef'
>     |    '\u0a66'..'\u0a6f'
>     |    '\u0ae6'..'\u0aef'
>     |    '\u0b66'..'\u0b6f'
>     |    '\u0be7'..'\u0bef'
>     |    '\u0c66'..'\u0c6f'
>     |    '\u0ce6'..'\u0cef'
>     |    '\u0d66'..'\u0d6f'
>     |    '\u0e50'..'\u0e59'
>     |    '\u0ed0'..'\u0ed9'
>     |    '\u1040'..'\u1049'
>     ;
> 