[antlr-interest] lexer woes
Matt Benson
gudnabrsam at yahoo.com
Mon Mar 3 12:53:54 PST 2008
I am working on a language with a fairly loose lexing
scheme. I am running into all sorts of problems
specifying my lexer: in particular I can't find any
evidence that backtracking works for lexer grammars.
I tend to get NPEs building the NFAs when combining
synpreds, lexer grammars, and backtracking=true,
whether I use ANTLR 3.0.1 or a fairly recent 3.1
build. I have had to use a strategy whereby any
possibly confusing tokens are generated from a single
lexer rule. I'll include my current lexer grammar
that passes Tool generation; if anyone has the
time/inclination/interest to offer ideas how I could
have done things more cleanly I'd be glad to hear
about it.
Thanks (or not),
Matt
lexer grammar Loose;
options {k=1;}
tokens { Identifier; SEMI; SL_COMMENT; ML_COMMENT;}
EQUALS : '=';
StringLiteral
: '"' ( EscapeSequence | ~('\\'|'"') )* '"'
;
fragment
EscapeSequence
: '\\'
( ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| Unicode
| Octal
)
;
fragment
Octal
options {k=3;}
: ('0'..'3') ('0'..'7') ('0'..'7')
| ('0'..'7') ('0'..'7')?
;
fragment
Unicode
: 'u' HexDigit HexDigit HexDigit HexDigit
;
fragment
HexDigit
: ('0'..'9'|'a'..'f'|'A'..'F')
;
WS : (WsChar)+ {$channel=HIDDEN;}
;
fragment
WsChar
: ' '|'\r'|'\t'|'\u000C'|'\n'
;
Token
: (';' WsChar)=>';' {$type=SEMI;}
| ('//')=>LineComment {$type=SL_COMMENT;}
| ('/*')=>Comment {$type=ML_COMMENT;}
| (TokenMark)=>TokenTail {$type=Token;}
| ( (Letter)=>Ident {$type=Identifier;}
| IDDigit (Letter|IDDigit)*
)
//the presence of a token tail overrides any
previously assigned token type:
(TokenTail {$type=Token;})?
;
fragment
LineComment
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
;
fragment
Comment
: '/*' ( options {greedy=false;} : . )* '*/'
{$channel=HIDDEN;}
;
fragment
TokenTail
: TokenMark+ ((Letter|IDDigit)+ TokenTail?)?
;
fragment
TokenMark
options {k=2;}
: EscapeSequence
| (';' ~(WsChar))=>';'//do not accept semicolon if
followed by WS
| ~(Letter|IDDigit|WsChar|';'|'"'|EQUALS|'/')
| ('/' ~('/'|'*'))=>'/'//do not accept '/' if LA
finds an upcoming SL/ML comment
;
fragment
Ident
: Letter (Letter|IDDigit)*
;
fragment
Letter
: '\u0024'
| '\u0041'..'\u005a'
| '\u005f'
| '\u0061'..'\u007a'
| '\u00c0'..'\u00d6'
| '\u00d8'..'\u00f6'
| '\u00f8'..'\u00ff'
| '\u0100'..'\u1fff'
| '\u3040'..'\u318f'
| '\u3300'..'\u337f'
| '\u3400'..'\u3d2d'
| '\u4e00'..'\u9fff'
| '\uf900'..'\ufaff'
;
fragment
IDDigit
: '\u0030'..'\u0039'
| '\u0660'..'\u0669'
| '\u06f0'..'\u06f9'
| '\u0966'..'\u096f'
| '\u09e6'..'\u09ef'
| '\u0a66'..'\u0a6f'
| '\u0ae6'..'\u0aef'
| '\u0b66'..'\u0b6f'
| '\u0be7'..'\u0bef'
| '\u0c66'..'\u0c6f'
| '\u0ce6'..'\u0cef'
| '\u0d66'..'\u0d6f'
| '\u0e50'..'\u0e59'
| '\u0ed0'..'\u0ed9'
| '\u1040'..'\u1049'
;
____________________________________________________________________________________
Looking for last minute shopping deals?
Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping
More information about the antlr-interest
mailing list