[antlr-interest] lexer woes

Matt Benson gudnabrsam at yahoo.com
Mon Mar 3 12:53:54 PST 2008


I am working on a language with a fairly loose lexing
scheme.  I am running into all sorts of problems
specifying my lexer:  in particular I can't find any
evidence that backtracking works for lexer grammars. 
I tend to get NPEs building the NFAs when combining
synpreds, lexer grammars, and backtrack=true,
whether I use ANTLR 3.0.1 or a fairly recent 3.1
build.  I have had to use a strategy whereby any
possibly confusing tokens are generated from a single
lexer rule.  I'll include my current lexer grammar,
which passes Tool generation; if anyone has the
time/inclination/interest to offer ideas on how I could
have done things more cleanly, I'd be glad to hear
them.

Thanks (or not),
Matt

lexer grammar Loose;
options {k=1;}
tokens { Identifier; SEMI; SL_COMMENT; ML_COMMENT;}

EQUALS	:	'=';

StringLiteral
	:	'"' ( EscapeSequence | ~('\\'|'"') )* '"'
	;

fragment
EscapeSequence
	:	'\\'
		(	('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
		|	Unicode
		|	Octal
		)
    ;

fragment
Octal
options {k=3;}
    :   ('0'..'3') ('0'..'7') ('0'..'7')
    |	('0'..'7') ('0'..'7')?
    ;

fragment
Unicode
	:	'u' HexDigit HexDigit HexDigit HexDigit
	;

fragment
HexDigit
	:	('0'..'9'|'a'..'f'|'A'..'F')
	;

WS	:	(WsChar)+ {$channel=HIDDEN;}
	;

fragment
WsChar
	:	' '|'\r'|'\t'|'\u000C'|'\n'
	;

Token
	:	(';' WsChar)=>';' {$type=SEMI;}
	|	('//')=>LineComment {$type=SL_COMMENT;}
	|	('/*')=>Comment {$type=ML_COMMENT;}
	|	(TokenMark)=>TokenTail {$type=Token;}
	|	(	(Letter)=>Ident {$type=Identifier;}
		|	IDDigit (Letter|IDDigit)*
		)
		//the presence of a token tail overrides any previously assigned token type:
		(TokenTail {$type=Token;})?
	;

fragment
LineComment
	:	'//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
	;

fragment
Comment
	:	'/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
	;

fragment
TokenTail
	:	TokenMark+ ((Letter|IDDigit)+ TokenTail?)?
	;

fragment
TokenMark
options {k=2;}
	:	EscapeSequence
	|	(';' ~(WsChar))=>';' //do not accept semicolon if followed by WS
	|	~(Letter|IDDigit|WsChar|';'|'"'|EQUALS|'/')
	|	('/' ~('/'|'*'))=>'/' //do not accept '/' if LA finds an upcoming SL/ML comment
	;

fragment
Ident
	:	Letter (Letter|IDDigit)*
	;

fragment
Letter
	:	'\u0024'
	|	'\u0041'..'\u005a'
	|	'\u005f'
	|	'\u0061'..'\u007a'
	|	'\u00c0'..'\u00d6'
	|	'\u00d8'..'\u00f6'
	|	'\u00f8'..'\u00ff'
	|	'\u0100'..'\u1fff'
	|	'\u3040'..'\u318f'
	|	'\u3300'..'\u337f'
	|	'\u3400'..'\u3d2d'
	|	'\u4e00'..'\u9fff'
	|	'\uf900'..'\ufaff'
	;

fragment
IDDigit
	:	'\u0030'..'\u0039'
	|	'\u0660'..'\u0669'
	|	'\u06f0'..'\u06f9'
	|	'\u0966'..'\u096f'
	|	'\u09e6'..'\u09ef'
	|	'\u0a66'..'\u0a6f'
	|	'\u0ae6'..'\u0aef'
	|	'\u0b66'..'\u0b6f'
	|	'\u0be7'..'\u0bef'
	|	'\u0c66'..'\u0c6f'
	|	'\u0ce6'..'\u0cef'
	|	'\u0d66'..'\u0d6f'
	|	'\u0e50'..'\u0e59'
	|	'\u0ed0'..'\u0ed9'
	|	'\u1040'..'\u1049'
	;
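
The Token and TokenMark rules above decide what a ';' or a '/' means by
peeking at the character that follows it. The Python sketch below mirrors
those predicates by hand, purely to show the decisions in isolation. It is
not ANTLR output and is not part of the grammar. The names classify and
WS_CHARS, and the label UNDECIDED (used at end of input, which neither
predicate appears to cover), are assumptions made for illustration.

```python
# Hand-written illustration of the one-character lookahead used by the
# Token/TokenMark rules. All names here are hypothetical, not ANTLR API.

# Same characters as the WsChar fragment: space, CR, tab, form feed, LF.
WS_CHARS = " \r\t\f\n"


def classify(text, i):
    """Classify the ';' or '/' at text[i] by looking at text[i + 1]."""
    c = text[i]
    # Empty string stands for "no next character" (end of input).
    nxt = text[i + 1] if i + 1 < len(text) else ""

    if c == ";":
        if nxt == "":
            return "UNDECIDED"  # assumption: neither predicate matches at EOF
        # (';' WsChar)=>';'      -> SEMI
        # (';' ~(WsChar))=>';'   -> part of a token tail
        return "SEMI" if nxt in WS_CHARS else "TOKEN_MARK"

    if c == "/":
        if nxt == "/":
            return "SL_COMMENT"   # ('//')=>LineComment
        if nxt == "*":
            return "ML_COMMENT"   # ('/*')=>Comment
        # ('/' ~('/'|'*'))=>'/'  -> part of a token tail
        return "TOKEN_MARK" if nxt else "UNDECIDED"

    raise ValueError("classify only handles ';' and '/'")
```

For example, classify("a; b", 1) gives SEMI, while classify("a;b", 1)
gives TOKEN_MARK, which is the distinction the two semicolon predicates
are meant to draw.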



