[antlr-interest] ignoring lexer rules

Ed Sinjiashvili edsin at swes.saren.ru
Fri Mar 1 03:50:47 PST 2002


Hi,

I asked Terence about this issue and he pointed me to this mailing
list, so here I am. Suppose I have the following grammar (which
describes string literals with octal escapes inside):

-----
class Dummy extends Lexer;
options
{
    charVocabulary = '\3'..'\177';
}

{
    char scanOct(String txt)
	{
		char result = 0;
		try 
		{
			result = (char) Integer.parseInt(txt, 8);
		}
		catch (NumberFormatException e)
		{
			result = 0;
		}
		return result;
	}
}

STR: '"' ( c = ESCAPE { text.append(c); } 
         | ~('\\' | '"')
         )*
     '"'
    ;

protected
ESCAPE! returns [char c = 0]
 	: 	'\\'!
        '0'..'7' 
        (options {warnWhenFollowAmbig = false;} : '0'..'7'
            (options {warnWhenFollowAmbig = false;} : '0'..'7')? )? 
        { c = scanOct($getText); } 
 	;
-----

I'd like the tokenizer to return my strings already interpolated,
that is, escaped octals should be converted to a char, and the parser
should not be able to tell whether a particular character appeared in
the string literally or resulted from escape substitution. Naturally I
used '!' on the ESCAPE rule to discard the matched octal digits and the
backslash. This resulted in the following Java code (trimmed to leave
out irrelevant stuff):

-----
protected final char  mESCAPE(boolean _createToken) throws RecognitionException, CharStreamException, TokenStreamException {
	char c = 0; int _ttype; Token _token=null; int _begin=text.length();
	_ttype = ESCAPE; 
	int _saveIndex;
		
	_saveIndex=text.length();
	match('\\');
	text.setLength(_saveIndex);
	_saveIndex=text.length();
	matchRange('0','7');
	text.setLength(_saveIndex);
        [ ... skipped ... ]

	c = scanOct(new String(text.getBuffer(),_begin,text.length()-_begin));
	if ( _createToken && _token==null && _ttype!=Token.SKIP ) {
		_token = makeToken(_ttype);
		_token.setText(new String(text.getBuffer(), _begin, text.length()-_begin));
	}
	_returnToken = _token;
	return c;
	}
-----

ANTLR just wraps every matched element with "_saveIndex = text.length();"
and "text.setLength(_saveIndex);". This causes my scanOct method to
fail: every matched character has already been discarded by its
"_saveIndex" wrapper by the time the action runs, so $getText is empty.
Besides, it looks a little wrong to me: we know that we are going to
discard all the text, we know where it starts and we know where it
ends. Why not just cut it once, right before trying to create the token
instance? This way actions can access the matched text and work with it.
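
The effect can be reproduced outside ANTLR. What follows is only a
minimal sketch, not generated code: java.lang.StringBuilder stands in
for the lexer's text buffer, the class name SaveIndexDemo and the method
escapeCurrent are made up for illustration, and text.append(d) stands in
for matchRange('0','7'). scanOct is the same as in the grammar above.

```java
public class SaveIndexDemo {
    // Same logic as scanOct in the grammar: unparsable input yields 0.
    static char scanOct(String txt) {
        try {
            return (char) Integer.parseInt(txt, 8);
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    // Hypothetical stand-in for the currently generated mESCAPE:
    // every matched character is appended and then cut off right away.
    static char escapeCurrent(StringBuilder text, String digits) {
        int begin = text.length();
        for (char d : digits.toCharArray()) {
            int saveIndex = text.length();
            text.append(d);             // stands in for matchRange('0','7')
            text.setLength(saveIndex);  // the wrapper added because of '!'
        }
        // Nothing is left between begin and the end of the buffer,
        // so scanOct is handed an empty string.
        return scanOct(text.substring(begin));
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder("\"ab");
        char c = escapeCurrent(text, "101");
        System.out.println((int) c);    // prints 0
    }
}
```

Integer.parseInt throws NumberFormatException on an empty string, which
is why the action always sees 0 here.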

I've patched the JavaCodeGenerator in ANTLR 2.7.2a2 so that it produces
the following Java code (untrimmed this time):

-----
protected final char  mESCAPE(boolean _createToken) throws RecognitionException, CharStreamException, TokenStreamException {
		char c = 0;
		int _ttype; Token _token=null; int _begin=text.length();
		_ttype = ESCAPE;
		int _saveIndex;
		
		_saveIndex=text.length();
		match('\\');
		text.setLength(_saveIndex);
		matchRange('0','7');
		{
		if (((LA(1) >= '0' && LA(1) <= '7'))) {
			matchRange('0','7');
			{
			if (((LA(1) >= '0' && LA(1) <= '7'))) {
				matchRange('0','7');
			}
			else if (((LA(1) >= '\u0003' && LA(1) <= '\u007f'))) {
			}
			else {
				throw new NoViableAltForCharException((char)LA(1), getFilename(), getLine(), getColumn());
			}
			
			}
		}
		else if (((LA(1) >= '\u0003' && LA(1) <= '\u007f'))) {
		}
		else {
			throw new NoViableAltForCharException((char)LA(1), getFilename(), getLine(), getColumn());
		}
		
		}
		c = scanOct(new String(text.getBuffer(),_begin,text.length()-_begin));
		text.setLength(_begin);
		if ( _createToken && _token==null && _ttype!=Token.SKIP ) {
		   _token = makeToken(_ttype);
		   _token.setText(new String(text.getBuffer(), _begin, text.length()-_begin));
		}
		_returnToken = _token;
		return c;
	}
-----	

As you can see, I can still exclude arbitrary matches (the backslash
in the example) from the text, and the rest of the text is available to
the action. Finally, I just discard all the text with
"text.setLength(_begin)". Thus a rule marked with '!' looks to its
actions like an ordinary rule; the only difference is that its text is
not propagated to the caller. To put it more formally: these two pairs
of rules are not equivalent in the current ANTLR 2.7.2a2 (IMHO they
should be identical):

----- first pair
STR: '"' ( (! c = ESCAPE) { text.append(c); } 
         | ~('\\' | '"')
         )*
     '"'
    ;

protected
ESCAPE returns [char c = 0]
 	: '\\'!
        '0'..'7' ('0'..'7' ('0'..'7')? )? 
        { c = scanOct($getText); } 
 	;

----- second pair
STR: '"' ( c = ESCAPE { text.append(c); } 
         | ~('\\' | '"')
         )*
     '"'
    ;

protected
ESCAPE! returns [char c = 0]
 	: '\\'!
        '0'..'7' 
        (options {warnWhenFollowAmbig = false;} : '0'..'7'
            (options {warnWhenFollowAmbig = false;} : '0'..'7')? )? 
        { c = scanOct($getText); } 
 	;
-----
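
For comparison, the behaviour I am proposing for the second pair can be
sketched the same way. Again this is an illustration under assumptions,
not the patched generator's output: StringBuilder stands in for the
lexer's text buffer, and the names PatchedEscapeDemo and escapePatched
are hypothetical.

```java
public class PatchedEscapeDemo {
    // Same logic as scanOct in the grammar: unparsable input yields 0.
    static char scanOct(String txt) {
        try {
            return (char) Integer.parseInt(txt, 8);
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    // Hypothetical stand-in for the patched mESCAPE: the digits stay in
    // the buffer until the action has run, then everything is cut once.
    static char escapePatched(StringBuilder text, String digits) {
        int begin = text.length();

        int saveIndex = text.length();
        text.append('\\');              // stands in for match('\\')
        text.setLength(saveIndex);      // '\\'! still drops the backslash

        text.append(digits);            // stands in for the matchRange calls
        char c = scanOct(text.substring(begin));  // action sees the digits
        text.setLength(begin);          // ESCAPE! discards its text at the end
        return c;
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder("\"ab");
        char c = escapePatched(text, "101");
        text.append(c);                 // what STR does in { text.append(c); }
        System.out.println((int) c);    // prints 65
        System.out.println(text);       // prints "abA
    }
}
```

The calling rule ends up with only the converted character in its text,
which is what the first pair achieves with "(! c = ESCAPE)".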


--Ed 


 
