[antlr-interest] ANTLR 2.7.6/C++: parser controlled conditional lexer whitespace skipping?

Mon Sep 11 03:01:39 PDT 2006

Hello all,

For a project I'm trying to create a CSS 2.1 parser (I started from the 
ccs2.1.g shared grammar on the website).

The shared css2.1.g grammar has the following whitespace lexer rule:

WS:	( ' '
	| '\t'
	| '\f'
	| ( options { generateAmbigWarnings = false; }
		: "\r\n"
		| '\r'
		| '\n'
		) { newline(); }
	)+ { _ttype = antlr::Token::SKIP; } // C++
	;

Now whitespace is both whitespace and a combinator in CSS2.1 (See 
section 5.2 paragraph 3 of specification - "Was there no better 
alternative for that particular combinator that was both human and 
machine-readable? - ).  So skipping whitespace in the lexer doesn't 
look like a good idea.

I've been looking at some strategies as how to solve this, but got 
stuck.

1. Handle whitespace explicitly in the parser. This look like a viable  
strategy, but is probably a lot of (hopefully unneeded?) work.

2. Use the 'ignore=WS' option. For CSS 2.1. you'd have to ignore WS on 
the starting rule  of the grammar (it's whitespace most of the time). 
As far as I could tell this propagates down into subrules. I could not 
find however how to reset this option on a subrule.

3.  A conditional Token::SKIP in the WS lexer rule:

WS:	( ' '
	| '\t'
	| '\f'
	| ( options { generateAmbigWarnings = false; }
		: "\r\n"
		| '\r'
		| '\n'
		) { newline(); }
	)+ { if (preserveWS == false) _ttype = antlr::Token::SKIP; } // C++
	;

In this case you would want the starting rule of the grammar to set 
'preserveWS' to 'false' and have the 'entry'-action of a subrule (near 
where you are parsing the combinator) set 'preserveWS' to 'true'. This 
leads to 2 problems:

How can parser and lexer interact? As far as I could tell the parser 
has no visibility to the lexer, only to the lexer's enclosed 
tokenstream. This means I could add a method to the lexer: public: 
setPreserveWS(bool mode = true) { this->preserveWS = mode; }. But I'm 
unsure if I could ever call this method from the parser.

There does not seem to be an 'exit'-action. How could the 'preserveWS' 
be safely reset to 'false' when the combinator subrule has been 
recognized/failed. Perhaps I would need to specify the same action in 
every branch of the subrule.

Looking at the note in the documentation regarding TokenStream 
filtering this seems like the best alternative - no costly creation of 
WS tokens when there is no need for them.

4. Use a variation on the 'TokenStreamBasicFilter'. This way the lexer 
does not skip WS, but puts it in the TokenStream. One could make a 
'CustomTokenStreamFilter', that allows you to toggle preserveWS in the 
filter. Except: how do I get to the filter (i.e. tokenstream) from the 
parser? I managed to find: this->getInputState().getInput() to arrive 
at the TokenBuffer. The TokenBuffer does not seem to have a (public) 
method to produce it's associated TokenStream.

Perhaps there are some other strategies that I didn't think of.

Could someone help me with this?

Thanks in advance.

With kind regards,
Peter Paulus