[antlr-interest] ANTLR 2.7.6/C++: parser controlled conditional lexer whitespace skipping?

Fri Sep 15 02:39:04 PDT 2006

Hello all,

Meanwhile I've completed a large part of my first strategy: explicit 
whitespace in the grammar.

The grammar below does parse successfully. The grammar and a 
(non-coherent) sample input are included below.

In the declarationBlock rule, however, I keep running into the problem 
that the last declaration, if it carries the "!important" clause, may 
not omit the SEMICOLON, whereas a declaration without the "!important" 
clause may be followed by an optional SEMICOLON and optional whitespace 
before the RBRACE.

For instance these are okay:
{border: green solid 1pt}
{border: green solid 1pt;}
{border: green solid 1pt; }
{border: green solid 1pt !important;}

But this isn't:
{border: green solid 1pt !important}
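
To pin down the behaviour I am after independently of ANTLR, here is a 
minimal hand-written sketch in plain C++. It is not generated code and 
not a fix for the grammar; the names Tok, tokenize, Cursor, declaration 
and block are hypothetical and exist only for this illustration, and 
every term is simplified to a generic "word". It accepts all five inputs 
above, including the last one.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch; they only mirror the grammar's tokens.
enum class Tok { Word, Colon, Semicolon, Important, WS, LBrace, RBrace, End };

static bool isSpace(char ch) { return std::isspace(static_cast<unsigned char>(ch)) != 0; }

// Tiny tokenizer: a whitespace run is one WS token, "!important" is Important,
// and any other run of non-delimiter characters is a Word.
std::vector<Tok> tokenize(const std::string& s) {
    const std::string delims = "{}:;!";
    std::vector<Tok> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const char ch = s[i];
        if (isSpace(ch)) {
            while (i < s.size() && isSpace(s[i])) ++i;
            out.push_back(Tok::WS);
        } else if (ch == '{') { out.push_back(Tok::LBrace); ++i; }
        else if (ch == '}') { out.push_back(Tok::RBrace); ++i; }
        else if (ch == ':') { out.push_back(Tok::Colon); ++i; }
        else if (ch == ';') { out.push_back(Tok::Semicolon); ++i; }
        else if (s.compare(i, 10, "!important") == 0) { out.push_back(Tok::Important); i += 10; }
        else {
            // Always consume at least one character so the loop makes progress.
            do { ++i; } while (i < s.size() && !isSpace(s[i]) && delims.find(s[i]) == std::string::npos);
            out.push_back(Tok::Word);
        }
    }
    out.push_back(Tok::End);
    return out;
}

struct Cursor {
    const std::vector<Tok>& t;
    std::size_t p;
    explicit Cursor(const std::vector<Tok>& toks) : t(toks), p(0) {
        assert(!t.empty() && t.back() == Tok::End);  // End sentinel keeps peek() in range
    }
    Tok peek() const { return t[p]; }
    bool accept(Tok k) { if (t[p] != k) return false; ++p; return true; }
    void skipWS() { accept(Tok::WS); }
};

// declaration: Word WS? ':' WS? Word (WS Word)* WS? Important? WS?
bool declaration(Cursor& c) {
    if (!c.accept(Tok::Word)) return false;
    c.skipWS();
    if (!c.accept(Tok::Colon)) return false;
    c.skipWS();
    if (!c.accept(Tok::Word)) return false;
    // WS acts as a list separator only when another term follows it.
    while (c.peek() == Tok::WS && c.t[c.p + 1] == Tok::Word) c.p += 2;
    c.skipWS();
    c.accept(Tok::Important);  // optional
    c.skipWS();
    return true;
}

// block: '{' WS? ( declaration (';' WS? declaration)* (';' WS?)? )? '}'
bool block(const std::string& text) {
    const std::vector<Tok> toks = tokenize(text);
    Cursor c(toks);
    if (!c.accept(Tok::LBrace)) return false;
    c.skipWS();
    if (c.peek() == Tok::Word) {
        if (!declaration(c)) return false;
        while (c.accept(Tok::Semicolon)) {
            c.skipWS();
            if (c.peek() != Tok::Word) break;  // trailing ';' before '}'
            if (!declaration(c)) return false;
        }
    }
    return c.accept(Tok::RBrace) && c.peek() == Tok::End;
}
```

Calling block("{border: green solid 1pt !important}") is the case of 
interest. The point of the sketch is that a WS only continues the expr 
list when the token after it is another term; otherwise the WS, the 
"!important" and the SEMICOLON are each optional in that fixed order.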

I keep getting a nondeterminism warning between alternative 1 of expr 
and the exit branch of the block.

By the way I found that whitespace had a third overloaded meaning: it 
acts as a list separator in expr.

Could someone help me iron out this last bit?

As for the other strategies: did I miss the obvious, or did I breach 
netiquette with my little CSS whitespace observation? I had certainly 
hoped there would be an alternative approach to landing this, but since 
no one responded I guess that question stays open.

As for what I've come up with so far: I still have to handle comments, 
which will probably add another layer of complexity.

With kind regards,
Peter Paulus

ANTLR 2.7.6 grammar (on request I can give you the entire grammar, 
including lexer):
class CSSParser extends Parser;
options {
	k=3;
	buildAST=true;
}

tokens {
FUNCTION;
DECLARATION;
}

stylesheet: (WS!)? (charset (WS!)?)? (import (WS!)?)* ((medium | ruleset | page) (WS!)?)*;

charset: CHARSET_SYMBOL^ WS! string (WS!)? SEMICOLON;

import: IMPORT_SYMBOL^ WS! string (WS!)? (IDENT (COMMA (WS!)? IDENT (WS!)?)*)? SEMICOLON;

medium:	MEDIA_SYMBOL^
		WS!
		IDENT
		(WS!)?
		(COMMA (WS!)? IDENT (WS!)?)*
		LBRACE (WS!)?
		(ruleset (WS!)?)*
		(RBRACE | EOF)
		;

ruleset:	compositeselector
		LBRACE
		(WS!)?
		(declarationBlock)?
		(RBRACE | EOF)
		;

page: 	PAGE_SYMBOL^
		(WS!)?
		(COLON IDENT (WS!)?)?
		LBRACE
		(WS!)?
		(declarationBlock)?
		(RBRACE | EOF)
		;

compositeselector: selector
		(
			((WS)? COMMA)=> (WS!)? COMMA (WS!)? selector
			| ((WS)? PLUS)=> (WS!)? PLUS (WS!)? selector
			| ((WS)? GREATER) => (WS!)? GREATER (WS!)? selector
			| (WS LBRACE) => WS!
			| WS selector
		)*
		;

selector: (typeselector | universalselector) (idselector | classselector | attributeselector)* (pseudo)?
		| (idselector | classselector | attributeselector)+ (pseudo)?
		;

typeselector: IDENT;

universalselector: STAR;

idselector: HASH^ IDENT;

classselector: DOT^ IDENT;

attributeselector: LSQUARE^ RSQUARE; // incomplete rule

pseudo: COLON^ IDENT;

declarationBlock:   declaration
		(
			(SEMICOLON (WS)? (RBRACE|EOF))=> SEMICOLON (WS!)?
			| SEMICOLON (WS!)? declaration
		)*
		;

declaration: 	id:IDENT^
			(WS!)?
			COLON
			(WS!)?
			expr
			(IMPORTANT_SYMBOL)?
			{ #id->setType(DECLARATION); }
			;

expr:		term
			(
				(WS (SEMICOLON | RBRACE | EOF | IMPORTANT_SYMBOL))=> WS!
				| (WS term)=> WS term
				| (WS COMMA)=> WS! COMMA (WS!)? term
				| (WS SLASH)=> WS! SLASH (WS!)? term
				| SLASH (WS!)? term
				| COMMA (WS!)? term
			)*
			;

term
	: ((PLUS|MINUS)? DIGIT)=> length
	| hexcolor
	| function // including rgb(), srgb(), url()
	| string
	| identifier // including wellknown colors
	| keyword
	;

string: STRING;

keyword:	AUTO | INHERIT;

identifier: IDENT;

function: id:IDENT^ LPARENTHESIS arguments RPARENTHESIS { #id->setType(FUNCTION); };

arguments: (term)? ((COMMA|SLASH) WS term)*;

hexcolor :  hexLiteral;

hexLiteral: HASH (HEX_DIGIT HEX_DIGIT HEX_DIGIT) (HEX_DIGIT HEX_DIGIT HEX_DIGIT)?;

length:  NUM^ (unit)?;

unit: PERCENT | PX | PT | PC | CM | MM | IN | DEG | RAD | GRAD | MS | S | KHZ | HZ;

-------------- next part --------------
A non-text attachment was scrubbed...
Name: table.css
Type: model/vrml
Size: 527 bytes
Desc: not available
Url : http://www.antlr.org/pipermail/antlr-interest/attachments/20060915/9a727744/attachment.vrml 
-------------- next part --------------

Begin forwarded message:

> From: Peter Paulus <peter.paulus at nerocmediaware.nl>
> Date: Mon Sep 11, 2006  12:01:39 Europe/Amsterdam
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] ANTLR 2.7.6/C++: parser controlled 
> conditional lexer whitespace skipping?
>
> Hello all,
>
> For a project I'm trying to create a CSS 2.1 parser (I started from 
> the css2.1.g shared grammar on the website).
>
> The shared css2.1.g grammar has the following whitespace lexer rule:
>
> WS:	( ' '
> 	| '\t'
> 	| '\f'
> 	| ( options { generateAmbigWarnings = false; }
> 		: "\r\n"
> 		| '\r'
> 		| '\n'
> 		) { newline(); }
> 	)+ { _ttype = antlr::Token::SKIP; } // C++
> 	;
>
> Now whitespace is both whitespace and a combinator in CSS 2.1 (see 
> section 5.2, paragraph 3 of the specification. Was there no better 
> alternative for that particular combinator that was both human- and 
> machine-readable?). So skipping whitespace in the lexer doesn't 
> look like a good idea.
>
> I've been looking at some strategies as how to solve this, but got 
> stuck.
>
> 1. Handle whitespace explicitly in the parser. This looks like a 
> viable strategy, but is probably a lot of (hopefully unneeded?) work.
>
> 2. Use the 'ignore=WS' option. For CSS 2.1 you'd have to ignore WS on 
> the starting rule of the grammar (it's whitespace most of the time). 
> As far as I could tell this propagates down into subrules. However, I 
> could not find out how to reset this option on a subrule.
>
> 3.  A conditional Token::SKIP in the WS lexer rule:
>
> WS:	( ' '
> 	| '\t'
> 	| '\f'
> 	| ( options { generateAmbigWarnings = false; }
> 		: "\r\n"
> 		| '\r'
> 		| '\n'
> 		) { newline(); }
> 	)+ { if (preserveWS == false) _ttype = antlr::Token::SKIP; } // C++
> 	;
>
> In this case you would want the starting rule of the grammar to set 
> 'preserveWS' to 'false' and have the 'entry'-action of a subrule (near 
> where you are parsing the combinator) set 'preserveWS' to 'true'. This 
> leads to 2 problems:
>
> How can parser and lexer interact? As far as I could tell the parser 
> has no visibility of the lexer, only of the lexer's enclosed 
> tokenstream. This means I could add a method to the lexer: public: 
> void setPreserveWS(bool mode = true) { this->preserveWS = mode; }. But 
> I'm unsure whether I could ever call this method from the parser.
>
> There does not seem to be an 'exit'-action. How could 'preserveWS' 
> be safely reset to 'false' once the combinator subrule has been 
> recognized or has failed? Perhaps I would need to specify the same 
> action in every branch of the subrule.
>
> Looking at the note in the documentation regarding TokenStream 
> filtering this seems like the best alternative - no costly creation of 
> WS tokens when there is no need for them.
>
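
On the missing 'exit'-action: in C++ one possible workaround is a small 
scope guard whose destructor restores the flag, so the reset happens 
whether the subrule returns normally or leaves by exception. The sketch 
below is plain C++ and does not use any ANTLR class; WhitespaceMode, 
PreserveWSGuard and parseCombinator are hypothetical names standing in 
for the lexer (or filter) that owns the flag and for a rule method.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for whatever object owns the preserveWS flag.
class WhitespaceMode {
public:
    WhitespaceMode() : preserve_(false) {}
    bool preserveWS() const { return preserve_; }
    void setPreserveWS(bool mode = true) { preserve_ = mode; }
private:
    bool preserve_;
};

// Sets the flag on construction and restores the previous value on
// destruction, on normal return as well as during exception unwinding.
class PreserveWSGuard {
public:
    explicit PreserveWSGuard(WhitespaceMode& m, bool mode = true)
        : mode_(m), saved_(m.preserveWS()) { mode_.setPreserveWS(mode); }
    ~PreserveWSGuard() { mode_.setPreserveWS(saved_); }
    PreserveWSGuard(const PreserveWSGuard&) = delete;
    PreserveWSGuard& operator=(const PreserveWSGuard&) = delete;
private:
    WhitespaceMode& mode_;
    bool saved_;
};

// Hypothetical stand-in for a combinator subrule. 'fail' simulates the
// exception a parser would throw on a mismatch.
bool parseCombinator(WhitespaceMode& m, bool fail) {
    PreserveWSGuard guard(m);   // the 'entry' action
    assert(m.preserveWS());     // preservation is on inside the subrule
    if (fail) throw std::runtime_error("no viable alternative");
    return true;
}                               // the 'exit' action runs here in every case
```

One caveat that a guard does not solve: with k tokens of lookahead, 
tokens already pulled into the parser's lookahead buffer were produced 
under the old setting, so the toggle only affects tokens fetched 
afterwards.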
> 4. Use a variation on the 'TokenStreamBasicFilter'. This way the lexer 
> does not skip WS, but puts it in the TokenStream. One could make a 
> 'CustomTokenStreamFilter', that allows you to toggle preserveWS in the 
> filter. Except: how do I get to the filter (i.e. tokenstream) from the 
> parser? I managed to find: this->getInputState().getInput() to arrive 
> at the TokenBuffer. The TokenBuffer does not seem to have a (public) 
> method to produce its associated TokenStream.
>
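
To make strategy 4 concrete, here is a minimal sketch of a filter that 
sits between lexer and parser and drops WS tokens unless preservation is 
switched on. It deliberately does not use the ANTLR runtime: Token, 
TokenSource, VectorSource and ToggleWSFilter are hypothetical types 
invented for this illustration, so it only shows the shape of the idea.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token type for this sketch; not the ANTLR Token class.
struct Token {
    enum Kind { IDENT, WS, LBRACE, END } kind;
    std::string text;
};

// Hypothetical minimal stream interface: one token per call.
class TokenSource {
public:
    virtual ~TokenSource() {}
    virtual Token nextToken() = 0;
};

// Replays a fixed token list, then END forever; stands in for the lexer.
class VectorSource : public TokenSource {
public:
    explicit VectorSource(std::vector<Token> toks) : toks_(std::move(toks)), pos_(0) {}
    Token nextToken() override {
        if (pos_ < toks_.size()) return toks_[pos_++];
        return Token{Token::END, ""};
    }
private:
    std::vector<Token> toks_;
    std::size_t pos_;
};

// Passes tokens through, discarding WS tokens while preserveWS is false.
class ToggleWSFilter : public TokenSource {
public:
    explicit ToggleWSFilter(TokenSource& in) : in_(in), preserveWS_(false) {}
    void setPreserveWS(bool mode = true) { preserveWS_ = mode; }
    Token nextToken() override {
        Token t = in_.nextToken();
        while (t.kind == Token::WS && !preserveWS_) t = in_.nextToken();
        return t;
    }
private:
    TokenSource& in_;
    bool preserveWS_;
};
```

Since whoever constructs the parser also constructs the filter, one 
option might be to hand the parser a pointer to the filter at that point 
instead of trying to dig it out of the TokenBuffer. That is an 
assumption about how the pieces could be wired, not something I have 
tried.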
> Perhaps there are some other strategies that I didn't think of.
>
> Could someone help me with this?
>
> Thanks in advance.
>
> With kind regards,
> Peter Paulus
>