[antlr-interest] ANTLR 2.7.6/C++: parser controlled conditional lexer whitespace skipping?
Peter Paulus
peter.paulus at nerocmediaware.nl
Fri Sep 15 02:39:04 PDT 2006
Hello all,
In the meantime I've completed a large part of my first strategy:
explicit whitespace in the grammar.
The grammar below parses successfully. The grammar and a
(non-coherent) sample input are included below.
On the declarationBlock rule, however, I keep running into the problem
that the last declaration, if specified with the "!important" clause,
may not omit the SEMICOLON, whereas a declaration without the
"!important" clause may be followed by an optional SEMICOLON and
optional whitespace before the RBRACE.
For instance these are okay:
{border: green solid 1pt}
{border: green solid 1pt;}
{border: green solid 1pt; }
{border: green solid 1pt !important;}
But this isn't:
{border: green solid 1pt !important}
I keep getting a nondeterminism between one alternative of expr and
the exit branch.
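The intended behaviour of the block tail can be pinned down
independently of ANTLR. Below is a minimal hand-written sketch, not
generated code and not a fix for the grammar. As a simplifying
assumption, each declaration (property, colon, expr) is collapsed into
one hypothetical DECL token; the Tok enum, acceptsBlock and
checkExamples are illustrative names only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified token kinds. The names echo the grammar's tokens,
// but DECL stands for a whole "property: expr" and is an assumption.
enum class Tok { LBRACE, RBRACE, DECL, IMPORTANT, SEMICOLON, WS, END };

// Accepts: LBRACE WS? (DECL WS? (IMPORTANT WS?)? (SEMICOLON WS? DECL ...)*
//          SEMICOLON? WS?)? RBRACE
bool acceptsBlock(const std::vector<Tok>& toks) {
    std::size_t i = 0;
    auto peek = [&]() { return i < toks.size() ? toks[i] : Tok::END; };
    auto skipWs = [&]() { while (peek() == Tok::WS) ++i; };

    if (peek() != Tok::LBRACE) return false;
    ++i;
    skipWs();
    while (peek() == Tok::DECL) {
        ++i;
        skipWs();
        if (peek() == Tok::IMPORTANT) { ++i; skipWs(); }
        if (peek() != Tok::SEMICOLON) break;  // last declaration, separator omitted
        ++i;
        skipWs();
    }
    if (peek() != Tok::RBRACE) return false;
    ++i;
    return i == toks.size();  // nothing may follow the closing brace
}

// The five inputs from the message, in the simplified token form.
void checkExamples() {
    using T = Tok;
    assert(acceptsBlock({T::LBRACE, T::DECL, T::RBRACE}));
    assert(acceptsBlock({T::LBRACE, T::DECL, T::SEMICOLON, T::RBRACE}));
    assert(acceptsBlock({T::LBRACE, T::DECL, T::SEMICOLON, T::WS, T::RBRACE}));
    assert(acceptsBlock({T::LBRACE, T::DECL, T::WS, T::IMPORTANT, T::SEMICOLON, T::RBRACE}));
    // The case the grammar currently rejects, but which should be accepted:
    assert(acceptsBlock({T::LBRACE, T::DECL, T::WS, T::IMPORTANT, T::RBRACE}));
}
```

The sketch decides everything with one token of lookahead after
skipping whitespace, which is only possible because the whitespace is
consumed before the decision is taken.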
By the way I found that whitespace had a third overloaded meaning: it
acts as a list separator in expr.
Could someone help me iron out this last bit?
On the other strategies: did I miss the obvious, or did I breach
netiquette with my little CSS whitespace observation? I had
certainly hoped there would be an alternative approach to tackling
this, but since no one responded I guess this question stays open.
So much for what I've come up with so far. I still have to do comments,
which will probably add another layer of complexity.
With kind regards,
Peter Paulus
ANTLR 2.7.6 grammar (on request I can give you the entire grammar,
including lexer):
class CSSParser extends Parser;
options {
k=3;
buildAST=true;
}
tokens {
FUNCTION;
DECLARATION;
}
stylesheet: (WS!)? (charset (WS!)?)? (import (WS!)?)* ((medium |
ruleset | page) (WS!)?)*;
charset: CHARSET_SYMBOL^ WS! string (WS!)? SEMICOLON;
import: IMPORT_SYMBOL^ WS! string (WS!)? (IDENT (COMMA (WS!)? IDENT
(WS!)?)*)? SEMICOLON;
medium: MEDIA_SYMBOL^
WS!
IDENT
(WS!)?
(COMMA (WS!)? IDENT (WS!)?)*
LBRACE (WS!)?
(ruleset (WS!)?)*
(RBRACE | EOF)
;
ruleset: compositeselector
LBRACE
(WS!)?
(declarationBlock)?
(RBRACE | EOF)
;
page: PAGE_SYMBOL^
(WS!)?
(COLON IDENT (WS!)?)?
LBRACE
(WS!)?
(declarationBlock)?
(RBRACE | EOF)
;
compositeselector: selector
(
((WS)? COMMA)=> (WS!)? COMMA (WS!)? selector
| ((WS)? PLUS)=> (WS!)? PLUS (WS!)? selector
| ((WS)? GREATER) => (WS!)? GREATER (WS!)? selector
| (WS LBRACE) => WS!
| WS selector
)*
;
selector: (typeselector | universalselector) (idselector |
classselector | attributeselector)* (pseudo)?
| (idselector | classselector | attributeselector)+ (pseudo)?
;
typeselector: IDENT;
universalselector: STAR;
idselector: HASH^ IDENT;
classselector: DOT^ IDENT;
attributeselector: LSQUARE^ RSQUARE; // incomplete rule
pseudo: COLON^ IDENT;
declarationBlock: declaration
(
(SEMICOLON (WS)? (RBRACE|EOF))=> SEMICOLON (WS!)?
| SEMICOLON (WS!)? declaration
)*
;
declaration: id:IDENT^
(WS!)?
COLON
(WS!)?
expr
(IMPORTANT_SYMBOL)?
{ #id->setType(DECLARATION); }
;
expr: term
(
(WS (SEMICOLON | RBRACE | EOF | IMPORTANT_SYMBOL))=> WS!
| (WS term)=> WS term
| (WS COMMA)=> WS! COMMA (WS!)? term
| (WS SLASH)=> WS! SLASH (WS!)? term
| SLASH (WS!)? term
| COMMA (WS!)? term
)*
;
term
: ((PLUS|MINUS)? DIGIT)=> length
| hexcolor
| function // including rgb(), srgb(), url()
| string
| identifier // including wellknown colors
| keyword
;
string: STRING;
keyword: AUTO | INHERIT;
identifier: IDENT;
function: id:IDENT^ LPARENTHESIS arguments RPARENTHESIS {
#id->setType(FUNCTION); };
arguments: (term)? ((COMMA|SLASH) WS term)*;
hexcolor : hexLiteral;
hexLiteral: HASH (HEX_DIGIT HEX_DIGIT HEX_DIGIT) (HEX_DIGIT HEX_DIGIT
HEX_DIGIT)?;
length: NUM^ (unit)?;
unit: PERCENT | PX | PT | PC | CM | MM | IN | DEG | RAD | GRAD | MS |
S | KHZ | HZ;
-------------- next part --------------
A non-text attachment was scrubbed...
Name: table.css
Type: model/vrml
Size: 527 bytes
Desc: not available
Url : http://www.antlr.org/pipermail/antlr-interest/attachments/20060915/9a727744/attachment.vrml
-------------- next part --------------
Begin forwarded message:
> From: Peter Paulus <peter.paulus at nerocmediaware.nl>
> Date: Mon Sep 11, 2006 12:01:39 Europe/Amsterdam
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] ANTLR 2.7.6/C++: parser controlled
> conditional lexer whitespace skipping?
>
> Hello all,
>
> For a project I'm trying to create a CSS 2.1 parser (I started from
> the css2.1.g shared grammar on the website).
>
> The shared css2.1.g grammar has the following whitespace lexer rule:
>
> WS: ( ' '
> | '\t'
> | '\f'
> | ( options { generateAmbigWarnings = false; }
> : "\r\n"
> | '\r'
> | '\n'
> ) { newline(); }
> )+ { _ttype = antlr::Token::SKIP; } // C++
> ;
>
> Now whitespace is both whitespace and a combinator in CSS 2.1 (see
> section 5.2, paragraph 3 of the specification; was there no better
> alternative for that particular combinator that was both human- and
> machine-readable?). So skipping whitespace in the lexer doesn't
> look like a good idea.
>
> I've been looking at some strategies for solving this, but got
> stuck.
>
> 1. Handle whitespace explicitly in the parser. This looks like a viable
> strategy, but is probably a lot of (hopefully unneeded?) work.
>
> 2. Use the 'ignore=WS' option. For CSS 2.1 you'd have to ignore WS on
> the starting rule of the grammar (it's whitespace most of the time).
> As far as I could tell this propagates down into subrules. However, I
> could not find how to reset this option on a subrule.
>
> 3. A conditional Token::SKIP in the WS lexer rule:
>
> WS: ( ' '
> | '\t'
> | '\f'
> | ( options { generateAmbigWarnings = false; }
> : "\r\n"
> | '\r'
> | '\n'
> ) { newline(); }
> )+ { if (preserveWS == false) _ttype = antlr::Token::SKIP; } // C++
> ;
>
> In this case you would want the starting rule of the grammar to set
> 'preserveWS' to 'false' and have the 'entry'-action of a subrule (near
> where you are parsing the combinator) set 'preserveWS' to 'true'. This
> leads to 2 problems:
>
> How can parser and lexer interact? As far as I could tell the parser
> has no visibility of the lexer, only of the lexer's enclosed
> tokenstream. This means I could add a method to the lexer: public:
> void setPreserveWS(bool mode = true) { this->preserveWS = mode; }. But
> I'm unsure if I could ever call this method from the parser.
>
> There does not seem to be an 'exit'-action. How could 'preserveWS'
> be safely reset to 'false' when the combinator subrule has been
> recognized or has failed? Perhaps I would need to specify the same
> action in every branch of the subrule.
>
> Looking at the note in the documentation regarding TokenStream
> filtering this seems like the best alternative - no costly creation of
> WS tokens when there is no need for them.
>
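On the missing 'exit'-action: in C++ one possible workaround is a small
scope guard whose destructor resets the flag, so that both normal
completion and an exception leaving the rule restore it. The following
is a minimal sketch under stated assumptions: it presumes a lexer that
exposes the setPreserveWS method described above, and LexerStub,
PreserveWsGuard and parseCombinator are hypothetical names, not part of
ANTLR. It does not show how the parser gets hold of the lexer.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the generated lexer; only the flag matters here.
struct LexerStub {
    bool preserveWS = false;
    void setPreserveWS(bool mode = true) { preserveWS = mode; }
};

// Sets preserveWS on construction and restores the previous value on
// destruction, including when an exception unwinds the stack.
class PreserveWsGuard {
public:
    explicit PreserveWsGuard(LexerStub& lexer, bool mode = true)
        : lexer_(lexer), previous_(lexer.preserveWS) {
        lexer_.setPreserveWS(mode);
    }
    ~PreserveWsGuard() { lexer_.setPreserveWS(previous_); }

    PreserveWsGuard(const PreserveWsGuard&) = delete;
    PreserveWsGuard& operator=(const PreserveWsGuard&) = delete;

private:
    LexerStub& lexer_;
    bool previous_;
};

// Stand-in for the combinator subrule; `fail` simulates a recognition error.
bool parseCombinator(LexerStub& lexer, bool fail) {
    PreserveWsGuard guard(lexer);   // plays the role of the 'entry'-action
    assert(lexer.preserveWS);       // WS tokens are visible inside the subrule
    if (fail) throw std::runtime_error("no viable alternative");
    return true;
}                                   // guard's destructor is the 'exit'-action
```

One caveat the sketch does not address: with k=3 the parser may already
have pulled tokens into its lookahead buffer before the guard is
constructed, and those tokens were lexed under the old setting.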
> 4. Use a variation on the 'TokenStreamBasicFilter'. This way the lexer
> does not skip WS, but puts it in the TokenStream. One could make a
> 'CustomTokenStreamFilter' that allows you to toggle preserveWS in the
> filter. Except: how do I get to the filter (i.e. tokenstream) from the
> parser? I managed to find: this->getInputState().getInput() to arrive
> at the TokenBuffer. The TokenBuffer does not seem to have a (public)
> method to produce its associated TokenStream.
>
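The toggling filter of strategy 4 can be sketched without ANTLR. The
types below (Token, TokenSource, VectorSource, ToggleWsFilter) are
hypothetical and only imitate the shape of a token stream with one
nextToken() call per token; they are not the ANTLR 2 C++ classes. The
sketch also assumes the driver code keeps its own reference to the
filter, which sidesteps the question of reaching it through the
TokenBuffer.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical minimal token model, not the ANTLR API.
enum TokenType { END_OF_INPUT, IDENT, WS, LBRACE };
struct Token { TokenType type; };

class TokenSource {
public:
    virtual ~TokenSource() {}
    virtual Token nextToken() = 0;
};

// Plays the role of a lexer that never skips whitespace.
class VectorSource : public TokenSource {
public:
    explicit VectorSource(std::vector<Token> tokens)
        : tokens_(std::move(tokens)), pos_(0) {}
    Token nextToken() override {
        if (pos_ < tokens_.size()) return tokens_[pos_++];
        return Token{END_OF_INPUT};
    }
private:
    std::vector<Token> tokens_;
    std::size_t pos_;
};

// Sits between lexer and parser; drops WS unless preserveWS is switched on.
class ToggleWsFilter : public TokenSource {
public:
    explicit ToggleWsFilter(TokenSource& input)
        : input_(input), preserveWS_(false) {}
    void setPreserveWS(bool mode = true) { preserveWS_ = mode; }
    Token nextToken() override {
        Token t = input_.nextToken();
        while (!preserveWS_ && t.type == WS) t = input_.nextToken();
        return t;
    }
private:
    TokenSource& input_;
    bool preserveWS_;
};

void demoToggle() {
    std::vector<Token> tokens = {{IDENT}, {WS}, {IDENT}, {WS}, {LBRACE}};
    VectorSource lexer(tokens);
    ToggleWsFilter filter(lexer);
    assert(filter.nextToken().type == IDENT);
    assert(filter.nextToken().type == IDENT);   // first WS was dropped
    filter.setPreserveWS(true);                 // e.g. near the combinator
    assert(filter.nextToken().type == WS);      // second WS is passed through
    assert(filter.nextToken().type == LBRACE);
    assert(filter.nextToken().type == END_OF_INPUT);
}
```

Unlike the conditional SKIP in the lexer, this variant still pays for
creating every WS token; the filter only decides whether the parser
sees it.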
> Perhaps there are some other strategies that I didn't think of.
>
> Could someone help me with this?
>
> Thanks in advance.
>
> With kind regards,
> Peter Paulus
>