[antlr-interest] ANTLR 2.7.6/C++: parser controlled conditional lexer whitespace skipping?
Peter Paulus
peter.paulus at nerocmediaware.nl
Mon Sep 11 03:01:39 PDT 2006
Hello all,
For a project I'm trying to create a CSS 2.1 parser (I started from the
css2.1.g shared grammar on the website).
The shared css2.1.g grammar has the following whitespace lexer rule:
WS: ( ' '
    | '\t'
    | '\f'
    | ( options { generateAmbigWarnings = false; }
      : "\r\n"
      | '\r'
      | '\n'
      ) { newline(); }
    )+ { _ttype = antlr::Token::SKIP; } // C++
  ;
Now, in CSS 2.1 whitespace is both plain whitespace and a combinator
(see section 5.2, paragraph 3 of the specification; was there really no
better alternative for that particular combinator, one that is both
human- and machine-readable?). So unconditionally skipping whitespace
in the lexer doesn't look like a good idea.
I've been looking at some strategies as how to solve this, but got
stuck.
1. Handle whitespace explicitly in the parser. This looks like a viable
strategy, but is probably a lot of (hopefully unneeded?) work.
2. Use the 'ignore=WS' option. For CSS 2.1 you'd have to ignore WS on
the starting rule of the grammar (it is plain whitespace most of the
time). As far as I could tell, this propagates down into subrules.
However, I could not find a way to reset this option on a subrule.
3. A conditional Token::SKIP in the WS lexer rule:
WS: ( ' '
    | '\t'
    | '\f'
    | ( options { generateAmbigWarnings = false; }
      : "\r\n"
      | '\r'
      | '\n'
      ) { newline(); }
    )+ { if (preserveWS == false) _ttype = antlr::Token::SKIP; } // C++
  ;
In this case you would want the starting rule of the grammar to set
'preserveWS' to 'false' and have the 'entry'-action of a subrule (near
where you are parsing the combinator) set 'preserveWS' to 'true'. This
leads to 2 problems:
How can parser and lexer interact? As far as I could tell, the parser
has no visibility of the lexer, only of the lexer's enclosed
token stream. This means I could add a public method to the lexer:
void setPreserveWS(bool mode = true) { this->preserveWS = mode; }. But
I'm unsure whether I could ever call this method from the parser.
There does not seem to be an 'exit'-action. How could 'preserveWS' be
safely reset to 'false' once the combinator subrule has been recognized
or has failed? Perhaps I would need to specify the same action in every
branch of the subrule.
Judging by the note in the documentation regarding TokenStream
filtering, this seems like the best alternative: no costly creation of
WS tokens when there is no need for them.
4. Use a variation on the 'TokenStreamBasicFilter'. This way the lexer
does not skip WS, but puts it in the TokenStream. One could make a
'CustomTokenStreamFilter', that allows you to toggle preserveWS in the
filter. Except: how do I get to the filter (i.e. the token stream) from
the parser? I managed to find this->getInputState().getInput() to arrive
at the TokenBuffer, but the TokenBuffer does not seem to have a (public)
method to produce its associated TokenStream.
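To make strategies 3 and 4 more concrete, here is a minimal sketch of
what I have in mind: a filter whose whitespace skipping can be toggled,
plus a small scope guard that stands in for the missing 'exit'-action by
restoring the previous mode in its destructor. It deliberately does not
use the real antlr:: classes; Tok, TokStream, VectorStream,
WsToggleFilter, PreserveWsScope and demo() are all hypothetical names
made up for this illustration, so it shows the idea only, not how it
would plug into ANTLR 2.7.6.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in token types; a real grammar would use its generated constants.
enum TokType { T_EOF, T_WS, T_IDENT };

struct Tok {
    TokType type;
    std::string text;
};

// Hypothetical stand-in for a token source such as a lexer.
struct TokStream {
    virtual ~TokStream() {}
    virtual Tok nextToken() = 0;
};

// Replays a fixed token list, then returns T_EOF forever.
class VectorStream : public TokStream {
public:
    explicit VectorStream(const std::vector<Tok>& toks) : toks_(toks), pos_(0) {}
    Tok nextToken() {
        if (pos_ < toks_.size()) return toks_[pos_++];
        Tok eof = { T_EOF, "" };
        return eof;
    }
private:
    std::vector<Tok> toks_;
    std::size_t pos_;
};

// Filter that drops T_WS tokens unless preserveWS is switched on.
class WsToggleFilter : public TokStream {
public:
    explicit WsToggleFilter(TokStream& in) : in_(in), preserveWS_(false) {}
    void setPreserveWS(bool mode = true) { preserveWS_ = mode; }
    bool preserveWS() const { return preserveWS_; }
    Tok nextToken() {
        Tok t = in_.nextToken();
        while (t.type == T_WS && !preserveWS_) t = in_.nextToken();
        return t;
    }
private:
    TokStream& in_;
    bool preserveWS_;
};

// Scope guard: turns preserveWS on, restores the old mode on destruction,
// whether the enclosing block is left normally or by an exception.
class PreserveWsScope {
public:
    explicit PreserveWsScope(WsToggleFilter& f) : f_(f), old_(f.preserveWS()) {
        f_.setPreserveWS(true);
    }
    ~PreserveWsScope() { f_.setPreserveWS(old_); }
private:
    WsToggleFilter& f_;
    bool old_;
    PreserveWsScope(const PreserveWsScope&);            // non-copyable
    PreserveWsScope& operator=(const PreserveWsScope&);
};

Tok makeTok(TokType type, const std::string& text) {
    Tok t = { type, text };
    return t;
}

// Reads "h1 em p": whitespace is kept only inside the guarded block.
std::string demo() {
    std::vector<Tok> toks;
    toks.push_back(makeTok(T_IDENT, "h1"));
    toks.push_back(makeTok(T_WS, " "));
    toks.push_back(makeTok(T_IDENT, "em"));
    toks.push_back(makeTok(T_WS, " "));
    toks.push_back(makeTok(T_IDENT, "p"));
    VectorStream lexer(toks);
    WsToggleFilter filter(lexer);

    std::string out = filter.nextToken().text;       // "h1"
    {
        PreserveWsScope keep(filter);                // 'entry'-action
        out += "|" + filter.nextToken().text;        // the WS token
        out += "|" + filter.nextToken().text;        // "em"
    }                                                // 'exit'-action
    assert(!filter.preserveWS());
    out += "|" + filter.nextToken().text;            // "p", WS skipped
    return out;
}
```

In a generated parser I would presumably have to give the parser a
pointer to the filter from the code that constructs both, which is an
assumption I have not tried. I also suspect that the flag can only
affect tokens the parser has not yet pulled into its lookahead buffer,
so the toggle might have to happen earlier than the rule that needs it.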
Perhaps there are some other strategies that I didn't think of.
Could someone help me with this?
Thanks in advance.
With kind regards,
Peter Paulus