[antlr-interest] grammar for folded lines
Gavin Lambert
antlr at mirality.co.nz
Sat Dec 5 01:33:28 PST 2009
At 12:27 5/12/2009, Mark Eggers wrote:
>Any sequence of CRLF followed immediately by a single linear
>white
>space character is ignored (i.e., removed) when processing the
>content type.
[...]
>So, the way I'm reading this is that a folding token (' '|'\t')
>CRLF can come anywhere in the input stream and needs to be
>ignored before processing.
[...]
>FOLD: (' '|'\t') NEWLINE {skip();} ;
It's actually the other way around -- a newline followed by a
single space or tab -- that is the specified folding condition.
>id: (FOLD)=>
> | ID '=' ID ';' NEWLINE
> | NEWLINE
> ;
Since you have specified that FOLD tokens are skipped, you cannot
refer to them in the parser.
>NEWLINE: '\r'? '\n' ;
[...]
>WS: (' '|'\t'|'\r'|'\n')+ {skip();} ;
Bear in mind that these tokens are ambiguous -- if you get a
single CRLF you'll get a NEWLINE token (which the parser will see)
but CRLFCRLF (or any other combination of newlines and additional
whitespace) in the input will be seen as WS (which the parser
won't see). If newlines are significant to your parser then you
shouldn't be skipping them like this. (And if they're not then
you shouldn't have a NEWLINE token.)
>It fails when typing in:
>
>ca
> t=dog;
>
>I'm trying to get two ID tokens out of the last entry.
>
>I'm obviously not understanding something fundamental. Hopefully
>I
>can accomplish this without filtering the input before the
>Antlr-generated code is used.
That actually is your best bet, particularly since the line
folding is occurring in the middle of a token. While there are
ways you can deal with this and re-stitch things in the ANTLR
lexer in a single pass, it will be much more error-prone and
ugly. The simplest thing to do is to write a custom CharStream
filter that takes care of the folding, sitting between the file
stream and the lexer.
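
As a rough illustration of that kind of filter, here is a minimal sketch in plain java.io. It is not an implementation of ANTLR's CharStream interface; the class name UnfoldingReader and the helper unfold() are made up for this example. It assumes a fold is a line break (matching the NEWLINE rule above, '\r'? '\n') followed immediately by exactly one space or tab, and it removes both the line break and that one whitespace character.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical filter: drops each fold (line break + one space or tab). */
class UnfoldingReader extends Reader {
    private final PushbackReader in;

    UnfoldingReader(Reader source) {
        // At most two characters are ever pushed back: '\n' and one lookahead.
        this.in = new PushbackReader(source, 2);
    }

    @Override
    public int read() throws IOException {
        while (true) {
            int c = in.read();
            if (c != '\r' && c != '\n') {
                return c; // ordinary character or end of input (-1)
            }
            boolean crlf = false;
            if (c == '\r') {
                int n = in.read();
                if (n != '\n') {
                    // A bare CR is not a line break here; pass it through.
                    if (n != -1) in.unread(n);
                    return c;
                }
                crlf = true;
            }
            int next = in.read();
            if (next == ' ' || next == '\t') {
                continue; // fold found: discard line break and whitespace
            }
            // Not a fold: restore the lookahead and emit the line break.
            if (next != -1) in.unread(next);
            if (crlf) in.unread('\n');
            return c;
        }
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int count = 0;
        while (count < len) {
            int c = read();
            if (c == -1) break;
            buf[off + count++] = (char) c;
        }
        return (count == 0 && len > 0) ? -1 : count;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Convenience wrapper (also hypothetical) for unfolding a whole string. */
    static String unfold(String text) {
        try (Reader r = new UnfoldingReader(new StringReader(text))) {
            StringBuilder sb = new StringBuilder();
            for (int c; (c = r.read()) != -1; ) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the failing input from the original message, UnfoldingReader.unfold("ca\r\n t=dog;\r\n") yields "cat=dog;\r\n", so the lexer sees cat and dog as the two ID tokens. In ANTLR 3 the filtered Reader could presumably be wrapped in an ANTLRReaderStream before being handed to the lexer. One trade-off to keep in mind: because the folds are removed before lexing, line and column numbers reported in error messages will refer to the unfolded text rather than the original file.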