[antlr-interest] grammar for folded lines

Sat Dec 5 01:33:28 PST 2009

At 12:27 5/12/2009, Mark Eggers wrote:
 >Any sequence of CRLF followed immediately by a single linear
 >white space character is ignored (i.e., removed) when processing
 >the content type.
[...]
 >So, the way I'm reading this is that a folding token (' '|'\t')
 >CRLF can come anywhere in the input stream and needs to be
 >ignored before processing.
[...]
 >FOLD: (' '|'\t') NEWLINE {skip();} ;

It's actually the other way around -- newline followed by space -- 
that is the specified folding condition.

 >id: (FOLD)=>
 >  | ID '=' ID ';' NEWLINE
 >  | NEWLINE
 >  ;

Since you have specified that FOLD tokens are skipped, they never 
reach the parser, so you cannot refer to them there (the (FOLD)=> 
predicate can never match).

 >NEWLINE: '\r'? '\n' ;
[...]
 >WS: (' '|'\t'|'\r'|'\n')+ {skip();} ;

Bear in mind that these tokens are ambiguous -- if you get a 
single CRLF you'll get a NEWLINE token (which the parser will see) 
but CRLFCRLF (or any other combination of newlines and additional 
whitespace) in the input will be seen as WS (which the parser 
won't see).  If newlines are significant to your parser then you 
shouldn't be skipping them like this.  (And if they're not then 
you shouldn't have a NEWLINE token.)

 >It fails when typing in:
 >
 >ca
 > t=dog;
 >
 >I'm trying to get two ID tokens out of the last entry.
 >
 >I'm obviously not understanding something fundamental. Hopefully
 >I can accomplish this without filtering the input before the
 >Antlr-generated code is used.

Filtering the input first actually is your best bet, particularly since the line 
folding is occurring in the middle of a token.  While there are 
ways you can deal with this and re-stitch things in the ANTLR 
lexer in a single pass, it will be much more error-prone and 
ugly.  The simplest thing to do is to write a custom CharStream 
filter that takes care of the folding, sitting between the file 
stream and the lexer.
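
As a rough illustration of that filtering idea, here is a minimal sketch written as a plain java.io.Reader wrapper rather than a real ANTLR CharStream, so that it does not depend on any particular runtime version. The class name UnfoldingReader and the helper unfold() are made up for this example. It assumes a fold is a newline (CRLF, or a bare LF, since the NEWLINE rule above accepts both) followed by exactly one space or tab, and it removes all of those characters before the lexer sees them.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: drops every line fold (newline + one space or tab). */
class UnfoldingReader extends Reader {
    // Two characters of pushback are enough: '\n' plus the character after it.
    private final PushbackReader in;

    UnfoldingReader(Reader source) {
        this.in = new PushbackReader(source, 2);
    }

    private static boolean isLinearWhitespace(int c) {
        return c == ' ' || c == '\t';
    }

    private void unread(int c) throws IOException {
        if (c != -1) in.unread(c);   // never push back end-of-stream
    }

    @Override
    public int read() throws IOException {
        while (true) {
            int c = in.read();
            if (c == '\r') {
                int n1 = in.read();
                if (n1 == '\n') {
                    int n2 = in.read();
                    if (isLinearWhitespace(n2)) continue; // CRLF + WSP: a fold, drop it
                    unread(n2);
                }
                unread(n1);           // not a fold: keep what followed the '\r'
                return c;
            }
            if (c == '\n') {
                int n = in.read();
                if (isLinearWhitespace(n)) continue;      // LF + WSP: assumed to fold too
                unread(n);
                return c;
            }
            return c;                 // ordinary character, or -1 at end of stream
        }
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int count = 0;
        while (count < len) {
            int c = read();
            if (c == -1) break;
            buf[off + count++] = (char) c;
        }
        return count == 0 ? -1 : count;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Convenience for trying it out: unfold a whole string at once. */
    static String unfold(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new UnfoldingReader(new StringReader(text))) {
            for (int c = r.read(); c != -1; c = r.read()) out.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

With this in place the failing input "ca", CRLF, " t=dog;" reaches the lexer as "cat=dog;", so the fold in the middle of the ID disappears and FOLD is no longer needed in the grammar. In ANTLR 3 the wrapped reader could presumably be handed to ANTLRReaderStream; note that line and column numbers reported by the lexer would then refer to the unfolded text, not the original file.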