[antlr-interest] grammar for folded lines

Fri Dec 4 15:27:00 PST 2009

I'm just starting with Antlr after hitting a wall trying to implement
a DSL using a state pattern with regular expressions.

I have the first Antlr book, and it has been quite helpful so far.

One problem that I've run into is folded lines. The specification that 
I'm trying to write a grammar for says in part:

Any sequence of CRLF followed immediately by a single linear white space 
character is ignored (i.e., removed) when processing the content type.

When parsing a content line, folded lines MUST first be unfolded 
according to the unfolding procedure described above.
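
For reference, the unfolding procedure quoted above can be sketched outside
Antlr in a few lines of Python. This is only a minimal sketch: the name
unfold is made up for illustration, and accepting a bare LF as well as CRLF
is an assumption that the quoted text does not make.

```python
import re

# A fold is a line break followed immediately by exactly one space or tab.
# Assumption: a bare LF is accepted in addition to CRLF; the quoted spec
# text only mentions CRLF.
_FOLD = re.compile(r"\r?\n[ \t]")


def unfold(text):
    """Remove every line break that is immediately followed by one space or tab."""
    return _FOLD.sub("", text)


print(repr(unfold("ca\r\n t=dog;\r\n")))  # 'cat=dog;\r\n'
```

Only the line break and the single white space character after it are
removed, so any further white space on the continuation line is kept.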

So, the way I'm reading this is that a folding token (' '|'\t') CRLF can 
come anywhere in the input stream and needs to be ignored before 
processing.

I did the following to discard a folding token that appears between other
tokens in a parser rule.

id: (FOLD)=>
  | ID '=' ID ';' NEWLINE
  | NEWLINE
  ;

FOLD: (' '|'\t') NEWLINE {skip();} ;

NEWLINE: '\r'? '\n' ;

ID: ('a' .. 'z' | 'A' .. 'Z')+ ;

WS: (' '|'\t'|'\r'|'\n')+ {skip();} ;

This works fine for the following input:

cat=dog;
cat = dog;
cat
 = dog;

It fails for this input:

ca
 t=dog;

I'm trying to get two ID tokens (cat and dog) out of the last entry.
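
The expected result can be illustrated with a small Python sketch. It uses a
plain regular expression as a stand-in for the ID rule and does not model how
the Antlr-generated lexer works. id_tokens is a hypothetical helper name, and
the fold pattern assumes a bare LF is acceptable as the line break.

```python
import re


def id_tokens(text):
    # Stand-in for the ID rule ('a'..'z' | 'A'..'Z')+ ; all other
    # characters are simply skipped in this illustration.
    return re.findall(r"[A-Za-z]+", text)


raw = "ca\n t=dog;\n"

# Drop each line break that is followed by one space or tab.
unfolded = re.sub(r"\r?\n[ \t]", "", raw)

print(id_tokens(raw))       # ['ca', 't', 'dog']
print(id_tokens(unfolded))  # ['cat', 'dog']
```

Without unfolding, the fold splits the first identifier in two, so three
identifiers come out instead of two.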

I'm obviously not understanding something fundamental. Hopefully I can 
accomplish this without filtering the input before the Antlr-generated 
code is used.

Pointers welcome.

Thanks in advance -  /mde/