[antlr-interest] detecting transitions in stanza-based files

Tue May 10 12:35:21 PDT 2005

Chris Black wrote:
> I decided perhaps paring down my query would make it a bit easier to 
> read. Sorry for the initial long-winded post. My main problem is trying 
> to detect a transition between lines of 3+ FIELDs long and one of less 
> than 3 FIELDs. I have a token stream after the lexer has run like:

> FIELD DELIM FIELD NEWLINE
> FIELD DELIM FIELD DELIM FIELD DELIM FIELD DELIM NEWLINE
> ...
> FIELD DELIM FIELD DELIM FIELD DELIM FIELD DELIM NEWLINE
> FIELD DELIM FIELD NEWLINE
> FIELD DELIM FIELD DELIM FIELD DELIM FIELD DELIM NEWLINE
> ...
> FIELD DELIM FIELD DELIM FIELD DELIM FIELD DELIM NEWLINE
> 
> My difficulty is detecting the transitions from a series of long lines 
> to the short line separating the stanzas.
> What seems to be happening is my rule to match a long line is trying to 
> be applied to the short line since I am in a rule looking for any number 
> of long lines. Why is this? To simplify, it seems like if I have a few 
> rules like:
> 
> multStanzas: (stanza)+
> stanza: shortLine (longLine)+
> 
> shortLine: FIELD DELIM FIELD DELIM FIELD NEWLINE
> longLine: FIELD DELIM FIELD (DELIM FIELD)+ NEWLINE
> 
> That it tries to match the whole file as one stanza. I thought that once 
> the longLine match failed seeing a short line of less than three FIELDs 
> that ANTLR would then try to match with a longLine rule. What am I 
> missing or doing wrong?

I guess that this might work. It prevents entering the longLine rule if 
there's a short line on the input (without needing a ridiculous k size):

multStanzas: (stanza)+
stanza: shortLine ( { if( LA(6) == NEWLINE ) break;  }: longLine)+

Maybe also a check on EOF is necessary.
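
To make the lookahead test concrete, here is a rough sketch in plain Java 
rather than inside an ANTLR action. It assumes, as in the quoted shortLine 
rule, that a short line is exactly six tokens ending in NEWLINE and that 
every long line is longer than that. The Tok enum and the la and 
shouldLeaveLoop helpers are made up for illustration and are not part of 
ANTLR. The EOF check is folded in as well:

```java
import java.util.List;

class Lookahead {
    // Hypothetical stand-in for the lexer's token types; not the ANTLR API.
    enum Tok { FIELD, DELIM, NEWLINE, EOF }

    // 1-based lookahead in the spirit of LA(i); reports EOF past the end.
    static Tok la(List<Tok> tokens, int pos, int i) {
        int idx = pos + i - 1;
        return idx < tokens.size() ? tokens.get(idx) : Tok.EOF;
    }

    // True if the (longLine)+ loop should stop before the line at pos:
    // either the input is exhausted or the sixth token ahead is NEWLINE,
    // which under the stated assumption means a short line is next.
    static boolean shouldLeaveLoop(List<Tok> tokens, int pos) {
        return la(tokens, pos, 1) == Tok.EOF
            || la(tokens, pos, 6) == Tok.NEWLINE;
    }
}
```

If short lines in the real input have a different length, the 6 would 
have to change to match.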

I think a token filter approach might work as well. Put a filter between 
the lexer and parser that inserts a synthetic token before every line, 
marking the start of that line. Keep a reference to this start marker. 
Then, in the filter, buffer input up to the first NEWLINE or EOF while 
counting the number of fields seen so far. When you reach the NEWLINE, 
set the start marker's token type (setType) to something like SHORTLINE 
or LONGLINE. At that point the filter can pass the start marker to the 
calling parser, followed by the buffered tokens. Once the parser has 
consumed everything the filter has buffered, start over with the next line.
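
A minimal sketch of such a filter, in plain Java so that it stands on its 
own. It does not use the real ANTLR TokenStream or Token classes: tokens 
are reduced to a hypothetical Type enum, the "lexer" is just a list, and 
the threshold of three FIELDs for a long line is taken from the original 
question. Because each line is fully buffered before anything is handed 
out, the marker's type is already known when it is created:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

class LineMarkingFilter {
    // Hypothetical token types; SHORTLINE and LONGLINE are the synthetic markers.
    enum Type { FIELD, DELIM, NEWLINE, EOF, SHORTLINE, LONGLINE }

    private final Iterator<Type> lexer;                      // stand-in for the real lexer
    private final Deque<Type> pending = new ArrayDeque<>();  // one classified line, marker first
    private boolean done = false;

    LineMarkingFilter(List<Type> lexerOutput) {
        this.lexer = lexerOutput.iterator();
    }

    // Called by the parser: hands out buffered tokens, refilling one line at a time.
    Type nextToken() {
        if (pending.isEmpty()) {
            if (done) return Type.EOF;
            bufferLine();
        }
        return pending.removeFirst();
    }

    // Reads up to and including the next NEWLINE or EOF, counting FIELDs,
    // then queues the marker followed by the line's own tokens.
    private void bufferLine() {
        Deque<Type> line = new ArrayDeque<>();
        int fields = 0;
        Type t;
        do {
            t = lexer.hasNext() ? lexer.next() : Type.EOF;
            if (t == Type.FIELD) fields++;
            line.addLast(t);
        } while (t != Type.NEWLINE && t != Type.EOF);
        if (t == Type.EOF) done = true;

        // A bare EOF is not a line, so it gets no marker.
        boolean onlyEof = line.size() == 1 && t == Type.EOF;
        if (!onlyEof) {
            pending.addLast(fields >= 3 ? Type.LONGLINE : Type.SHORTLINE);
        }
        pending.addAll(line);
    }
}
```

In a real ANTLR setup the filter would wrap the lexer and be handed to the 
parser in its place; that wiring is left out here.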

Your actual parser would then see:

shortline: SHORTLINE (FIELD DELIM)+ NEWLINE ;
longline: LONGLINE (FIELD DELIM)+ NEWLINE ;
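
With the markers in place, the choice between leaving and continuing the 
long-line loop needs only one token of lookahead. A hand-written sketch of 
that decision, again in plain Java with a hypothetical Type enum instead 
of ANTLR-generated code. The line body here is looser than the rules 
above (any run of FIELD and DELIM up to NEWLINE), purely to keep the 
example short:

```java
import java.util.ArrayList;
import java.util.List;

class StanzaParser {
    // Hypothetical token types, including the synthetic line markers.
    enum Type { FIELD, DELIM, NEWLINE, EOF, SHORTLINE, LONGLINE }

    private final List<Type> tokens;
    private int pos = 0;

    StanzaParser(List<Type> tokens) { this.tokens = tokens; }

    // One token of lookahead; EOF once the list is exhausted.
    private Type la1() { return pos < tokens.size() ? tokens.get(pos) : Type.EOF; }

    private void match(Type expected) {
        if (la1() != expected) {
            throw new IllegalStateException(
                "expected " + expected + " but saw " + la1() + " at " + pos);
        }
        pos++;
    }

    // Simplified line body: (FIELD | DELIM)* NEWLINE
    private void lineBody() {
        while (la1() == Type.FIELD || la1() == Type.DELIM) pos++;
        match(Type.NEWLINE);
    }

    // multStanzas: (stanza)+ ; returns the number of long lines in each stanza.
    List<Integer> multStanzas() {
        List<Integer> counts = new ArrayList<>();
        do {
            counts.add(stanza());
        } while (la1() == Type.SHORTLINE);
        match(Type.EOF);
        return counts;
    }

    // stanza: one short line followed by one or more long lines.
    private int stanza() {
        match(Type.SHORTLINE);
        lineBody();
        int longLines = 0;
        do {
            match(Type.LONGLINE);
            lineBody();
            longLines++;
        } while (la1() == Type.LONGLINE);  // a SHORTLINE marker ends the stanza
        return longLines;
    }
}
```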

My guess is that it would perform better than a syntactic predicate.

Cheers,

Ric