[antlr-interest] detecting transitions in stanza-based files
Ric Klaren
ric.klaren at gmail.com
Tue May 10 12:35:21 PDT 2005
Chris Black wrote:
> I decided perhaps paring down my query would make it a bit easier to
> read. Sorry for the initial long-winded post. My main problem is trying
> to detect a transition between lines of 3+ FIELDs long and one of less
> than 3 FIELDs. I have a token stream after the lexer has run like:
> FIELD DELIM FIELD NEWLINE
> FIELD DELIM FIELD DELIM FIELD DELIM FIELD DELIM NEWLINE
> ...
> FIELD DELIM FIELD DELIM FIELD DELIM FIELD DELIM NEWLINE
> FIELD DELIM FIELD NEWLINE
> FIELD DELIM FIELD DELIM FIELD DELIM FIELD DELIM NEWLINE
> ...
> FIELD DELIM FIELD DELIM FIELD DELIM FIELD DELIM NEWLINE
>
> My difficulty is detecting the transitions from a series of long lines
> to the short line separating the stanzas.
> What seems to be happening is my rule to match a long line is trying to
> be applied to the short line since I am in a rule looking for any number
> of long lines. Why is this? To simplify, it seems like if I have a few
> rules like:
>
> multStanzas: (stanza)+
> stanza: shortLine (longLine)+
>
> shortLine: FIELD DELIM FIELD DELIM FIELD NEWLINE
> longLine: FIELD DELIM FIELD (DELIM FIELD)+ NEWLINE
>
> That it tries to match the whole file as one stanza. I thought that once
> the longLine match failed seeing a short line of less than three FIELDs
> that ANTLR would then try to match with a longLine rule. What am I
> missing or doing wrong?
I guess this might work. It prevents entering the longLine rule when
there's a short line on the input (without needing a ridiculous k size):
multStanzas: (stanza)+
stanza: shortLine ( { if( LA(6) == NEWLINE ) break; }: longLine)+
A check for EOF may also be necessary.
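To make the guard concrete, here is a rough Python simulation of the loop, not ANTLR code. It assumes a short line has the shape of the quoted shortLine rule (three FIELDs, so its NEWLINE is the sixth token); for the two-FIELD short lines in the token dump the check would be LA(4) instead, which is why the offset is a parameter. The names la, line_end, split_stanzas and short_len are made up for illustration, and the EOF test is the extra check mentioned above.

```python
# Token types as plain strings (illustrative only, not ANTLR constants).
FIELD, DELIM, NEWLINE, EOF = "FIELD", "DELIM", "NEWLINE", "EOF"

def la(tokens, pos, k):
    """Type of the k-th lookahead token (1-based); EOF past the end."""
    i = pos + k - 1
    return tokens[i] if i < len(tokens) else EOF

def line_end(tokens, pos):
    """Index just past the next NEWLINE, or the end of input."""
    while pos < len(tokens) and tokens[pos] != NEWLINE:
        pos += 1
    return min(pos + 1, len(tokens))

def split_stanzas(tokens, short_len=6):
    """Group lines into stanzas: one short line, then the long lines after it.

    short_len is the position of NEWLINE in a short line (assumed 6 here).
    """
    stanzas, pos = [], 0
    while la(tokens, pos, 1) != EOF:
        end = line_end(tokens, pos)
        stanza = [tokens[pos:end]]          # the short line opening the stanza
        pos = end
        # Guard: leave the long-line loop at EOF or when a short line is next.
        while la(tokens, pos, 1) != EOF and la(tokens, pos, short_len) != NEWLINE:
            end = line_end(tokens, pos)
            stanza.append(tokens[pos:end])  # one long line
            pos = end
        stanzas.append(stanza)
    return stanzas
```

The point is only that a fixed lookahead of one token at a known offset is enough to tell the two line kinds apart, as long as every long line is longer than short_len tokens.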
I think a token filter approach might work as well. Put a filter between
the lexer and parser that inserts a synthetic token before every line to
mark the start of that line. Keep a reference to this start marker. Then,
in the filter, buffer input up to the first NEWLINE or EOF while counting
the number of fields seen so far. When you reach the NEWLINE, update the
start marker's token type ($setType) to something like SHORTLINE or
LONGLINE. At that point the filter can pass the start marker to the
calling parser. Once the parser has consumed everything the filter has
buffered so far, repeat the process for the next line.
Your actual parser would then see:
shortline: SHORTLINE (FIELD DELIM)+ NEWLINE ;
longline: LONGLINE (FIELD DELIM)+ NEWLINE ;
My guess is that it would perform better than a syntactic predicate.
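The filter idea can be sketched roughly as follows. This is plain Python rather than an ANTLR token stream, so the Token class, mark_lines, _flush and the long_threshold parameter are all hypothetical names. It assumes a line counts as long when it has three or more FIELDs (as in the original question) and that the end of the token iterable stands in for EOF.

```python
from dataclasses import dataclass

@dataclass
class Token:
    type: str
    text: str = ""

def _flush(buffered, fields, long_threshold):
    """Emit the line's marker token, then the buffered tokens of that line."""
    # The marker type is only known now that the whole line has been counted.
    kind = "LONGLINE" if fields >= long_threshold else "SHORTLINE"
    yield Token(kind)
    yield from buffered

def mark_lines(tokens, long_threshold=3):
    """Yield the input tokens with a SHORTLINE/LONGLINE marker before each line."""
    buffered, fields = [], 0
    for tok in tokens:
        buffered.append(tok)
        if tok.type == "FIELD":
            fields += 1
        elif tok.type == "NEWLINE":
            yield from _flush(buffered, fields, long_threshold)
            buffered, fields = [], 0
    if buffered:  # last line ended at EOF without a NEWLINE
        yield from _flush(buffered, fields, long_threshold)
```

Because the generator only buffers one line at a time, the downstream consumer never needs more than one token of lookahead to pick between the two line rules.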
Cheers,
Ric