[antlr-interest] detecting transitions in stanza-based files

Tue May 10 15:11:37 PDT 2005

Thanks for all your help, everyone. I think I'm going to use a 
combination approach: a TokenFilter (perhaps extending the newer 
TokenStreamRewriteEngine) that adds imaginary tokens to tag the 
beginning of stanzas and also removes extraneous DELIMs. Part of my 
difficulty in writing these parsers is that people often want them to 
work on CSV-type files that have been mangled by Excel. Excel likes to 
add enough delimiters at the end of every line that all lines have an 
equal number of columns. This leads to lots of rules in my grammars that 
end in "(DELIM)* NEWLINE", which I understand can be inefficient and can 
also lead to nondeterminism problems. I am going to have my TokenFilter 
remove these. This will change my parser flow so that the file goes 
through the lexer, then the filter parser, then my tree-building parser, 
and finally the tree parser. Hopefully, simplifying the tree-building 
parser will make this acceptably quick.
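
A minimal sketch of the trailing-DELIM removal described above, written 
in plain Python over a list of token type names rather than against the 
ANTLR TokenStream API. The function name strip_trailing_delims is 
hypothetical, and the sketch assumes that every DELIM sitting directly 
before a NEWLINE (or the end of input) is Excel padding and can be 
dropped. DELIMs between FIELDs, including ones marking empty columns in 
the middle of a line, are kept.

```python
def strip_trailing_delims(tokens):
    """Drop every DELIM that sits directly before a NEWLINE or end of input.

    `tokens` is a list of token type names such as "FIELD", "DELIM",
    "NEWLINE". This is an illustration only, not the ANTLR filter API.
    """
    out = []
    pending = 0  # DELIMs seen but not yet known to be trailing padding
    for tok in tokens:
        if tok == "DELIM":
            pending += 1              # hold it back until we see what follows
        elif tok == "NEWLINE":
            pending = 0               # held DELIMs were padding: discard them
            out.append(tok)
        else:
            out.extend(["DELIM"] * pending)  # held DELIMs were separators
            pending = 0
            out.append(tok)
    return out  # DELIMs still held at end of input are dropped as well
```

With this in front of the parser, a rule could end in a plain NEWLINE 
instead of "(DELIM)* NEWLINE", since the padding never reaches it.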

I plan to have a rule that matches a short line (the stanza 
headers/separators), one that matches a long line (actual data records) 
and one that matches 2 or more DELIMs in a row at the end of a line. The 
stanza header rule will add an imaginary token that marks the beginning 
of a stanza and the end of line rule will remove extraneous DELIMs. 
Hopefully this will work. The one problem I see is that having DELIM 
(DELIM)+ NEWLINE at the end of a line would lead to nondeterminisms for 
finite lookahead, so I will most likely need some sort of predicate 
system that distinguishes all the possible types of lines (short with 
extra delims, short w/o extra delims, long with extra delims, and long 
w/o extra delims).
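
One way to sidestep the four-way case split, sketched here in plain 
Python over token type names (not the ANTLR API): if the filter buffers 
a whole line before passing it on, it can strip the trailing DELIMs 
first and then simply count FIELDs, so "with extra delims" and "w/o 
extra delims" collapse into the same case. The names mark_stanzas and 
STANZA_START are hypothetical, and the threshold of 3 FIELDs for a long 
line is taken from the description in the quoted message below.

```python
def _finish_line(line, long_min_fields):
    """Clean one buffered line and tag it if it is a short (header) line."""
    has_newline = bool(line) and line[-1] == "NEWLINE"
    body = line[:-1] if has_newline else list(line)
    while body and body[-1] == "DELIM":   # remove trailing padding DELIMs
        body.pop()
    tail = ["NEWLINE"] if has_newline else []
    if not body:                          # blank line: pass through untagged
        return tail
    is_short = body.count("FIELD") < long_min_fields
    prefix = ["STANZA_START"] if is_short else []  # imaginary marker token
    return prefix + body + tail


def mark_stanzas(tokens, long_min_fields=3):
    """Buffer tokens line by line and emit the cleaned, tagged stream."""
    out, line = [], []
    for tok in tokens:
        line.append(tok)
        if tok == "NEWLINE":
            out.extend(_finish_line(line, long_min_fields))
            line = []
    if line:                              # last line without a NEWLINE
        out.extend(_finish_line(line, long_min_fields))
    return out
```

The parser would then only need one token of lookahead to see where a 
stanza starts, at the cost of buffering one line at a time in the 
filter.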

I'll start work on this tomorrow, so if anyone has any 
advice, input, or pointers to examples/docs, I'd appreciate it.

Thanks again!
Chris

Ric Klaren wrote:

> Chris Black wrote:
>
>> I decided perhaps paring down my query would make it a bit easier to 
>> read. Sorry for the initial long-winded post. My main problem is 
>> trying to detect a transition between lines of 3+ FIELDs long and one 
>> of less than 3 FIELDs. I have a token stream after the lexer has run 
>> like:
>
>
>> FIELD DELIM FIELD NEWLINE
>> FIELD DELIM FIELD DELIM FIELD DELIM FIELD DELIM NEWLINE
>
> ...
>
>> FIELD DELIM FIELD DELIM FIELD DELIM FIELD DELIM NEWLINE
>> FIELD DELIM FIELD NEWLINE
>> FIELD DELIM FIELD DELIM FIELD DELIM FIELD DELIM NEWLINE
>
> ...
>
>> FIELD DELIM FIELD DELIM FIELD DELIM FIELD DELIM NEWLINE
>>
>> My difficulty is detecting the transitions from a series of long 
>> lines to the short line separating the stanzas.
>> What seems to be happening is my rule to match a long line is trying 
>> to be applied to the short line since I am in a rule looking for any 
>> number of long lines. Why is this? To simplify, it seems like if I 
>> have a few rules like:
>>
>> multStanzas: (stanza)+
>> stanza: shortLine (longLine)+
>>
>> shortLine: FIELD DELIM FIELD DELIM FIELD NEWLINE
>> longLine: FIELD DELIM FIELD (DELIM FIELD)+ NEWLINE
>>
>> That it tries to match the whole file as one stanza. I thought that 
>> once the longLine match failed seeing a short line of less than three 
>> FIELDs that ANTLR would then try to match with a shortLine rule. What 
>> am I missing or doing wrong?
>
>
> I guess that this might work. It prevents entering the longLine rule if 
> there's a shortLine on the input (without a ridiculous k size):
>
> multStanzas: (stanza)+
> stanza: shortLine ( { if( LA(6) == NEWLINE ) break;  }: longLine)+
>
> Maybe also a check on EOF is necessary.
>
> I think a token filter approach might work as well. Put a filter 
> between the lexer and parser that inserts a synthetic token before 
> every line to mark its start. Keep a reference to this start 
> marker. Then, in the filter, buffer up input to the first NEWLINE or 
> EOF whilst counting the number of fields so far. When you get to the 
> NEWLINE, update the start marker's tokentype ($setType) to something 
> like SHORTLINE or LONGLINE. At that point you can pass the start 
> marker to the calling parser from the filter. Then wait until the 
> calling parser has consumed the input the filter has read so far, and 
> start over with the next line.
>
> Your actual parser would then see:
>
> shortline: SHORTLINE (FIELD DELIM)+ NEWLINE ;
> longline: LONGLINE (FIELD DELIM)+ NEWLINE ;
>
> My guess is that it would perform better than a syntactic predicate.
>
> Cheers,
>
> Ric