[antlr-interest] SUCCESS! (mostly) detecting transitions in stanza-based files

Chris Black chris at lotuscat.com
Wed May 11 14:46:02 PDT 2005


First of all, thanks to everyone on the list who read my message, 
especially those who replied :)
The main issue I was having was that, with a bunch of extra commas 
added at the end of lines by Excel as well as the lack of a newline 
between stanzas, I was having trouble detecting the transitions 
between stanzas. After much reading and fiddling around, I now have a 
preliminary TokenFilter that looks ONLY at these transitions, and when 
it finds one it adds an imaginary token. This approach was proposed by 
Ric; it worked out quite well and did not seem to slow things down too 
much. 
I followed Monty's article at 
http://www.codetransform.com/filterexample.html. I had to create my 
own TokenQueue to avoid modifying core ANTLR classes. It would be nice 
if the ANTLR distribution's TokenQueue were public and had a length 
method to facilitate this sort of filtering for others.
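
For anyone building something similar: the queue needs only a tiny 
API (append, length, elementAt, removeFirst). Here is a rough, 
self-contained sketch of that shape; it is not my actual MyTokenQueue 
(which holds antlr.Token objects), just an illustrative stand-in that 
compiles without ANTLR on the classpath:

```java
import java.util.ArrayDeque;

// Hypothetical stand-in for a filter-friendly TokenQueue: a plain
// FIFO with the length() accessor that ANTLR's own TokenQueue lacks.
class SimpleTokenQueue {
    private final ArrayDeque<Object> buf;

    public SimpleTokenQueue(int initialCapacity) {
        buf = new ArrayDeque<Object>(initialCapacity);
    }

    public void append(Object tok) { buf.addLast(tok); }

    public int length() { return buf.size(); }

    // Only index 0 is needed by the filter, but support any index.
    public Object elementAt(int i) {
        int k = 0;
        for (Object o : buf) {
            if (k++ == i) return o;
        }
        throw new IndexOutOfBoundsException(String.valueOf(i));
    }

    public void removeFirst() { buf.removeFirst(); }
}
```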

Now to my remaining (small) issue :)  For some reason, when my 
parser/treeparser take input from the filtered TokenStream, I get a 
spurious message to stderr:
line n:1: unexpected token: null

I've looked this up and googled around, but none of the suggested 
fixes gets rid of it. I've tried adding token matches for EOF, as well 
as changing my filter to emit an extra NEWLINE rather than an EOF. 
I've also noticed that I never see the println I added to the 
return-EOF branch of nextToken(). Any ideas?

Thanks again everyone,
Chris

For those who are curious, here is my token filter and the glue that 
makes it usable:
header {
    package mypackage;
    import antlr.*;
}

class StanzaParser extends Parser;
options {
    importVocab=CSV;
    k=2;
}

tokens {
    STANZASEPARATOR;
}

{
    MyTokenQueue queue = new MyTokenQueue(8);

    public void consume() {
        try {
            queue.append(LT(1));
        } catch (TokenStreamException e) {
            System.err.println("error in consume");
            System.err.println(e);
            e.printStackTrace();
        }
        super.consume();
    }

    public Token nextToken() throws TokenStreamException {
        Token ret;
        if (queue.length() <= 0) {
            // Queue is empty: recognize one more line, which (via
            // consume() above) refills the queue as a side effect.
            try {
                line();
            } catch (RecognitionException e) {
                // ignore and fall through to drain whatever is queued
            } catch (TokenStreamException e) {
                // ignore
            }
        }
        if (queue.length() > 0) {
            ret = queue.elementAt(0);
            queue.removeFirst();
            return ret;
        }
        System.out.println("no more queue, returning EOF");
        return new Token(Token.EOF_TYPE, "end of file");
    }
}

line:
    (NEWLINE) => emptyLine
    | ((FIELD | DELIM)+ NEWLINE) => contentLine
    | (DELIM (FIELD | DELIM)+ NEWLINE) => delim1stLine
    ;

emptyLine: NEWLINE ;

delim1stLine: DELIM (FIELD | DELIM)+ NEWLINE ;

contentLine: firstTok:FIELD
    {
        String firstText = firstTok.getText();
        if (firstText.startsWith("Data Type")
                || firstText.startsWith("DataType")
                || firstText.equals("Count")
                || firstText.equals("Result")) {
            queue.append(new Token(STANZASEPARATOR, "stanza sep"));
        }
    }
    (FIELD | DELIM)* NEWLINE
    ;

--- and here is the glue ---
package mypackage;

import antlr.*;

/**
 * A filtering TokenStream that adds special markers at stanza
 * separations to make downstream parsing much easier.
 */
public class StanzaMarker implements TokenStream {
  StanzaParser filter;
 
  public StanzaMarker(TokenStream input) {
    filter = new StanzaParser(input);
  }
 
  public Token nextToken() throws TokenStreamException {
    Token tok = filter.nextToken();
    return tok;
  }
}
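
The essential trick, stripped of ANTLR entirely, is just a 
pass-through filter that pushes an imaginary token in front of 
anything that looks like a stanza header. The sketch below is 
illustrative only (the token codes, class name, and the "Data Type" 
test are stand-ins, not my real generated classes):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the lexer -> marker-filter -> parser chain: tokens are
// (type, text) pairs, and the filter inserts a synthetic separator
// token before each FIELD whose text marks the start of a stanza.
class StanzaMarkerSketch {
    static final int FIELD = 1;
    static final int STANZASEP = 2; // the imaginary token

    static class Tok {
        final int type;
        final String text;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    static List<Tok> mark(List<Tok> input) {
        List<Tok> out = new ArrayList<Tok>();
        for (Tok t : input) {
            if (t.type == FIELD && t.text.startsWith("Data Type")) {
                out.add(new Tok(STANZASEP, "stanza sep"));
            }
            out.add(t);
        }
        return out;
    }
}
```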

Chris Black wrote:

> Thanks for all your help everyone, I think I'm going to do a 
> combination approach using a TokenFilter (perhaps extending the newer 
> TokenStreamRewriteEngine) to add imaginary tokens to tag the beginning 
> of stanzas and also remove extraneous DELIMs. Part of my difficulty in 
> writing these parsers is that many times people want them to work on 
> csv-type files that have been mangled by Excel. Excel likes to add 
> enough delimiters at the end of every line so that all lines have an 
> equal number of columns; this leads to lots of rules in my grammars 
> that end in "(DELIM)* NEWLINE", which I understand can be inefficient 
> and also lead to some nondeterminism difficulties. I am going to have 
> my TokenFilter remove these. This will change my parser flow so the 
> file goes through the lexer, then the filter parser, then my 
> tree-building parser, and finally the tree parser. Hopefully, by 
> simplifying the tree-building parser, this will be acceptably quick.
>
> I plan to have a rule that matches a short line (the stanza 
> headers/separators), one that matches a long line (actual data 
> records), and one that matches 2 or more DELIMs in a row at the end 
> of a line. The stanza header rule will add an imaginary token that 
> marks the beginning of a stanza, and the end-of-line rule will remove 
> extraneous DELIMs. Hopefully this will work; the one problem I see is 
> that having DELIM (DELIM)+ NEWLINE at the end of a line would lead to 
> nondeterminisms for finite lookahead, so I will most likely need to 
> make some sort of predicate system that matches all the possible 
> types of lines (short with extra delims, short w/o extra delims, long 
> with extra delims and long w/o extra delims).
>
> I'll start work on this tomorrow so if anyone has any 
> advice/input/pointers to examples/docs I'd appreciate it.
>
> Thanks again!
> Chris
>
> Ric Klaren wrote:
>
>> Chris Black wrote:
>>
>>> I decided perhaps paring down my query would make it a bit easier to 
>>> read. Sorry for the initial long-winded post. My main problem is 
>>> trying to detect a transition between lines of 3+ FIELDs long and 
>>> one of less than 3 FIELDs. I have a token stream after the lexer has 
>>> run like:
>>
>>> FIELD DELIM FIELD NEWLINE
>>> FIELD DELIM FIELD DELIM FIELD DELIM FIELD DELIM NEWLINE
>>
>>
>> ...
>>
>>> FIELD DELIM FIELD DELIM FIELD DELIM FIELD DELIM NEWLINE
>>> FIELD DELIM FIELD NEWLINE
>>> FIELD DELIM FIELD DELIM FIELD DELIM FIELD DELIM NEWLINE
>>
>>
>> ...
>>
>>> FIELD DELIM FIELD DELIM FIELD DELIM FIELD DELIM NEWLINE
>>>
>>> My difficulty is detecting the transitions from a series of long 
>>> lines to the short line separating the stanzas.
>>> What seems to be happening is my rule to match a long line is trying 
>>> to be applied to the short line since I am in a rule looking for any 
>>> number of long lines. Why is this? To simplify, it seems like if I 
>>> have a few rules like:
>>>
>>> multStanzas: (stanza)+
>>> stanza: shortLine (longLine)+
>>>
>>> shortLine: FIELD DELIM FIELD DELIM FIELD NEWLINE
>>> longLine: FIELD DELIM FIELD (DELIM FIELD)+ NEWLINE
>>>
>>> That it tries to match the whole file as one stanza. I thought that 
>>> once the longLine match failed seeing a short line of less than 
>>> three FIELDs that ANTLR would then try to match with a longLine 
>>> rule. What am I missing or doing wrong?
>>
>> I guess this might work; it prevents entering the longLine rule 
>> if there's a shortLine on the input (without a ridiculous k size):
>>
>> multStanzas: (stanza)+
>> stanza: shortLine ( { if( LA(6) == NEWLINE ) break;  }: longLine)+
>>
>> Maybe also a check on EOF is necessary.
>>
>> I think a token filter approach might work as well. Put a filter 
>> between the lexer and the parser that inserts, before every line, a 
>> synthetic token marking the start of the line, and keep a reference 
>> to this start marker. Then, in the filter, buffer up input to the 
>> first NEWLINE or EOF whilst counting the number of fields so far. 
>> When you get to the NEWLINE, update the start marker's token type 
>> ($setType) to something like SHORTLINE or LONGLINE. At that point 
>> you can pass the start marker from the filter to the calling parser, 
>> wait until the calling parser has consumed the input the filter has 
>> read so far, and start over.
>>
>> Your actual parser would then see:
>>
>> shortline: SHORTLINE (FIELD DELIM)+ NEWLINE ;
>> longline: LONGLINE (FIELD DELIM)+ NEWLINE ;
>>
>> My guess is that it would perform better than a syntactic predicate.
>>
>> Cheers,
>>
>> Ric
>
>
>
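
For completeness, Ric's marker-retyping idea can be modeled without 
ANTLR as well: buffer one line's worth of token types, count the 
FIELDs, and only then decide whether the line's leading marker is 
SHORTLINE or LONGLINE. The token codes and the three-field threshold 
below are illustrative stand-ins, not his actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of Ric's suggested filter: each line's marker type is
// decided only after the whole line (up to NEWLINE) has been buffered
// and its FIELD tokens counted.
class LineTypeFilterSketch {
    static final int FIELD = 1, DELIM = 2, NEWLINE = 3;
    static final int SHORTLINE = 4, LONGLINE = 5; // synthetic markers

    static List<Integer> retype(List<Integer> input) {
        List<Integer> out = new ArrayList<Integer>();
        List<Integer> line = new ArrayList<Integer>();
        int fields = 0;
        for (int t : input) {
            line.add(t);
            if (t == FIELD) fields++;
            if (t == NEWLINE) {
                // Line complete: now we know which marker it gets.
                out.add(fields >= 3 ? LONGLINE : SHORTLINE);
                out.addAll(line);
                line.clear();
                fields = 0;
            }
        }
        out.addAll(line); // trailing partial line at EOF, if any
        return out;
    }
}
```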


