[antlr-interest] Using a Parser as a TokenFilter

Chris Black chris at lotuscat.com
Thu May 12 07:52:46 PDT 2005


Ric Klaren wrote:

>Hi,
>
>On 5/11/05, Chris Black <chris at lotuscat.com> wrote:
>  
>
> [stuff deleted]
>
>>line:
>>    (NEWLINE) => emptyLine
>>    | ((FIELD | DELIM)+ NEWLINE) => contentLine
>>    ;
>>    
>>
>
>I'd get rid of these predicates. They serve no purpose. My rule of
>thumb: when you've got a rule with every alternative guarded by a
>syntactic predicate then you're probably doing something wrong (you
>get rules that might not consume input, and that's usually good for
>strange stuff).
>
>  
>
Good point. I have changed it to just use | alternatives.

>The lookahead you're looking at is not ambiguous, and in the case of
>erroneous input the rule might not consume anything. Also you're not
>handling EOF... e.g. in case of EOF you'll get a RecognitionException
>that gets eaten by nextToken (silently as well) (try adding a few
>println's to the bits that eat exceptions; for a filter I wrote a
>while back some of them needed extra handling).
>
>  
>
I added some printlns in the nextToken exception handlers; they never
printed anything, but it's probably not a bad idea to keep them in there anyway.

>The following rule is a lot simpler. Try to differentiate between the
>things you want with some attributes in this filter (e.g. when you
>call the line rule from nextToken, set an attribute that you started a
>line and reset a counter for the fields, then update the counter in
>the closure below). You might also want to set a flag in the
>(NEWLINE|EOF) bit, so you can better detect how the line rule ended
>(and when the next line will start!). I also don't see the code that
>inserts the marker at the start of the line.
>
>line:
>    (FIELD | DELIM)* ( NEWLINE | EOF )
>    ;
>
>Or something like:
>
>line
>:  FIELD (DELIM FIELD)* (DELIM)* eol
>|  (DELIM)* eol
>;
>
>eol:( NEWLINE | EOF );
>
>  
>
I switched to having the eol rule; this actually turned out to stop the
spurious unexpected null token errors I was seeing from the downstream
parser as well.

>Also, first get things to work with the marking of the stanzas, then
>add the comma eating. When a filter starts eating input it might eat
>all the input, and that takes some extra handling if I recall right.
>It's probably a good idea to let the line rule finish in nextToken,
>then check the tail of the queue for trailing commas and nuke them
>from the queue.
>
>  
>
That's probably a better approach. Before getting your reply I gave up
on the trailing DELIM eating for a bit and switched focus to adding
imaginary tokens, which solved my downstream problems on its own. I may
still do further work to kill the trailing delims and clean up the
downstream parser in the future.
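Ric's tail-trimming idea can be sketched in plain Java. This is a
minimal sketch, not ANTLR code: the integer token-type constants, the
List-backed queue, and the trimTrailingDelims helper are all hypothetical
stand-ins for the real MyTokenQueue, just to show the "let the line rule
finish, then nuke trailing DELIMs from the tail of the queue" step.

```java
import java.util.ArrayList;
import java.util.List;

public class TrimTrailingDelims {
    // Made-up token types standing in for the CSV vocabulary.
    static final int FIELD = 1, DELIM = 2, NEWLINE = 3;

    // After the line rule has filled the queue, drop any DELIMs sitting
    // just before a trailing NEWLINE (or at the very end of the queue),
    // leaving everything else untouched.
    static void trimTrailingDelims(List<Integer> queue) {
        int end = queue.size();
        boolean endsWithNewline = end > 0 && queue.get(end - 1) == NEWLINE;
        int i = endsWithNewline ? end - 2 : end - 1;
        while (i >= 0 && queue.get(i) == DELIM) {
            queue.remove(i);
            i--;
        }
    }

    public static void main(String[] args) {
        // FIELD DELIM FIELD DELIM DELIM NEWLINE -> trailing delims go away
        List<Integer> q = new ArrayList<>(
                List.of(FIELD, DELIM, FIELD, DELIM, DELIM, NEWLINE));
        trimTrailingDelims(q);
        System.out.println(q); // [1, 2, 1, 3]
    }
}
```

The key point is that the trim runs over the queue after the rule
returns, so the filter never has to predict mid-line whether a DELIM is
trailing or not.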

>Tip: Read the code generated for the line rule and get a feel for how
>it interacts with your consume & nextToken methods. In this case it is
>also feasible to handcode the filter, since it's not that complex
>parsing-wise.
>
>I'm afraid I might not be too coherent/clear in this post but there
>should be some tips in it that might get you going again. I'll look
>again at it when I'm at home again.
>
>Cheers,
>
>Ric
>  
>
I appreciate all the help! For the curious, here is what I have now:
header {
    package mypackage;
    import antlr.*;
}

class StanzaParser extends Parser;
options {
    importVocab=CSV;
    k=2;
}

tokens {
    STANZASEPARATOR;
}

{
    MyTokenQueue queue = new MyTokenQueue(8);

    // Every token the line rule matches is appended to the queue
    // before the parser actually consumes it.
    public void consume() {
        try {
            queue.append(LT(1));
        } catch (TokenStreamException e) {
            System.err.println("error in consume");
            System.err.println(e);
            e.printStackTrace();
        }
        super.consume();
    }

    // Downstream parsers treat this object as their token stream:
    // refill the queue by parsing one line, then hand out queued tokens
    // one at a time, falling back to an explicit EOF token (never null).
    public Token nextToken() throws TokenStreamException {
        Token ret;
        if (queue.length() <= 0) {
            try {
                line();
            } catch (RecognitionException e) {
                System.err.println("recog exception in nextToken");
                System.err.println(e);
                e.printStackTrace();
            } catch (TokenStreamException e) {
                System.err.println("tokenstream exception in nextToken");
                System.err.println(e);
                e.printStackTrace();
            }
        }
        if (queue.length() > 0) {
            ret = queue.elementAt(0);
            queue.removeFirst();
            return ret;
        }
        System.out.println("no more queue, returning EOF");
        return new Token(Token.EOF_TYPE, "end of file");
    }
}

line: (emptyLine | contentLine | delim1stLine) ;

emptyLine: eol ;

delim1stLine: DELIM (FIELD | DELIM)+ eol ;

contentLine: firstTok:FIELD
    {
        String firstText = firstTok.getText();
        if (firstText.startsWith("Data Type")
                || firstText.startsWith("DataType")
                || firstText.equals("Count")
                || firstText.equals("Result")) {
            queue.append(new Token(STANZASEPARATOR, "stanza sep"));
        }
    }
    (FIELD | DELIM)* eol
    ;
   
eol: (NEWLINE | EOF) ;
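For readers piecing together how the grammar above behaves at runtime,
here is a hedged plain-Java sketch of the overall filter pattern with no
ANTLR dependency. The class name FilterSketch, the integer token types,
and the simplified separator logic are all invented for illustration
(the real grammar only injects STANZASEPARATOR for specific header
fields like "Data Type"); the shape of consume-into-queue plus
queue-draining nextToken mirrors the grammar's action code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FilterSketch {
    // Made-up token types; -1 plays the role of Token.EOF_TYPE.
    static final int EOF = -1, FIELD = 1, NEWLINE = 3, STANZASEPARATOR = 4;

    private final int[] input;                        // incoming token stream
    private int pos = 0;
    private final Deque<Integer> queue = new ArrayDeque<>();

    FilterSketch(int[] input) { this.input = input; }

    // Stand-in for the generated line() rule: consume tokens up to and
    // including NEWLINE, injecting STANZASEPARATOR right after a line's
    // first FIELD (a simplified version of the contentLine action).
    private void line() {
        boolean first = true;
        while (pos < input.length) {
            int t = input[pos++];
            queue.add(t);                             // what consume() does
            if (first && t == FIELD) queue.add(STANZASEPARATOR);
            first = false;
            if (t == NEWLINE) return;
        }
    }

    // Downstream parsers call this instead of the lexer's nextToken():
    // refill from line(), drain the queue, and return an explicit EOF
    // token rather than null when input runs out.
    public int nextToken() {
        if (queue.isEmpty()) line();
        if (!queue.isEmpty()) return queue.removeFirst();
        return EOF;
    }

    public static void main(String[] args) {
        // FIELD NEWLINE | NEWLINE -- note the injected STANZASEPARATOR
        FilterSketch f = new FilterSketch(new int[]{FIELD, NEWLINE, NEWLINE});
        int t;
        while ((t = f.nextToken()) != EOF) System.out.print(t + " ");
        System.out.println(); // 1 4 3 3
    }
}
```

Returning an explicit EOF token here is the same fix as the grammar's
`new Token(Token.EOF_TYPE, ...)` fallback, which is what stops the
downstream parser from ever seeing a null token.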

