[antlr-interest] Using a Parser as a TokenFilter

Ric Klaren ric.klaren at gmail.com
Thu May 12 01:17:34 PDT 2005


Hi,

On 5/11/05, Chris Black <chris at lotuscat.com> wrote:
> To start I'm just trying to strip the extra commas at the end of each
> line; I have something like what is at the end of this message. Not
> only does this give me a stack overflow error when it actually does
> encounter extra commas, but it also seems to cause an "unexpected
> token: null" error in the downstream parser in other cases, even after
> adding an EOF at the end of the main rule. After building/running with
> -trace, I think this may have something to do with the lookahead being
> filled with nulls.

> // filter to change lines like "foo,bar,baz,,,,,,,," into "foo,bar,baz,"
>     public void consume() {
>         try {
>           if(LA(1) == DELIM && LA(2) == DELIM && LA(3) == DELIM) {
>               //System.out.println("skipping extra commas");
>               //System.out.flush();
>               queue.append(LT(1)); consumeUntil(NEWLINE);
>           } else {
>               queue.append(LT(1));
>           }
>           super.consume();
>         } catch(TokenStreamException e) {
>             System.err.println("error in consume");
>             System.err.println(e);
>             e.printStackTrace();
>         }
>     }

>     public Token nextToken() throws TokenStreamException {
>         Token ret;
>         if(queue.length() <= 0) {
>             try {
>                 line();
>             } catch(RecognitionException e) { ; }
>             catch(TokenStreamException e) { ; }
>         }
>         if(queue.length() > 0) {
>             ret = queue.elementAt(0);
>             queue.removeFirst();
>             return ret;
>         }
>         System.out.println("no more queue, returning EOF");
>         return new Token(Token.EOF_TYPE,"");
>     }
> }

Make sure that you get some input or a definite EOF situation when you
call line(). You might have to put a while on the queue.length() <= 0
instead of an if (it depends on how the line rule is implemented).
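
Roughly something like this (untested, just to illustrate the idea; the
'done' flag is made up and would be set wherever you detect end of input):

    public Token nextToken() throws TokenStreamException {
        // keep pulling lines until something lands in the queue or we hit EOF
        while (queue.length() <= 0 && !done) {
            try {
                line();                      // fills the queue via consume()
            } catch (RecognitionException e) {
                System.err.println("line() failed: " + e);  // don't eat it silently
                done = true;                 // assume no further progress is possible
            } catch (TokenStreamException e) {
                System.err.println("token stream error: " + e);
                done = true;
            }
        }
        if (queue.length() > 0) {
            Token ret = queue.elementAt(0);
            queue.removeFirst();
            return ret;
        }
        return new Token(Token.EOF_TYPE, "");
    }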

> line:
>     (NEWLINE) => emptyLine
>     | ((FIELD | DELIM)+ NEWLINE) => contentLine
>     ;

I'd get rid of these predicates; they serve no purpose here. My rule of
thumb: when every alternative of a rule is guarded by a syntactic
predicate, you're probably doing something wrong (you end up with rules
that might not consume any input, and that's usually a recipe for
strange behaviour).

The lookahead you're testing isn't ambiguous, and on erroneous input the
rule might not consume anything at all. You're also not handling EOF:
at end of file you'll get a RecognitionException that nextToken() then
eats, silently. Try adding a few println's to the bits that swallow
exceptions; for a filter I wrote a while back, some of them needed
extra handling.

The following rule is a lot simpler. Try to differentiate between the
cases you care about with some attributes in the filter: e.g. when you
call the line rule from nextToken(), set an attribute marking that a
line has started and reset a counter for the fields, then update that
counter in the closure below. You might also want to set a flag in the
(NEWLINE | EOF) part, so you can tell how the line rule ended (and when
the next line will start!). There's a sketch of this after the two rule
variants below. I also don't see the code that inserts the marker at
the start of the line.

line:
    (FIELD | DELIM)* ( NEWLINE | EOF )
    ;

Or something like:

line
    :  FIELD (DELIM FIELD)* (DELIM)* eol
    |  (DELIM)* eol
    ;

eol:( NEWLINE | EOF );
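
As a rough sketch of the attribute/flag idea applied to the first rule
above (the member names lineStarted, fieldCount and sawEol are made up;
you'd declare them in the class preamble of the filter):

line
    :   { lineStarted = true; fieldCount = 0; sawEol = false; }
        ( FIELD { fieldCount++; } | DELIM )*
        ( NEWLINE | EOF ) { sawEol = true; }
    ;

nextToken() can then look at fieldCount and sawEol after line() returns
to decide what to append and where the next line begins.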

Also, first get the marking of the stanzas working, then add the comma
eating. When a filter starts eating input it can end up eating all of
it, and that takes some extra handling if I recall correctly. It's
probably a good idea to let the line rule finish in nextToken(), then
check the tail of the queue for trailing commas and nuke them from
there, along the lines of the sketch below.
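
Untested, and it assumes the buffered line sits in a java.util.LinkedList
of Tokens rather than the queue class above (which only exposes
removeFirst()); DELIM and NEWLINE are your generated token type constants:

    // drop extra DELIMs from the tail of the buffered line, keeping at most one
    private void trimTrailingDelims(java.util.LinkedList buf) {
        int end = buf.size();
        if (end > 0 && ((Token) buf.get(end - 1)).getType() == NEWLINE) {
            end--;                                    // look just before the NEWLINE
        }
        int firstDelim = end;
        while (firstDelim > 0 && ((Token) buf.get(firstDelim - 1)).getType() == DELIM) {
            firstDelim--;                             // walk back over the run of DELIMs
        }
        for (int i = end - 1; i > firstDelim; i--) {  // drop everything after the first DELIM of the run
            buf.remove(i);
        }
    }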

Tip: read the code generated for the line rule and get a feel for how
it interacts with your consume() and nextToken() methods. In this case
it would also be feasible to hand-code the filter, since the parsing
involved isn't that complex; see the sketch below for what that could
look like.
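
A minimal hand-coded filter could look something like this (untested
sketch; the class and field names are invented, and the two token types
are passed in rather than taken from the generated vocabulary). It
collapses a run of commas in front of a newline or EOF down to one:

    import java.util.LinkedList;
    import antlr.Token;
    import antlr.TokenStream;
    import antlr.TokenStreamException;

    public class TrailingDelimFilter implements TokenStream {
        private final TokenStream input;
        private final LinkedList pending = new LinkedList(); // tokens already decided on
        private final int DELIM;
        private final int NEWLINE;

        public TrailingDelimFilter(TokenStream input, int delimType, int newlineType) {
            this.input = input;
            this.DELIM = delimType;
            this.NEWLINE = newlineType;
        }

        public Token nextToken() throws TokenStreamException {
            if (!pending.isEmpty()) {
                return (Token) pending.removeFirst();
            }
            Token t = input.nextToken();
            if (t.getType() != DELIM) {
                return t;                    // FIELDs, NEWLINEs and EOF pass straight through
            }
            // collect the whole run of DELIMs plus the token that ends it
            int delims = 1;
            Token next = input.nextToken();
            while (next.getType() == DELIM) {
                delims++;
                next = input.nextToken();
            }
            if (next.getType() == NEWLINE || next.getType() == Token.EOF_TYPE) {
                pending.add(next);           // trailing run: keep one DELIM, then the terminator
                return t;
            }
            for (int i = 1; i < delims; i++) {
                pending.add(t);              // run in mid-line: pass it through unchanged
            }
            pending.add(next);
            return t;
        }
    }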

I'm afraid I may not be entirely coherent or clear in this post, but
there should be some tips in it that get you going again. I'll take
another look at it when I'm back home.

Cheers,

Ric

