[antlr-interest] C++ TokenStreamSelector

Thu Feb 15 06:53:37 PST 2007

Hi,

On 2/15/07, John Reid <j.reid at mail.cryst.bbk.ac.uk> wrote:
> >> What is the recommended way to flush this buffer and force re-lexing of
> >> the input stream?
> >
> >
> > There is no such mechanism. You might get something to work with very
> > creative use of mark, rewind on the buffer and adding code to
> > invalidate/reset the state of the lookahead. But this requires a
> > *very* *very* good understanding of your parser and how it parses.
> > E.g. you have to mark the input at the start of a rule if you suspect
> > that a switch might be necessary and rewind and cleanup if it fails.
> > Or unregister the mark if it was not needed (e.g. no switch needed)
> > (in short: a maintenance nightmare)
> The token stream must know what input has been consumed and what is
> pending. I can't see why it could not re-lex the pending input but I
> have to admit I don't understand the antlr internals: so I'll take your
> word for it.

The problem is that ANTLR generates a recursive descent parser, i.e. the
call stack of the various parser rules is significant and might have
to be rewound, depending on a number of factors. If you mix in guessing
mode (syntactic predicates), things get even trickier.
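
To make the buffering half of the problem concrete, here is a toy model in
plain C++. It is only a sketch: ToyLexer, ToyBuffer and demo are made-up
names for illustration, not ANTLR classes, and real ANTLR parsers keep more
state than this. It shows that a token sitting in the lookahead was lexed
before the parser had a chance to change the lexer's behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical toy lexer with two modes; not part of ANTLR.
struct ToyLexer {
    std::string input;
    std::size_t pos;
    bool dot_is_delimiter;  // true: '.' is its own token; false: part of a value

    explicit ToyLexer(const std::string& s)
        : input(s), pos(0), dot_is_delimiter(true) {}

    // Returns the next token text, or "" at end of input.
    std::string next() {
        if (pos >= input.size()) return "";
        if (dot_is_delimiter && input[pos] == '.') { ++pos; return "."; }
        std::size_t start = pos;
        while (pos < input.size() &&
               !(dot_is_delimiter && input[pos] == '.')) {
            ++pos;
        }
        return input.substr(start, pos - start);
    }
};

// One token of lookahead, fetched eagerly as an LL(1) parser would need it.
struct ToyBuffer {
    ToyLexer& lexer;
    std::string lookahead;

    explicit ToyBuffer(ToyLexer& l) : lexer(l), lookahead(l.next()) {}

    std::string consume() {
        std::string t = lookahead;
        lookahead = lexer.next();
        return t;
    }
};

// Walk through the failure case: the mode switch comes too late.
void demo() {
    ToyLexer lexer("a.b");
    ToyBuffer buffer(lexer);           // lookahead "a" is lexed here, dots delimit
    lexer.dot_is_delimiter = false;    // parser decides: dots belong to the value
    assert(buffer.consume() == "a");   // stale: lexed under the old mode
    assert(buffer.consume() == ".b");  // only the pending characters see the new mode
}
```

The value "a.b" comes out as two tokens because its first part was already
buffered. Fixing that means discarding the lookahead and moving the
character position back, and in a real parser the rules already on the
call stack made their decisions based on the stale token as well.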

> > I would not go that way unless I *really* had no other option. E.g.
> > more passes, using ASTs, maybe token stream rewriting. It depends
> > on what you want to accomplish.
>
> My parsing problem is that fields in my text file are sometimes
> delimited by '.', ':', ';', and various other tokens. In many cases
> these characters are part of the values of the fields and in other
> cases they are delimiters. I can only know which is which at
> parse time. So I thought what I was doing was the natural solution.

It may indeed feel the most intuitive, but the way antlr2 is built
makes this hard to do.

> Obviously I just misinterpreted the documentation!

I'm not sure this is really explicit in the documentation, although
I think there are some FAQ entries about it.

> Does anyone have any advice for how to approach this problem? None of
> the examples in the antlr documentation deal with this sort of grammar.

If you're using ASTs: is it an option to pass the tricky bits with
the delimiters to the parser as one big chunk, and then work out what
they are in a tree parser? (I.e. refine the generated AST in an extra
pass, the good old divide and conquer.) Alternatively, use a small
extra lexer (and maybe even a parser) to parse the chunk's string
when you encounter one, and then build an AST from it.
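
A minimal sketch of the "small extra lexer" idea, in plain C++ without any
ANTLR types. The name split_chunk is made up for illustration. It assumes
the main lexer hands the tricky region to the parser as one token, and that
the parser action knows at that point which characters are delimiters for
the field it is in.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not part of ANTLR: split the text of one big
// "chunk" token into fields. The caller supplies the delimiter set,
// because only the parser knows which characters separate fields here
// and which ones are ordinary value characters.
std::vector<std::string> split_chunk(const std::string& chunk,
                                     const std::string& delimiters)
{
    assert(!delimiters.empty());
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type pos = chunk.find_first_of(delimiters, start);
        if (pos == std::string::npos) {
            // No more delimiters: the rest of the chunk is the last field.
            fields.push_back(chunk.substr(start));
            break;
        }
        fields.push_back(chunk.substr(start, pos - start));
        start = pos + 1;  // skip the delimiter itself
    }
    return fields;
}
```

With the chunk "a.b:c.d", passing ":" as the delimiter set gives the fields
"a.b" and "c.d", while passing "." gives "a", "b:c" and "d". The resulting
strings could then be turned into AST nodes in the parser action, or the
chunk could be kept as a single node and split later in the tree parser.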

Cheers,

Ric