[antlr-interest] Re: lexer "modes" for XML parsing etc...

Sun Nov 20 10:54:24 PST 2005

On Sunday, 20.11.2005, at 19:38 +0100, Oliver Zeigermann wrote:
> Yes, (2) is what I was wondering about when I said "dynamic". Sounds
> interesting...
> 
> Martin, how did you solve (2)?

I didn't. I ran into bugs with a Lexer that is controlled by the Parser
(caused by predicates rather than lookahead, but the underlying problem
is the same) and ended up with a mixture where some states were switched
by the Lexer and some by the Parser. That worked, but was really ugly
and hard to maintain.

Now I'm working with a manually written Lexer that follows (1), i.e.
state switching is done exclusively by the Lexer. This works nicely,
except that a handwritten Lexer for a lexically complex language (23
states, 200 different Token types) is also a real pain. It is slightly
better, as there are no bugs in the interop between the Lexer and the
Parser (the Parser only calls nextToken()), but still. This is why I'm
trying to prod Terence into providing better support for stateful
lexers ;-)
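
To make (1) concrete, here is a minimal sketch of a hand-written lexer
that does all of its own state switching. It is not ANTLR-generated code
and not my actual lexer: the class name ModalLexer, the two modes and
the five token types are assumptions chosen for a toy XML-like input.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical two-mode lexer for an XML-like input; all names are illustrative. */
final class ModalLexer {
    enum Mode { CONTENT, TAG }
    enum Type { TEXT, OPEN, CLOSE, NAME, EOF }

    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type + "(" + text + ")"; }
    }

    private final String input;
    private int pos = 0;
    private Mode mode = Mode.CONTENT; // only the lexer ever changes this

    ModalLexer(String input) { this.input = input; }

    /** The parser's only entry point; it never sees or sets the mode. */
    Token nextToken() {
        if (pos >= input.length()) return new Token(Type.EOF, "");
        return mode == Mode.CONTENT ? lexContent() : lexTag();
    }

    private Token lexContent() {
        if (input.charAt(pos) == '<') {
            pos++;
            mode = Mode.TAG; // '<' switches to tag mode
            return new Token(Type.OPEN, "<");
        }
        int start = pos;
        while (pos < input.length() && input.charAt(pos) != '<') pos++;
        return new Token(Type.TEXT, input.substring(start, pos));
    }

    private Token lexTag() {
        // Whitespace is insignificant inside a tag, so skip it.
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
        if (pos >= input.length()) return new Token(Type.EOF, "");
        if (input.charAt(pos) == '>') {
            pos++;
            mode = Mode.CONTENT; // '>' switches back to content mode
            return new Token(Type.CLOSE, ">");
        }
        int start = pos;
        while (pos < input.length() && input.charAt(pos) != '>'
                && !Character.isWhitespace(input.charAt(pos))) pos++;
        return new Token(Type.NAME, input.substring(start, pos));
    }

    /** Convenience helper: lex the whole input and return printable tokens. */
    static List<String> tokenize(String s) {
        ModalLexer lexer = new ModalLexer(s);
        List<String> out = new ArrayList<>();
        for (Token t = lexer.nextToken(); t.type != Type.EOF; t = lexer.nextToken()) {
            out.add(t.toString());
        }
        return out;
    }
}
```

For the input `<a>a b</a>` the letter "a" comes out as a NAME inside the
tag and as part of a TEXT token outside it, without the parser being
involved. With two modes this is trivial; with 23 it is the pain
described above.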

Solving (2) would probably involve identifying the sections where
different tokens are possible depending on the lookahead decision,
marking the character(!) stream and re-lexing the token(s) in the case
of mismatches. That is IMHO complete overkill. It should be possible to
pull the rules about states etc. down into the Lexer for any sane
language.
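
For illustration only, the mark-and-re-lex idea could look roughly like
the sketch below. Nothing here is ANTLR API: RelexingLexer, mark(),
rewind() and lexWithCorrection() are hypothetical names, and tokens are
plain strings to keep it short. The point is that the marker has to be a
position in the character stream, because the token lexed for lookahead
may have the wrong boundaries for the mode the parser ends up in.

```java
/** Hypothetical parser-controlled lexer that can rewind the character stream. */
final class RelexingLexer {
    enum Mode { CONTENT, TAG }

    private final String input;
    private int pos = 0;

    RelexingLexer(String input) { this.input = input; }

    /** Remember the current character position. */
    int mark() { return pos; }

    /** Go back to a previously marked character position. */
    void rewind(int marker) { pos = marker; }

    /** The caller (the parser) decides which mode the next token is lexed in. */
    String nextToken(Mode mode) {
        if (pos >= input.length()) return "EOF";
        int start = pos;
        if (mode == Mode.CONTENT) {
            // Content mode: everything up to the next '<' is one TEXT token.
            while (pos < input.length() && input.charAt(pos) != '<') pos++;
            if (pos == start) { pos++; return "OPEN(<)"; }
            return "TEXT(" + input.substring(start, pos) + ")";
        }
        // Tag mode: whitespace-separated names.
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
        if (pos >= input.length()) return "EOF";
        start = pos;
        while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) pos++;
        return "NAME(" + input.substring(start, pos) + ")";
    }

    /**
     * Lex one token speculatively in mode `guess`. If the parser later
     * settles on a different mode `actual`, rewind the characters and
     * lex the same span again.
     */
    String lexWithCorrection(Mode guess, Mode actual) {
        int marker = mark();
        String token = nextToken(guess);
        if (guess != actual) {
            rewind(marker);
            token = nextToken(actual);
        }
        return token;
    }
}
```

On the input `id x`, a lookahead lexed in CONTENT mode swallows the whole
string as one TEXT token, while re-lexing in TAG mode yields only the
NAME `id`. Doing this for every lookahead decision in a real grammar is
where the cost and the complexity come from.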

Martin

> >      2. the Lexing is controlled by the Parser. In this case the Parser
> >         tells the Lexer that the next token must be of a specific set.
> >         I've done that and it leads to big problems with lookahead. This
> >         can probably only be fixed by re-lexing the Tokens each time,
> >         either generally, or just if the Parser knows it's running into
> >         different lexing rules.
> >
> > The second will either give you a big performance hit or be very
> > complicated to implement in a general way with ANTLR, I guess.
> >
> > I think the first case can easily be solved in ANTLR, see the other
> > discussion we had. Support for this in ANTLR would be nice, as it's
> > really a mess to do that manually if it's more than just a single
> > boolean flag.
> >
> > Martin
> >
> >
>