[antlr-interest] Re: lexer "modes" for XML parsing etc...

Sun Nov 20 11:42:32 PST 2005

On Nov 20, 2005, at 10:54 AM, Martin Probst wrote:
> Now I'm working with a manually written Lexer that follows (1), e.g.
> state switching is exclusively done by the Lexer. This works nicely,
> except that a handwritten Lexer for a lexically complex (23 states, 200
> different Token types) language is also a real pain. Slightly better as
> there are no bugs in the interop between the lexer and the parser, as
> it's only calling nextToken(), but still. This is why I'm trying to prod
> Terence into providing better support for stateful lexers ;-)

Your wish is my command.  ;)  Do we need something like

lexer grammar L;

ID : ... ;
SQLSTART : "sql(" {pushContext(SQL);} ;
WS : ... ;

context SQL {
ID : ... ;
ACTION : ...;
STRING : ... ;
ENDSQL : ')' {popContext();} ;
}

context island2 {
...
}

[note the push/pop rather than simple set; very useful]

Then the lexer would simply generate one Tokens-like rule per
context?  You'd see a different lexer entry rule for each context.
How do you switch?  We'd need an int constant per context (as we have
no function pointers in Java--a pox on their family) that would jump
to the right starting method.
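
A rough sketch of that dispatch, hand-written in Java rather than
generated.  Everything here is made up for illustration: the class name,
the "TYPE:text" token strings, the OTHER fallback, and the shortcut of
sharing ID/WS matching between contexts (the grammar above gives each
context its own rules).  It only shows the int constant selecting a
per-context entry method and the push/pop stack.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.IntPredicate;

class ContextLexerSketch {
    // One int constant per context, standing in for function pointers.
    static final int DEFAULT = 0;
    static final int SQL = 1;

    private final String input;
    private int pos = 0;
    private int context = DEFAULT;
    private final Deque<Integer> saved = new ArrayDeque<>();

    ContextLexerSketch(String input) { this.input = input; }

    // Push remembers the enclosing context so pop can return to it.
    void pushContext(int next) { saved.push(context); context = next; }
    void popContext() { context = saved.pop(); }

    // Entry point: the int constant picks the per-context starting method.
    String nextToken() {
        if (pos >= input.length()) return null;
        switch (context) {
            case SQL: return sqlTokens();
            default:  return defaultTokens();
        }
    }

    private String defaultTokens() {
        if (input.startsWith("sql(", pos)) {  // tried before ID on purpose
            pos += 4;
            pushContext(SQL);
            return "SQLSTART:sql(";
        }
        return common();
    }

    private String sqlTokens() {
        char c = input.charAt(pos);
        if (c == ')') { pos++; popContext(); return "ENDSQL:)"; }
        if (c == '\'') {
            int end = input.indexOf('\'', pos + 1);
            if (end < 0) end = input.length() - 1;  // unterminated: take the rest
            String text = input.substring(pos, end + 1);
            pos = end + 1;
            return "STRING:" + text;
        }
        return common();
    }

    // Simplification: both contexts share ID, WS and a one-char fallback.
    private String common() {
        char c = input.charAt(pos);
        if (Character.isLetter(c)) return "ID:" + scanWhile(Character::isLetter);
        if (Character.isWhitespace(c)) return "WS:" + scanWhile(Character::isWhitespace);
        pos++;
        return "OTHER:" + c;
    }

    private String scanWhile(IntPredicate p) {
        int start = pos;
        while (pos < input.length() && p.test(input.charAt(pos))) pos++;
        return input.substring(start, pos);
    }

    static List<String> tokenize(String text) {
        ContextLexerSketch lexer = new ContextLexerSketch(text);
        List<String> out = new ArrayList<>();
        for (String t = lexer.nextToken(); t != null; t = lexer.nextToken()) out.add(t);
        return out;
    }
}
```

With this, a ')' inside the sql(...) island comes back as ENDSQL and
pops the context, a ')' inside a quoted STRING is swallowed by the
string, and a ')' outside the island is just OTHER.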

Sounds easy.  Is this what we want?  It is proper for island grammars  
that feed off the same input stream.  Multiple input streams like  
include files need to be handled with a multiplexing input buffer.
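
For the include-file case, one way to picture a multiplexing input
buffer is a stack of character sources: an include pushes a new source,
and hitting end-of-file on it resumes the one underneath.  This is only
a sketch under that assumption; MultiplexingInput and include() are
hypothetical names, not an existing ANTLR class.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

class MultiplexingInput {
    // Top of the stack is the stream currently being read.
    private final Deque<Reader> streams = new ArrayDeque<>();

    MultiplexingInput(Reader main) { streams.push(main); }

    // Called when the lexer sees an include directive.
    void include(Reader included) { streams.push(included); }

    // Returns the next char, or -1 once every stream is exhausted.
    int read() {
        try {
            while (!streams.isEmpty()) {
                int c = streams.peek().read();
                if (c != -1) return c;
                streams.pop().close();  // finished: resume the enclosing stream
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The lexer keeps calling read() and never needs to know which file the
characters came from, which is the difference from the island case
where the context changes but the stream stays the same.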

> Solving (2) would probably include identifying the sections where
> different tokens are possible depending on the lookahead decision,
> marking the character(!) stream and re-lexing the token(s) in the case
> of mismatches. That is IMHO complete overkill. It should be possible to
> pull down the rules about states etc. into the Lexer with any sane
> language.

Agreed.  That is really hard.

ter