[antlr-interest] Context-sensitive lexing

Sun Nov 18 18:55:52 PST 2007

Hi all,
  What's the general solution when you need to switch lexers
midstream? In the classic C case, for example, an asm {...} block
lexes and parses differently from normal code. A "mov" would be a
special token inside the asm block, but would be nothing in particular
outside it.

How can you handle this? I seem to recall there is a way to switch
individual lexer rules on and off dynamically (like a predicate), but
I can't find it in the book. Even if there is, what do you do if you
need information from the parser to recognise the situation?

I ask because the wikitext I'm parsing has a wide range of vocabulary
depending on the context. In normal text, almost anything goes. In an
image tag, lots of words have special meanings. In a table, suddenly
|- is a special token. In a template call, | is special. If I can't
actually tokenise any of these things (because they don't have meaning
everywhere), am I back to testing regular expressions on
input.LT(1).getText()?

I gather that most programming languages don't have this drama,
because there are generally two lexing situations: normal text, where
{ and -> are special tokens, or strings/comments, where /* blah -> {
blah */ is treated as a single token. But what would you do if you
wanted to actually parse the contents of that comment, rather than
making it a monolithic token?

It would be really nice to simply have two separate lexers, and switch
between them as needed. Is this possible?
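
To make the idea concrete, here is a rough sketch of the behaviour I'm
after, written in plain Python rather than ANTLR. It is not ANTLR's API
and not a real wikitext grammar: the mode names ("text", "table"), the
token names and the rule table are all made up for illustration. Each
mode has its own rule set, and a matched token can push or pop a mode,
so |- only becomes a token once {| has been seen.

```python
import re

# Hypothetical per-mode rules: (token name, regex, mode action).
# The action is None, ("push", mode_name) or ("pop",).
MODES = {
    "text": [
        ("TABLE_OPEN", r"\{\|", ("push", "table")),
        # Anything up to (but not including) the next "{|" is plain text.
        ("TEXT", r"(?:(?!\{\|).)+", None),
    ],
    "table": [
        ("TABLE_CLOSE", r"\|\}", ("pop",)),
        ("ROW_SEP", r"\|-", None),
        ("CELL_SEP", r"\|", None),
        ("WS", r"\s+", None),
        ("TEXT", r"[^|\s]+", None),
    ],
}

COMPILED = {
    mode: [(name, re.compile(pattern, re.S), action)
           for name, pattern, action in rules]
    for mode, rules in MODES.items()
}


def tokenize(source):
    """Return (token name, text) pairs, switching rule sets via a mode stack."""
    stack = ["text"]
    pos = 0
    tokens = []
    while pos < len(source):
        # Try the current mode's rules in order; the first match wins.
        for name, pattern, action in COMPILED[stack[-1]]:
            match = pattern.match(source, pos)
            if match:
                break
        else:
            raise ValueError("no rule matches at position %d" % pos)
        if name != "WS":  # whitespace inside a table is skipped
            tokens.append((name, match.group()))
        pos = match.end()
        if action is not None:
            if action[0] == "push":
                stack.append(action[1])
            else:
                stack.pop()
    return tokens


if __name__ == "__main__":
    for token in tokenize("a |- b {| x |- y | z |} c |"):
        print(token)
```

Note that the "|-" and "|" outside the table come out as part of an
ordinary TEXT token, while the same characters inside {| ... |} become
ROW_SEP and CELL_SEP. This only covers the case where the lexer can
spot the switch by itself; it does not address the case where the
parser has to tell the lexer which mode it is in.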

Thanks all for any suggestions,
Steve