[antlr-interest] Context-sensitive lexing

Steve Bennett stevagewp at gmail.com
Sun Nov 18 18:55:52 PST 2007


Hi all,
  What's the general solution when you need to switch lexers
midstream? In the classic C case, for example, an asm {...} block
lexes and parses differently from normal code. A "mov" would be a
special token inside the asm block, but would be nothing in particular
outside it.

How can you handle this? I seem to recall there is a way to switch
individual lexer rules on and off dynamically (like a predicate), but
can't find it in the book. Even if there is, what do you do if you
need information at the parser level to recognise the situation?

I ask because the wikitext I'm parsing has a wide range of vocabulary
depending on the context. In normal text, almost anything goes. In an
image tag, lots of words have special meanings. In a table, suddenly
|- is a special token. In a template call, | is special. If I can't
actually tokenise any of these things (because they don't have meaning
everywhere), I seem to be back to testing regular expressions on
input.LT(1).getText() ?
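
To make that fallback concrete, here is a rough sketch in Python (not
ANTLR code; the names generic_tokens, SPECIAL and is_special are made up
for illustration, and the word lists are only examples). The lexer emits
context-free generic tokens, and the parser side decides from the token
text whether something is special in the current context:

```python
import re

# Hypothetical generic tokenizer: whitespace runs, words, and single
# punctuation characters. It knows nothing about tables or templates.
GENERIC = re.compile(r"\s+|\w+|[^\w\s]")

def generic_tokens(text):
    """Split text into context-free tokens (plain strings)."""
    return GENERIC.findall(text)

# Hypothetical per-context patterns for text that counts as "special".
SPECIAL = {
    "text": re.compile(r"(?!)"),                    # never matches: anything goes
    "template": re.compile(r"\|"),                  # | separates arguments
    "image": re.compile(r"\||thumb|left|right"),    # example keywords only
}

def is_special(token_text, context):
    """Parser-side check, analogous to testing input.LT(1).getText()."""
    return SPECIAL[context].fullmatch(token_text) is not None
```

Note that a two-character marker such as |- comes out of this tokenizer as
two separate tokens, so the parser would have to glue it back together,
which is part of why this approach feels like a step backwards.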

I gather that most programming languages don't have this drama,
because there are generally two lexing situations: normal text, where
{ and -> are special tokens, or strings/comments, where /* blah -> {
blah */ is treated as a single token. But what would you do if you
wanted to actually parse the contents of that comment, rather than
making it a monolithic token?

It would be really nice to simply have two separate lexers, and switch
between them as needed. Is this possible?
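
Roughly what I have in mind, as a toy Python sketch rather than anything
ANTLR provides (the rule names, the two modes and the tokenize function
are all assumptions for illustration): two rule sets share one input
position, and a small driver switches between them when it sees asm {
and the closing }.

```python
import re

# Hypothetical rule sets, one per "lexer". Each rule is (token type, regex);
# earlier rules win, so keywords are listed before the generic ID rule.
RULES = {
    "code": [
        ("ASM", r"asm\b"),
        ("LBRACE", r"\{"),
        ("RBRACE", r"\}"),
        ("ID", r"[A-Za-z_]\w*"),
        ("NUM", r"\d+"),
        ("PUNCT", r"[^\w\s]"),
    ],
    "asm": [
        ("RBRACE", r"\}"),
        ("MOV", r"mov\b"),
        ("REG", r"[a-d]x\b"),
        ("COMMA", r","),
        ("NUM", r"\d+"),
        ("ID", r"[A-Za-z_]\w*"),
    ],
}
COMPILED = {mode: [(t, re.compile(p)) for t, p in rules]
            for mode, rules in RULES.items()}
WS = re.compile(r"\s+")

def tokenize(text):
    """Lex text, switching rule sets on 'asm {' and on the closing '}'."""
    mode, pos, out = "code", 0, []
    saw_asm = False  # True only when the previous token was the asm keyword
    while pos < len(text):
        ws = WS.match(text, pos)
        if ws:  # skip whitespace in both modes
            pos = ws.end()
            continue
        for ttype, rx in COMPILED[mode]:
            m = rx.match(text, pos)
            if m:
                break
        else:
            raise ValueError(f"no rule matches at offset {pos} in mode {mode}")
        out.append((ttype, m.group()))
        pos = m.end()
        if mode == "code":
            if ttype == "LBRACE" and saw_asm:
                mode = "asm"  # the next token is read by the asm rule set
            saw_asm = ttype == "ASM"
        elif ttype == "RBRACE":
            mode = "code"  # no nesting: the first } ends the asm block
    return out
```

With this, tokenize("mov = 1; asm { mov ax, 1 } mov") gives "mov" the type
ID outside the block and MOV inside it. Here the lexer makes the switch
on its own; if the decision needed parser-level knowledge, the parser
would have to set the mode before the next token is read, which is the
part I don't know how to do.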

Thanks all for any suggestions,
Steve

