[antlr-interest] "Contextual lexing": Ideas on "nested parsers" and "lexing parsers."

Steve Bennett stevagewp at gmail.com
Tue Nov 27 06:52:57 PST 2007


On 11/27/07, Harald Mueller <harald_m_mueller at gmx.de> wrote:
> In ANTLR 3, we do not have multi-lexer streams (as far as I know; the main reason is that lexing happens - at least conceptually - in one big swoosh before parsing starts). However, in many cases, it is still possible to use a solution akin to the ANTLR 2 version as follows: Very often, when a document is in some "outer language," which contains segments written in some other language ("inner language"), the segments can be read as complete tokens in the outer language.
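
If I follow that, you mean something like a single outer-lexer rule
that swallows the whole embedded segment in one go - a rough sketch
(the rule name and the <? ... ?> delimiters here are just made up for
illustration):

// Outer-language lexer: the whole inner-language segment becomes one
// token; its text can be handed to a separate inner lexer/parser later.
EMBEDDED_SEGMENT
    : '<?' ( options {greedy=false;} : . )* '?>'
    ;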

I think I would understand ANTLR a lot better if I knew its history a
bit better :) To me, the "swoosh" approach feels like a limitation,
but it sounds like it's actually an improvement on what came before?
I keep wishing the lexer were an on-demand stream that could be told
to re-lex tokens when required.

For example, in the wiki text I'm parsing, ]] is virtually always the
end of an internal link - a single token. But occasionally it's really
two separate brackets: one closing an external link, followed by a
literal one. [[http://foo.com]] is dopey code, but it's technically
valid - an external link wrapped in literal square brackets.
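
Just to spell out the two readings of that trailing ]] (these are the
tokens I'd like, not what my grammar currently produces):

  ...some link text]]    ->   ']]'        one token (the usual case)
  ...http://foo.com]]    ->   ']'  ']'    external link end, then a
                                          literal closing bracket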

So it would be convenient to treat ]] as a single token, but in those
rare cases tell the lexer to throw out the tokens it has generated,
re-read just a single ], and then carry on as normal. But if all the
lexing happens in one "swoosh" before the parser runs, that's
obviously not possible.

As it is, I tend to use a solution a bit like you suggest:

RIGHT_BRACKET: ']';
external_link_end: RIGHT_BRACKET;
internal_link_end: RIGHT_BRACKET RIGHT_BRACKET;

But it's obviously not perfect: conceptually the token is ']]', not
']' followed by ']'.
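
One idea I've been toying with (a completely untested sketch - I'm
assuming here that the ANTLR 3 Java Lexer really does let you override
emit()/nextToken() and queue up tokens like this) is to let the lexer
match ']]' as one rule, but hand the parser two RIGHT_BRACKET tokens:

@lexer::members {
  // Queue so that a single lexer rule can emit more than one token.
  java.util.List<Token> pending = new java.util.LinkedList<Token>();

  public void emit(Token token) {
    state.token = token;
    pending.add(token);
  }

  public Token nextToken() {
    super.nextToken();
    if ( pending.isEmpty() ) {
      return Token.EOF_TOKEN;
    }
    return pending.remove(0);
  }
}

RIGHT_BRACKET : ']' ;

DOUBLE_RIGHT_BRACKET
    : ']' ']'
      {
        // The lexer still "sees" ']]' as one unit, but the parser just
        // gets two RIGHT_BRACKETs, so the rules above keep working.
        // (These hand-built tokens carry no line/position info - fine
        // for a sketch, not for real error messages.)
        emit(new CommonToken(RIGHT_BRACKET, "]"));
        emit(new CommonToken(RIGHT_BRACKET, "]"));
      }
    ;

Of course the parser ends up seeing exactly the same token sequence as
with the plain RIGHT_BRACKET-only grammar, so this is mostly cosmetic:
it just moves the ']]'-spotting back into the lexer, which is where it
feels like it belongs.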

Steve

