[antlr-interest] Context-sensitive lexing

Steve Bennett stevagewp at gmail.com
Sun Nov 18 23:52:52 PST 2007


On 11/19/07, Gavin Lambert <antlr at mirality.co.nz> wrote:
> OK, have a look at the "island grammars" example.  And remember
> that since lexing occurs before parsing, you can't use any parser
> context to influence this changeover.

Yeah, I just found a good, short example at the URL I referred to in
my last email. The question will be whether I can be certain of the
context from lexing alone. It could be tough in some cases. But it
would certainly be a lot nicer to parse magic words as actual tokens
rather than as generic strings that happen to contain particular text.
More readable, etc.

> Not necessarily.  You can tokenise them as barebones (eg. PIPE and
> HYPHEN) and then figure out whether it means something special in
> the parser.  You'll need to be careful though if you're creating
> any hidden or off-channel tokens (eg. comments or whitespace),
> since the parser will ignore them and happily treat "| -" exactly
> the same as "|-" (if you're hiding whitespace).  So you'll either
> need to avoid hiding things or create separate tokens for your
> composites (eg. PIPEHYPHEN), which will look a bit messier.

Yeah, I haven't worked through the implications of hidden channels
yet. I think it's unlikely I'll be using them - the grammar I'm
working with is so finicky that it seems a lot safer to spell out
PIPE ws LETTERS ws DOT ws etc. than to blithely accept whitespace
everywhere.
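
Roughly what I have in mind - keep whitespace as a visible token
instead of hiding it, so the parser can still tell "|-" from "| -"
(just a sketch; the token and rule names here are made up, not from
any real grammar):

```antlr
// Lexer: whitespace stays on the default channel
// rather than being hidden, so the parser sees it.
PIPE    : '|' ;
HYPHEN  : '-' ;
LETTERS : ('a'..'z')+ ;
WS      : (' ' | '\t')+ ;

// Parser: spell out exactly where whitespace is allowed.
rowStart : PIPE HYPHEN ;          // matches only the literal "|-"
cell     : PIPE WS? LETTERS ;     // whitespace permitted here
```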

Naively, I had assumed there would pretty much be one way to write a
given grammar. I'm now discovering in this case there are at least
four really different ways:

1) use semantic predicates to look at the text of a generic token
2) tokenize individual characters, and make the magic words sequences
of characters: magic: 'm' 'a' 'g' 'i' 'c'...
3) use semantic predicates in the lexer (are they called that, or
something different?) to switch lexer rules on and off, as discussed
above
4) tokenize magic words but feed them back into the general letters
pool whenever they're not needed:  letters: ('a'..'z' | MAGIC)+;
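
In case 3 and 4 aren't clear, roughly what I mean in ANTLR 3 syntax
(the flag, delimiters and rule names are invented for illustration):

```antlr
// 3: a flag toggled by lexer actions, with a gated semantic
// predicate switching the magic-word rule on and off.
@lexer::members { boolean magicOn = false; }

BEGIN_MAGIC : '{{' { magicOn = true; } ;
END_MAGIC   : '}}' { magicOn = false; } ;
MAGIC       : {magicOn}?=> 'magic' ;
LETTER      : 'a'..'z' ;

// 4: always tokenize MAGIC, but let the parser fold it back
// into ordinary letter runs wherever it isn't special.
letters : (LETTER | MAGIC)+ ;
```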

I've tried 1, 3 and 4, and they all work. However, 3 and 4 have major
impacts on how the rest of the grammar is shaped, I think. Also, 4
has the odd behaviour of generating nodes with clumps of tokens:
"magicword" will get lexed as "magic" and "word", then parsed as
MAGIC+'w'+'o'+'r'+'d'.

1, at least, is totally independent of everything else, I think.
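
For completeness, roughly what I mean by 1 - a validating predicate
on a generic token's text, which touches nothing else in the grammar
(names are again just illustrative):

```antlr
// 1: lex everything as generic WORDs, then check the text
// in the parser with a semantic predicate (ANTLR 3 style).
WORD : ('a'..'z')+ ;

magicWord : {input.LT(1).getText().equals("magic")}? WORD ;
```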

How does one decide what method is the best?

Steve
