[antlr-interest] Context-sensitive lexing

Mon Nov 19 00:41:32 PST 2007

At 20:52 19/11/2007, Steve Bennett wrote:
 >4) tokenize magicwords but feed them back into the general letters
 >pool whenever they're not needed:  letters: ('a'..'z' | MAGIC)+;

That's usually the approach I use.  Although not quite like that :)

 >I've tried 1, 3 and 4 and they all work. However, 3 and 4 have
 >major impacts on how the rest of the grammar will be shaped, I
 >think. Also 4 has the odd behaviour of generating nodes with
 >clumps of tokens: "magicword" will get lexed as "magic" and
 >"word" then parsed as MAGIC+'w'+'o'+'r'+'d'.

If you do it like you've got above, yeah.  But you can still 
combine sequences of characters that don't happen to be magic:

MAGIC: 'magic';
TEXT: ('a'..'z')+;
text: (TEXT | MAGIC)+;

You'll get MAGIC('magic'),TEXT('word').

I usually prefer to put literal tokens (like MAGIC) in a tokens 
block, though.  Makes 'em easier to find.  And I think you get an 
ambiguity warning if you don't (though it'll still do the right 
thing).
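
For illustration, the tokens-block variant might look roughly like this.  It's only a sketch, assuming ANTLR 3 syntax; the grammar name Magic is made up for the example:

grammar Magic;

// literal tokens collected in one place
tokens {
    MAGIC = 'magic';
}

// parser rule: accepts any mix of plain text and magic words
text: (TEXT | MAGIC)+;

// lexer rule: runs of lowercase letters
TEXT: ('a'..'z')+;

The parser rule stays the same as before; only the definition of MAGIC moves out of the lexer rules and into the tokens block.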

 >How does one decide what method is the best?

Personal taste, mostly :)