[antlr-interest] Too many uses for escape character giving me lexer troubles.

Tue Mar 13 18:51:59 PDT 2007

I'm using ANTLR v3 (and quite liking it).

In my language (http://nolatte.sf.net/), the backslash character is the
escape character, and it gets used for (at least) two different tasks.
Here's a stripped down grammar:

atom		:  WORD | IDENTIFIER ;
WORD		:  ( ('a'..'z') | ( '\\' '{' ) )+ ;
IDENTIFIER	:   '\\' ('a'..'z')+ ;

The key is that the backslash gets used for two purposes: as a real
escape character (to escape '{' in a WORD) and as the beginning of an
IDENTIFIER.  The problem comes in when my grammar tries to scan and/or
parse something like this:

  abc\xyz

This should be two tokens: a WORD "abc" and an IDENTIFIER "\xyz".
However, since a backslash is allowed inside a WORD, the lexer consumes
it as part of the WORD and then fails when it sees 'x' instead of '{'.

In the ANTLR v2 version of my grammar, I made "\{" a WORD on its own:

WORD	:  ( ('a'..'z')+ | ( '\\' '{' ) )  ;

However, then something like "abc\{\{\{xyz" turns into five WORD tokens;
my goal is to return just one WORD token in this case.  On the other
hand, "abc\xyz" scans just fine.
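
To make the desired tokenization concrete, here is a minimal hand-written
sketch in Python. It is not ANTLR and not part of the grammar; the name
`tokenize` and the (kind, lexeme) output format are assumptions made only
for illustration. The point it shows is that a backslash needs one more
character of lookahead before the scanner decides whether to stay in the
WORD or start an IDENTIFIER.

```python
def is_letter(ch):
    """True for the lowercase ASCII letters 'a'..'z' used in the grammar."""
    return "a" <= ch <= "z"


def tokenize(text):
    """Split text into (kind, lexeme) pairs.

    Hypothetical helper, not ANTLR output: on a backslash it peeks at the
    next character to choose between an escaped '{' and an IDENTIFIER.
    """
    tokens = []
    i = 0
    n = len(text)
    while i < n:
        start = i
        nxt = text[i + 1] if i + 1 < n else ""
        if text[i] == "\\" and is_letter(nxt):
            # IDENTIFIER: backslash followed by one or more letters.
            i += 1
            while i < n and is_letter(text[i]):
                i += 1
            tokens.append(("IDENTIFIER", text[start:i]))
        elif is_letter(text[i]) or (text[i] == "\\" and nxt == "{"):
            # WORD: any mix of letters and escaped '{', one or more times.
            while i < n:
                if is_letter(text[i]):
                    i += 1
                elif text[i] == "\\" and i + 1 < n and text[i + 1] == "{":
                    i += 2
                else:
                    # A backslash not followed by '{' ends the WORD.
                    break
            tokens.append(("WORD", text[start:i]))
        else:
            raise ValueError("unexpected character at position %d" % i)
    return tokens
```

With this sketch, "abc\xyz" comes out as a WORD followed by an IDENTIFIER,
and "abc\{\{\{xyz" comes out as a single WORD.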

Is there a slick solution in ANTLR v3 so that multiple escaped '{'s can
appear in a single WORD and an IDENTIFIER can still follow immediately
after a WORD?

jdf

P.S. The refactoring tools in ANTLRWorks were very helpful in scaling
down my language to fit in an email message.  Keep up the great work!

--
* Jeremy D. Frens * Professor, Computer Science * jdfrens at calvin.edu *
  ``In thirty seconds, you'll be dead, and I'll blow this
    place up, and I'll be home in time for corn flakes!''
                              -- Cohaagen, _Total_Recall_
