[antlr-interest] Too many uses for escape character giving me lexer troubles.

Jeremy D. Frens jdfrens at calvin.edu
Mon Mar 26 17:54:38 PDT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Eric Deplagne wrote:
> On Wed, 14 Mar 2007 21:37:07 -0400, Jeremy D. Frens wrote:
>> Terence Parr wrote:
>>> On Mar 13, 2007, at 6:51 PM, Jeremy D. Frens wrote:
>>>> In my language (http://nolatte.sf.net/), the backslash character is the
>>>> escape character, and it gets used for (at least) two different tasks.
>>>> Here's a stripped down grammar:
>>>>
>>>> atom        :  WORD | IDENTIFIER ;
>>>> WORD        :  ( ('a'..'z') | ( '\\' '{' ) )+ ;
>>>> IDENTIFIER    :   '\\' ('a'..'z')+ ;
>>>>
>>>> The key is that the backslash gets used for two purposes: as a real
>>>> escape character (to escape '{' in a WORD) and as the beginning of an
>>>> IDENTIFIER.  The problem comes in when my grammar tries to scan and/or
>>>> parse something like this:
>>>>
>>>>   abc\xyz
>>>>
>>>> This should be two tokens: a WORD "abc" and an IDENTIFIER "\xyz".
>>>> However, since the backslash is allowed at all in a WORD, the lexer
>>>> consumes it, and then it gets confused by the 'x'.
>>> try putting ID before WORD
>> Same problem.  Three more observations:
> 
>   I would simply not do that at lexer level.
> 
>   What would the following give ?:
> 
>     atom : word | identifier;
>     word : ( LOWCASE | BACKSLASH OBRACE )+;
>     identifier : BACKSLASH LOWCASE+;
>     BACKSLASH : '\\';
>     OBRACE : '{';
>     LOWCASE : 'a'..'z';
> 

In case anyone cares, I opted for this approach.  "word" is recognized
by the parser, which looks for SYLLABLE tokens; a SYLLABLE might be raw
text, an escaped curly brace, or some other strange metacharacter in the
language.

word	:	syllable+
		-> ^(WORD syllable+)
	;
fragment
syllable
	:	( SYLLABLE | EQUALS | AMPERSAND )
	;

A SYLLABLE token is either a run of "raw text" *xor* a single escape
sequence: "\{", "\}", or "\\".  (Check out http://tinyurl.com/3c8dbl ,
if you're *really* interested in what the grammar looks like.  It's the
whole grammar file as found in my SVN repository.)
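
For readers who want to experiment with the idea outside ANTLR, here is a
rough Python sketch of the same split: the lexer only emits small tokens, and
a later pass glues adjacent syllables into a WORD.  It is not taken from the
grammar file; the patterns are assumptions based on the stripped-down rules
quoted above (raw text is lowercase letters only, an IDENTIFIER is a backslash
followed by lowercase letters), and the names tokenize and group_atoms are
made up for illustration.

```python
import re

# Hypothetical token patterns, tried in order at each position.
# The escape pattern comes first so "\{" is never mistaken for the start
# of an identifier; "\x..." falls through to IDENTIFIER.
TOKEN_SPEC = [
    ("SYLLABLE", re.compile(r"\\[{}\\]")),    # one escape: \{  \}  \\
    ("IDENTIFIER", re.compile(r"\\[a-z]+")),  # backslash + lowercase letters
    ("SYLLABLE", re.compile(r"[a-z]+")),      # raw text (assumed lowercase only)
    ("EQUALS", re.compile(r"=")),
    ("AMPERSAND", re.compile(r"&")),
]

# Token types that the parser-level "syllable" rule accepts.
WORD_PARTS = {"SYLLABLE", "EQUALS", "AMPERSAND"}


def tokenize(text):
    """Split text into (type, lexeme) pairs; no token mixes raw text and escapes."""
    tokens = []
    pos = 0
    while pos < len(text):
        for kind, pattern in TOKEN_SPEC:
            match = pattern.match(text, pos)
            if match:
                tokens.append((kind, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
    return tokens


def group_atoms(tokens):
    """Mimic 'word : syllable+' by merging adjacent word parts into one WORD."""
    atoms = []
    for kind, lexeme in tokens:
        if kind in WORD_PARTS:
            if atoms and atoms[-1][0] == "WORD":
                atoms[-1][1].append(lexeme)
            else:
                atoms.append(("WORD", [lexeme]))
        else:
            atoms.append((kind, lexeme))
    return atoms
```

With these assumed patterns, group_atoms(tokenize(r"abc\xyz")) gives a WORD
holding "abc" followed by an IDENTIFIER "\xyz", because the backslash is never
consumed as part of a raw-text token.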

I decided *not* to go with a lexer solution that resets the token type
because it just didn't feel as clean to me.

jdf

- --
* Jeremy D. Frens * Professor, Computer Science * jdfrens at calvin.edu *
         ``I would put exclamation points at the end of each
           of these sentences!  This one!  And that one!''
                          -- Elaine, _Seinfeld_

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGCGtOOcBu2deY79IRAlRgAJ9HnPIAoIazgExfehrXVj4QpwzeswCfRMfz
qogWzzfD5Szg0vkk09rTJcc=
=85qZ
-----END PGP SIGNATURE-----

