[antlr-interest] Match any unicode character

Harald Mueller harald_m_mueller at gmx.de
Sun Nov 25 01:43:28 PST 2007


Tokenization ("lexing") happens before, and without knowledge of, syntax analysis ("parsing"). So there is no direct way for the parser to tell the lexer "now behave differently". There are two solutions that I have used:

(a) Do part of the tokenization in the parser. Your lexer has a restricted view of words:

WORD : ~('\r' | '\n' | WS | '[' | '!' | ...)+;

or whatever you need to make your lexer deterministic. Maybe you can write this as 

SPECIAL : '[' | '!' | ...;
WORD : ~('\r' | '\n' | WS | SPECIAL)+;

Your parser has extended words:

word: WORD | SPECIAL;

(BTW, Microsoft's idea of extending the C# language with new "identifier keywords" (e.g. "value" in C# 1.x; "yield" in C# 2.x; "select" in C# 3.x) requires this kind of "parser-level token recognition".)
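
To make approach (a) concrete outside of ANTLR, here is a minimal Python sketch of the same split. It is only an illustration, not ANTLR output: the names tokenize and parse_after_marker are made up for this example, and the SPECIAL set is assumed to be just '[' and '!' (the real list is whatever your grammar needs). The lexer always emits '[' and '!' as SPECIAL, and the parser-level rule "word: WORD | SPECIAL" puts them back into the text after the leading marker.

```python
import re

# Token names mirror the grammar fragments above; the SPECIAL set is an assumption.
TOKEN_SPEC = [
    ("SPECIAL", r"[\[!]"),
    ("WS", r"[ \t\r\n]"),
    ("WORD", r"[^\[! \t\r\n]+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Context-free lexer: '[' and '!' are always SPECIAL, never part of a WORD."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


def parse_after_marker(tokens, marker):
    """Parser rule:  marker (word | WS)+   with   word : WORD | SPECIAL."""
    if not tokens or tokens[0] != ("SPECIAL", marker):
        raise SyntaxError(f"expected {marker!r} at start of input")
    rest = tokens[1:]
    if not rest:
        raise SyntaxError("expected at least one word or whitespace")
    # Every remaining token kind (WORD, SPECIAL, WS) is accepted here:
    # the parser, not the lexer, widens what counts as a word.
    return "".join(text for _kind, text in rest)
```

With this, parse_after_marker(tokenize("[a!b c"), "[") gives back "a!b c", and the same lexer serves a rule that starts with '!' instead, which is the point of moving the decision into the parser.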

(b) In ANTLR 2, you can switch lexers along the way (I'd have to look up the concrete way of doing this, but it's documented). This is a little tricky, because you must make sure that one lexer has not read too many characters from your input stream when you switch to another one, but it makes things like "use a special lexer inside format strings or inside regex strings" very easy. I have no idea how to do this in ANTLR 3 at the moment (I'd have to look up one of our projects where I jumped between 4 lexers in ANTLR 2; maybe a colleague of mine has converted this to ANTLR 3, in which case he might know the trick ...).
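
The idea behind (b), and the read-ahead caveat, can be sketched in a few lines of Python. This is not ANTLR's API (ANTLR 2 documents a TokenStreamSelector class for the real thing); SharedInput, make_lexer, the two lexer names and the format-string token set are all invented for this illustration. Both lexers read from one shared input and advance its position by exactly one token, so nothing has been consumed too far when the active lexer changes.

```python
import re


class SharedInput:
    """Input shared by all lexers; `pos` is the only record of what was consumed."""

    def __init__(self, text):
        self.text = text
        self.pos = 0


def make_lexer(spec):
    """Build a next_token(src) function from (name, regex) pairs."""
    pattern = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in spec))

    def next_token(src):
        m = pattern.match(src.text, src.pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {src.pos}")
        src.pos = m.end()  # consume exactly one token, no read-ahead buffer
        return m.lastgroup, m.group()

    return next_token


# Hypothetical lexers: one for ordinary text, one for the inside of "..." strings.
LEXERS = {
    "main": make_lexer([("QUOTE", r'"'), ("WS", r"\s+"), ("WORD", r'[^\s"]+')]),
    "format": make_lexer([("QUOTE", r'"'), ("PLACEHOLDER", r"%[a-z]"), ("TEXT", r'[^"%]+')]),
}


def tokenize(text):
    """Run the active lexer and toggle between the two at every QUOTE token."""
    src, active, tokens = SharedInput(text), "main", []
    while src.pos < len(src.text):
        kind, value = LEXERS[active](src)
        tokens.append((active, kind, value))
        if kind == "QUOTE":
            active = "format" if active == "main" else "main"
    return tokens
```

For the input 'say "hi %s" now', the text between the quotes is tokenized by the "format" lexer (TEXT, PLACEHOLDER), everything else by "main". If a lexer buffered characters beyond its current token, the switch would lose or duplicate input, which is exactly the tricky part mentioned above.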

Regards
Harald



-------- Original Message --------
> Date: Sun, 25 Nov 2007 20:07:23 +1100
> From: Basil Shkara <bshkara at gmail.com>
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] Match any unicode character

> Hi there,
> 
> I've been running into a dead-end for what seems like a simple problem  
> and hopefully someone out there has come across it in the past.
> 
> I have token definitions like so:
> WORD		:	~('\r' | '\n' | WS)+;
> WS				:   ' ' | '\t' | '\r' | '\n';
> 
> And I would like to be able to have a rule like this:
> matchthis:	'[' (WORD | WS)+;
> 
> Essentially, I would like to match a '[' followed by 1 or more unicode  
> characters as well as whitespace after it.
> 
> If I change the definition of WORD to be:
> WORD		:	~('\r' | '\n' | WS | '[')+;
> 
> Then my parser is able to match the rule above, however I would like  
> to be able to use this WORD token elsewhere in my parser grammar to  
> match other things like:
> 
> nowmatchthis:	'!' (WORD | WS)+;
> 
> This then entails creating another WORD rule excluding the '!'  
> literal.  However ANTLR doesn't like the existence of 2 of these token  
> definitions because it means that other tokens I have defined are  
> 'unreachable'.
> 
> So my question is how would I approach something like this?  I just  
> would like to match any unicode character after certain key characters.
> 
> Appreciate any help on the matter.
> 
> Thanks!
