[antlr-interest] String lexing and partial tokens

Mon Nov 27 13:14:04 PST 2006

-----Original Message-----
From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-

> Looking in the archives seems to indicate that ! is no longer  
> supported, which is a pain in the butt.  It was a nice simple  
> syntax, and the alternatives all seem a lot more complicated.   
> Incidentally, what *is* the recommended alternative?  Further posts  
> seemed to suggest that calling $setText or setText would do the  
> trick, but those functions don't seem to exist in the C runtime  
> (which is what I'm trying to use); or at least I can't find them.

You can ask Jim Idle about that, but we decided to use methods for  
setting the text rather than implementing ! which makes everything  
inefficient. I could swear there was something in the documentation.

I have not yet brought the C codegen templates up to date with the new $lexerelement codegen templates; one of the additions I have not got to yet is $setText.

However, in this case all you want to do is create the token so that its start and end exclude the delimiters, so you don’t need $setText.

The lexer emits a token automatically if you have not emitted one, but if you call emitNew() (C output) in an action, the lexer uses that token instead. So, to exclude the start and end character:

STRING: '"' (~'"')* '"'
	{
		/* Emit the token ourselves so that its stop index excludes the closing quote. */
		emitNew(type,line,charPosition,channel,start,getCharIndex()-1);
	}
	;

Notes: 
  That might actually need to be getCharIndex()-2.
  I will add the lexer $lexerelement constructs shortly, at which point you will need to use:
	emitNew($type,$line,$charPosition,$channel,$start,getCharIndex()-1);
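
To see why the stop argument might need to be getCharIndex()-2 rather than -1, here is a minimal sketch of the index arithmetic in plain C. It assumes that token stop offsets are inclusive and that the lexer's current index points at the next unconsumed character after the closing quote; neither assumption is confirmed above. Span, inner_span_from_next and span_length are hypothetical names, not part of the ANTLR C runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: inclusive start/stop offsets into the input buffer. */
typedef struct {
    size_t start;
    size_t stop;
} Span;

/*
 * open_quote: offset of the opening '"'.
 * next_index: offset of the first character after the closing '"'
 *             (ASSUMED to be what the lexer reports after matching the rule).
 * Returns the inclusive span of the characters between the quotes.
 * For an empty string "" the result has stop == start - 1.
 */
Span inner_span_from_next(size_t open_quote, size_t next_index)
{
    Span s;

    /* A quoted string occupies at least two characters. */
    assert(next_index >= open_quote + 2);

    s.start = open_quote + 1;   /* skip the opening quote */
    s.stop  = next_index - 2;   /* next_index - 1 is the closing quote */
    return s;
}

/* Number of characters in an inclusive span. */
size_t span_length(Span s)
{
    return s.stop + 1 - s.start;
}
```

For the input say "hi" now, the opening quote is at offset 4 and the next unread character is at offset 8, so the inner span is 5..6, which covers hi. If the current index instead pointed at the closing quote itself, the inclusive stop would be one less than that index, which is the -1 case.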

Also, note that the string is not actualized (no ANTLR3_STRING C structure is created) unless you reference .text in the parser, whereas $setText will create an ANTLR3_STRING even if you don’t end up needing it. This is so that the lexer does not create any strings that are not needed, with the associated malloc() and free() etc. It does mean, though, that you get a new copy of the token text every time you reference .text, so if you want to reference the text multiple times, create a local pointer and reuse it:

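summat: s=STRING x y z
	{
		/* Pointer type: the string is dereferenced with -> below. */
		ANTLR3_STRING *theString;

		/* Fetch the text once and reuse the pointer. */
		theString = $s.text;

		printf("Char 3 is '%c', char 5 is '%c'\n", theString->charAt(theString, 3), theString->charAt(theString, 5));
	}
	;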
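
The cost of referencing the text repeatedly can be illustrated with a small stand-alone sketch in plain C. This is not the runtime's implementation; LazyToken and lazy_token_text are hypothetical names, and the token is assumed to hold only a pointer to the input plus inclusive start/stop offsets. Each call allocates a fresh copy, which is why holding on to one pointer is cheaper than asking again.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token: no text is stored, only where it lives in the input. */
typedef struct {
    const char *input;  /* the whole input buffer */
    size_t start;       /* inclusive offset of the first character */
    size_t stop;        /* inclusive offset of the last character */
} LazyToken;

/*
 * Build a NUL-terminated copy of the token text on demand.
 * Every call allocates a new buffer; the caller must free() it.
 * Returns NULL if the allocation fails.
 */
char *lazy_token_text(const LazyToken *tok)
{
    size_t len;
    char *copy;

    assert(tok != NULL && tok->input != NULL);

    /* An empty token is represented by stop == start - 1. */
    len = (tok->stop + 1 >= tok->start) ? tok->stop + 1 - tok->start : 0;

    copy = malloc(len + 1);
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, tok->input + tok->start, len);
    copy[len] = '\0';
    return copy;
}
```

Calling lazy_token_text() twice on the same token yields two separate buffers with identical contents, so code that needs the text several times should call it once and keep the result in a local variable.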

The text of an ANTLR3_STRING is available at theString->chars, but if this is a UTF16 or other non-ASCII input, then you would have to code a lot of the string handling yourself. The ANTLR3_STRING comes with a number of helper methods that are encoding independent, such as subString(), charAt(), append(), addc() and so on, so it is generally better to work with these.
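
As an illustration of why accessor methods are safer than indexing the raw character buffer, here is a minimal sketch in plain C of a string type that carries its own charAt function pointer. DemoString and its helpers are hypothetical and only mimic the calling style used above; they are not the ANTLR3_STRING definition. The 16-bit variant treats each 16-bit code unit as one character and ignores surrogate pairs, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical string type: the accessor knows the element width. */
typedef struct DemoString DemoString;
struct DemoString {
    const void *chars;  /* raw storage; element width depends on encoding */
    size_t len;         /* length in characters, not bytes */
    uint32_t (*charAt)(const DemoString *s, size_t i);
};

/* Accessor for 8-bit storage. */
static uint32_t char_at_8(const DemoString *s, size_t i)
{
    assert(i < s->len);
    return ((const uint8_t *)s->chars)[i];
}

/* Accessor for 16-bit storage (one code unit per character assumed). */
static uint32_t char_at_16(const DemoString *s, size_t i)
{
    assert(i < s->len);
    return ((const uint16_t *)s->chars)[i];
}

/* Wrap an 8-bit buffer of len characters. */
DemoString demo_string_8(const char *chars, size_t len)
{
    DemoString s = { chars, len, char_at_8 };
    return s;
}

/* Wrap a 16-bit buffer of len characters. */
DemoString demo_string_16(const uint16_t *chars, size_t len)
{
    DemoString s = { chars, len, char_at_16 };
    return s;
}
```

Code that calls s.charAt(&s, i) works unchanged for both kinds of string, whereas code that casts chars to char * and indexes it would read half of a 16-bit character.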

As people are starting to try the C output, it is probably time for me to create the C equivalents of the Java examples and provide the doxygen runtime docs etc. I will shortly ask Ter to put the C runtime distribution as a download link on the ANTLR3 page (VS2005 .sln and ./configure).

Jim
