[antlr-interest] Tokens that span across char streams

Wed Aug 26 14:34:32 PDT 2009

At 07:57 27/08/2009, Stanislav Sokorac wrote:
>I have a language that allows macros to be used just about 
>anywhere, which makes things a bit difficult. For example, a 
>macro could define half a string, and something like this is 
>legal:
>
>#define FOO "start of a string
>String a = FOO end of a string";
>
>If I do on-the-fly substitution of macros by switching char 
>streams (using the include file technique from the FAQ), the 
>lexer cannot recognize the string in the second line: it parses 
>the macro text, encounters EOF of that stream, throws an 
>exception ("couldn't match anything"), and then starts over at 
>the second half of the string, again not matching anything.
>
>What's a good way to "smooth over" the EOF bump, and merge the 
>streams into one from the lexer's point of view? Do I need to 
>implement a custom CharStream to do something like this?

That wouldn't really help.  Consider the case where there is 
another unrelated line (perhaps another #define) between those two 
above.  Unless you switch streams at each *use* of a #defined 
name (and not just on include), you still won't be able to lex a 
complete string token.

You *could* define a preprocessing CharStream that recognises a 
#define and stores the characters up to EOL without passing them 
on, then later recognises the name of the defined symbol and 
passes on its value rather than the name.  That's probably about 
the best you can do if you don't want to do a full preprocessing 
pass.  I think it'll still have the same line numbering issues 
though.
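
As a rough illustration of that idea, here is a minimal sketch in 
Python rather than against ANTLR's actual CharStream interface. 
The function name preprocess_chars and the (char, line, column) 
tuple are made up for this example. It assumes a #define takes up 
exactly one line and that a macro name is expanded wherever it 
appears as a whole word. Tagging every emitted character with its 
original position is one possible way to ease the line numbering 
problem, not something the stock runtime does for you.

```python
from typing import Dict, Iterator, Tuple

# One output character plus the (line, column) of the original source
# text it should be reported against. Hypothetical shape for this sketch.
Char = Tuple[str, int, int]


def preprocess_chars(source: str) -> Iterator[Char]:
    """Yield characters with macros expanded, tagged with original positions."""
    macros: Dict[str, str] = {}
    for line_no, line in enumerate(source.splitlines(keepends=True), start=1):
        stripped = line.rstrip("\r\n")
        if stripped.startswith("#define "):
            # File the definition away instead of passing it on.
            parts = stripped.split(None, 2)
            if len(parts) >= 2:
                macros[parts[1]] = parts[2] if len(parts) > 2 else ""
            if line != stripped:
                # Keep the line break so later text stays separated.
                yield ("\n", line_no, len(stripped) + 1)
            continue
        col = 0
        while col < len(line):
            ch = line[col]
            if ch.isalnum() or ch == "_":
                # Consume a whole word, then decide whether it is a macro.
                end = col
                while end < len(line) and (line[end].isalnum() or line[end] == "_"):
                    end += 1
                word = line[col:end]
                if word in macros:
                    # Every expanded character points at the macro's use site.
                    for c in macros[word]:
                        yield (c, line_no, col + 1)
                else:
                    for i, c in enumerate(word):
                        yield (c, line_no, col + 1 + i)
                col = end
            else:
                yield (ch, line_no, col + 1)
                col += 1
```

Fed the two lines from the original question, this produces one 
character sequence in which the string literal is contiguous, so 
an ordinary string rule can match it. Characters that come from 
the macro body all report the position where FOO was used.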

>Of course, I could have a pre-process run that replaces all the 
>macros, and then run through the resulting code, but I'd like to 
>avoid that because (1) it's slow to go through the file twice, 
>and (2) the character/line numbers in tokens will be messed up in 
>the second run and it'll take a bit of work to bring them back to 
>the original locations.

I really think that the preprocess run is the simplest option for 
that sort of thing.  Either that or using only single characters 
as tokens; given your example you definitely can't match an entire 
quoted string as a single token without grief.

Is this your own language?  If it is, you should consider changing 
that rule and adopting something like C's string constant folding 
instead; for example, the C equivalent of your example above 
(which uses complete, lexable tokens) is this:

#define FOO "start of a string"
String a = FOO " end of a string";

(In C, adjacent string constants are treated as if they were a 
single constant with the text appended together.)
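
If you go that way, the folding can happen as a small pass over 
the token stream after lexing. The following is only a sketch: the 
(kind, text) token shape and the "STRING" kind are invented for 
the example, and it assumes every string token still carries its 
surrounding double quotes and that naive text joining is good 
enough (it ignores escape-sequence corner cases).

```python
from typing import List, Tuple

# (kind, text) pairs; the kinds used here are hypothetical.
Token = Tuple[str, str]


def fold_adjacent_strings(tokens: List[Token]) -> List[Token]:
    """Merge runs of adjacent STRING tokens into a single STRING token."""
    folded: List[Token] = []
    for kind, text in tokens:
        if kind == "STRING" and folded and folded[-1][0] == "STRING":
            prev_text = folded[-1][1]
            # Drop the closing quote of the previous literal and the
            # opening quote of this one, then join the contents.
            folded[-1] = ("STRING", prev_text[:-1] + text[1:])
        else:
            folded.append((kind, text))
    return folded


# Tokens for: String a = "start of a string" " end of a string";
tokens = [
    ("ID", "String"), ("ID", "a"), ("OP", "="),
    ("STRING", '"start of a string"'),
    ("STRING", '" end of a string"'),
    ("OP", ";"),
]
result = fold_adjacent_strings(tokens)
```

With this rule every macro body is made of whole tokens, so macro 
expansion can work on tokens rather than characters and the lexer 
never has to see a string that spans two streams.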