[antlr-interest] Tokens that span across char streams
Gavin Lambert
antlr at mirality.co.nz
Wed Aug 26 14:34:32 PDT 2009
At 07:57 27/08/2009, Stanislav Sokorac wrote:
>I have a language that allows macros to be used just about
>anywhere, which makes things a bit difficult. For example, a
>macro could define half a string, and something like this is
>legal:
>
>#define FOO "start of a string
>String a = FOO end of a string";
>
>If I do on-the-fly substitution of macros by switching char
>streams (using the include file technique from the FAQ), the lexer
>cannot recognize the string in the second line: it lexes the
>macro text, encounters EOF of that stream, throws an exception
>("couldn't match anything"), and then starts over at the second
>half of the string, again not matching anything.
>
>What's a good way to "smooth over" the EOF bump, and merge the
>streams into one from the lexer's point of view? Do I need to
>implement a custom CharStream to do something like this?
That wouldn't really help. Consider the case where there is
another unrelated line (perhaps another #define) between those two
above. Unless you were switching streams when you encounter a *use*
of a #defined name (and not just on include), you still
wouldn't be able to match a complete string token.

You *could* define a preprocessing CharStream that recognises a
#define and simply files the characters up to EOL away
without actually passing them on, then later recognises the name
of the defined symbol and passes on its value rather than the
name. That's probably about the best you can do if you don't
want to do a full preprocessing pass. I think it'll still have
the same line numbering issues, though.
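
As a rough illustration of that kind of preprocessing stream, here is a
minimal Python sketch. It is not ANTLR's CharStream interface; the names
macro_expanding_chars, DEFINE and IDENT are made up for this example. It
assumes one #define per line with the value running to EOL, and it
substitutes names blindly, without tracking whether the text is already
inside a string. A bare newline is emitted in place of each #define so
line counts stay aligned, but columns after an expansion still shift.

```python
import re

# Assumed directive shape: "#define NAME value-up-to-end-of-line"
DEFINE = re.compile(r"#define\s+(\w+)\s?(.*)")
IDENT = re.compile(r"[A-Za-z_]\w*")


def macro_expanding_chars(lines):
    """Yield characters one at a time, hiding #define lines and
    replacing each use of a defined name with its stored value."""
    macros = {}
    for line in lines:
        line = line.rstrip("\n")
        m = DEFINE.match(line)
        if m:
            # File the value away; pass on only a newline so that
            # later line numbers still match the original file.
            macros[m.group(1)] = m.group(2)
            yield "\n"
            continue
        # Naive substitution: any identifier that names a macro is replaced.
        expanded = IDENT.sub(
            lambda ident: macros.get(ident.group(0), ident.group(0)), line
        )
        yield from expanded
        yield "\n"


source = [
    '#define FOO "start of a string',
    'String a = FOO end of a string";',
]
print("".join(macro_expanding_chars(source)))
```

A lexer reading from this stream sees the second line as one complete
quoted string, which is the point of doing the substitution below the
lexer rather than by switching streams.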
>Of course, I could have a pre-process run that replaces all the
>macros, and then run through the resulting code, but I'd like to
>avoid that because (1) it's slow to go through the file twice,
>and (2) the character/line numbers in tokens will be messed up in
>the second run and it'll take a bit of work to bring them back to
>the original locations.
I really think that a preprocessing pass is the simplest option for
that sort of thing. Either that, or using only single characters
as tokens; given your example, you definitely can't match an entire
quoted string as a single token without grief.
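
On the position concern with a separate pass: one possible approach
(a sketch only, in Python, with hypothetical names preprocess and
origins) is to have the pass record, for every character it outputs,
the original line and column it came from. Here every character of an
expanded macro is mapped to the place where the macro was used. The
same assumptions apply as for any naive expander: one #define per
line, value up to EOL, no awareness of strings or comments.

```python
import re

# Assumed directive shape: "#define NAME value-up-to-end-of-line"
DEFINE = re.compile(r"#define\s+(\w+)\s?(.*)")
IDENT = re.compile(r"[A-Za-z_]\w*")


def preprocess(source):
    """Expand macros and return (text, origins), where origins[i] is the
    (line, column) in the original source that text[i] came from.
    Lines are 1-based and columns are 0-based."""
    macros = {}
    text, origins = [], []

    def emit(s, line_no, col, fixed):
        # fixed=True pins every character to one column (the macro use site).
        for offset, ch in enumerate(s):
            text.append(ch)
            origins.append((line_no, col if fixed else col + offset))

    for line_no, line in enumerate(source.split("\n"), start=1):
        m = DEFINE.match(line)
        if m:
            macros[m.group(1)] = m.group(2)
        else:
            pos = 0
            for ident in IDENT.finditer(line):
                name = ident.group(0)
                if name in macros:
                    emit(line[pos:ident.start()], line_no, pos, fixed=False)
                    emit(macros[name], line_no, ident.start(), fixed=True)
                    pos = ident.end()
            emit(line[pos:], line_no, pos, fixed=False)
        # A newline is always written, so the output ends with one
        # even if the input did not.
        emit("\n", line_no, len(line), fixed=False)
    return "".join(text), origins


src = '#define FOO "start of a string\nString a = FOO end of a string";'
text, origins = preprocess(src)
print(text)
print(origins[text.index('"')])
```

After lexing the expanded text, a token's start offset can be looked up
in origins to report the location in the file the user actually wrote.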
Is this your own language? If it is, you should consider changing
that rule and adopting something like C's string constant folding
instead; for example, the C equivalent of your example above (which
uses complete, lexable tokens) is this:

  #define FOO "start of a string"
  String a = FOO " end of a string";

(In C, two adjacent string constants are treated as if they were
a single constant with the text appended together.)
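
To show why that rule is easier on a lexer, here is a small Python
sketch of folding done after tokenising. The TOKEN pattern and the
function names are made up for illustration and cover only what the
example needs. The input is the second line as it would look once FOO
has been replaced by its value. Every string is matched as a complete
token first, and adjacent string tokens are then merged.

```python
import re

# A complete double-quoted string, or a word, or a single punctuation char.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|\w+|[^\s\w]')


def is_string(tok):
    """True for a complete quoted string token."""
    return len(tok) >= 2 and tok[0] == '"' and tok[-1] == '"'


def fold_adjacent_strings(source):
    """Tokenise source and merge runs of adjacent string literals."""
    folded = []
    for tok in TOKEN.findall(source):
        if is_string(tok) and folded and is_string(folded[-1]):
            # Drop the closing quote of the previous literal and the
            # opening quote of this one, then join the contents.
            folded[-1] = folded[-1][:-1] + tok[1:]
        else:
            folded.append(tok)
    return folded


line = 'String a = "start of a string" " end of a string";'
print(fold_adjacent_strings(line))
```

Because each literal is closed before the next one opens, no token ever
has to span two character streams.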