[antlr-interest] Tokens that span across char streams

Stanislav Sokorac sokorac at gmail.com
Wed Aug 26 15:05:20 PDT 2009


I failed to mention that the language requires a special character before a
macro *use*, which makes macro uses easy to detect (and allows the
flexibility of using them just about anywhere)... so the line below should
read:

String a = #FOO end of a string";

Unfortunately, it's not my language, so I'm stuck with the way it is.

With the above in mind, it's easy for the lexer to detect both macro
definitions and macro uses. So what I'm doing is having the lexer record
macro definitions, and then having it perform stream switches when a macro
use is detected. But since the lexer can't look past the EOF of one stream
into the next one, that doesn't really help my case below. On the other
hand, having the char stream itself handle the insertion of the macro value
will mess up the line/character numbering, like you said.

Who keeps track of line numbers and character positions? The char stream, or
the lexer? If it's the stream, then I could "fool" the lexer into recording
the original numbers into tokens... If it's the lexer, then I'd have to do
post-processing on tokens to fix them, I guess.
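
If it turns out to be the stream that tracks them (ANTLR 3's CharStream
interface does at least expose getLine()/setLine() and
getCharPositionInLine()/setCharPositionInLine()), the "fooling" I have in
mind would be as simple as stamping the use site's coordinates onto the
substituted stream before switching to it. A rough sketch (the helper name
is made up, and a multi-line macro value would still drift, so it's only
approximate):

import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;

public class MacroValueStream {
    // Wrap the macro's value in a stream that starts out reporting the
    // line/column of the place where the macro was used.
    public static CharStream at(String macroValue, CharStream useSite) {
        ANTLRStringStream s = new ANTLRStringStream(macroValue);
        s.setLine(useSite.getLine());
        s.setCharPositionInLine(useSite.getCharPositionInLine());
        return s;
    }
}

Whether that actually survives the lexer's own bookkeeping is exactly what
I'm asking above.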

Stan

On Wed, Aug 26, 2009 at 5:34 PM, Gavin Lambert <antlr at mirality.co.nz> wrote:

> At 07:57 27/08/2009, Stanislav Sokorac wrote:
>
>> I have a language that allows macros to be used just about anywhere, which
>> makes things a bit difficult. For example, a macro could define half a
>> string, and something like this is legal:
>>
>> #define FOO "start of a string
>> String a = FOO end of a string";
>>
>> If I do on-the-fly substitution of macros by switching char streams (using
>> the include file technique from the FAQ), the lexer cannot recognize the
>> string in the second line: it parses the macro text, encounters the EOF of
>> that stream, throws an exception ("couldn't match anything"), and then
>> starts over at the second half of the string, again not matching anything.
>>
>> What's a good way to "smooth over" the EOF bump, and merge the streams
>> into one from lexer's point of view? Do I need to implement a custom
>> CharStream to do something like this?
>>
>
> That wouldn't really help.  Consider the case where there is another
> unrelated line (perhaps another #define) between those two above.  Unless
> you switch streams when you encounter a *use* of the #defined tokens (and
> not just on include), you still wouldn't be able to parse a complete
> string token.
>
> You *could* define a preprocessing CharStream that recognises a #define,
> files away the characters up to the end of the line without actually
> passing them on, and then later recognises the name of the defined symbol
> and passes on its value rather than the name.  That's probably about the
> best you can do if you don't want to do a full preprocessing pass.  I
> think it'll still have the same line numbering issues, though.
>
>> Of course, I could have a pre-process run that replaces all the macros,
>> and then run through the resulting code, but I'd like to avoid that because
>> (1) it's slow to go through the file twice, and (2) the character/line
>> numbers in tokens will be messed up in the second run and it'll take a bit
>> of work to bring them back to the original locations.
>>
>
> I really think that the preprocess run is the simplest option for that sort
> of thing.  Either that or using only single characters as tokens; given your
> example you definitely can't match an entire quoted string as a single token
> without grief.
>
> Is this your own language?  If it is, you should consider changing that
> rule and adopting something like C's string constant folding instead; for
> example, the C equivalent of your example above (which uses complete,
> lexable tokens) is this:
>
> #define FOO "start of a string"
> String a = FOO " end of a string";
>
> (In C, two sequential string constants are treated as if they were a single
> constant with the text appended together.)
>
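
In case I do end up going with the full preprocessing pass, my plan for
concern (2) would be to remember, for every character of the expanded text,
where it came from, and fix up token positions afterwards. A rough, untested
sketch of what I mean (class and method names are mine; it ignores details
like a # appearing inside a string literal):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Expands "#define NAME value" / "#NAME" and records the original line and
// column of every character it emits, so token positions can be mapped back.
public class MacroPreprocessor {

    public static class Pos {
        public final int line, col;
        Pos(int line, int col) { this.line = line; this.col = col; }
    }

    private final Map<String, String> macros = new HashMap<String, String>();
    private final StringBuilder expanded = new StringBuilder();
    private final List<Pos> origin = new ArrayList<Pos>();

    public String expand(String src) {
        int line = 1, col = 0, i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '#' && src.startsWith("#define ", i)) {
                // Record the definition; emit nothing for it (the newline
                // that ends the line is copied by the branch below, so later
                // lines keep their original line numbers).
                int eol = src.indexOf('\n', i);
                if (eol < 0) eol = src.length();
                String[] def = src.substring(i + 8, eol).split("\\s+", 2);
                macros.put(def[0], def.length > 1 ? def[1] : "");
                col += eol - i;
                i = eol;
            } else if (c == '#') {
                // "#NAME": emit the recorded value, but point every emitted
                // character back at the use site.
                int j = i + 1;
                while (j < src.length() && Character.isJavaIdentifierPart(src.charAt(j))) j++;
                String value = macros.get(src.substring(i + 1, j));
                if (value == null) value = src.substring(i, j);   // unknown name: keep as-is
                for (int k = 0; k < value.length(); k++) {
                    expanded.append(value.charAt(k));
                    origin.add(new Pos(line, col));
                }
                col += j - i;
                i = j;
            } else {
                expanded.append(c);
                origin.add(new Pos(line, col));
                if (c == '\n') { line++; col = 0; } else { col++; }
                i++;
            }
        }
        return expanded.toString();
    }

    // Original position of the character at index idx of the expanded text.
    public Pos originOf(int idx) { return origin.get(idx); }
}

After lexing the expanded text, each token's start index (e.g.
CommonToken.getStartIndex()) could then be run through originOf() to restore
the original line/column, which is the post-processing step I mentioned.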