[antlr-interest] Comment rule matches links

Thomas Brandon tbrandonau at gmail.com
Tue Aug 26 04:23:40 PDT 2008


On Tue, Aug 26, 2008 at 7:22 PM, Jenny Balfer
<ai06087 at lehre.ba-stuttgart.de> wrote:
>> > If you don't need the regular expression structure you can just match
>> > them as a single token. If you do need structure you could either just
>> > use a separate grammar in a later phase to process the single tokens
>> > in your main token stream or use the technique from the island-grammar
>> > example, which shows island grammars under lexer control,
>>
>> Island grammars under lexer control will probably not cut it, as the
>> '/' token is ambiguous in many languages, e.g.
>> "int x = 5 / 3;" vs "match(/3/, ...);". The grammar/lexer switching
>> has to be done in the parser.
>>
>> Regards,
>> Martin
>
> That's right, but how can I implement an island grammar under parser
> control if the string matching already was done in the lexer?
>
> Because comments are not part of the program statements, they have
> to be skipped in the lexer, and to keep strings containing //s from being
> skipped, I implemented the string token rule in the lexer as well. So I really
> need a way to handle my regexp problem in the lexer, too - or is there
> another way?
>
Sorry, I misread the OP's post as saying that the '/' was not lexically
ambiguous in the language at issue. If it is lexically ambiguous, and
assuming you can't alter the language or establish conventions, then
curse the language designers as you push on with a nasty solution.
I think you are misunderstanding the parser-controlled multi-lexer
approach. You switch to a new lexer, so it doesn't matter what the
main lexer would do. However, I would advise against going this route
unless absolutely necessary. This approach does not (generally) work
in the presence of parser lookahead past the lexer switch, because the
lookahead tokens have already been produced by the old lexer. Thus
seemingly unrelated changes can cause problems which are hard to
detect, diagnose and solve.
An alternate solution is to analyse the context of the / in the lexer
to determine if it is a regular expression or division (e.g. a
division cannot directly follow an equals, so if the last non-hidden
token was an equals it must be a regular expression). Depending on
your language it may be hard to formulate and implement an adequate
set of rules at the lexer level.
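
As a rough illustration of the lexer-context idea, here is a minimal
hand-written sketch in Python (not ANTLR code). The token kinds, the
REGEX_ALLOWED_AFTER set and the tokenize function are all made up for
this example; a real language would need its own, probably longer,
set of rules.

```python
import re

# Kinds of the last non-hidden token after which '/' may start a regex.
# Hypothetical rule set for this sketch only; None means "start of input".
REGEX_ALLOWED_AFTER = {None, "OP", "LPAREN", "COMMA", "SEMI"}

TOKEN_SPEC = [
    ("WS",     r"\s+"),
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("COMMA",  r","),
    ("SEMI",   r";"),
    ("OP",     r"[=+\-*]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

# A regex literal: '/', then escaped characters or anything but '/', '\' or newline, then '/'.
REGEX_LITERAL = re.compile(r"/(?:\\.|[^/\\\n])+/")


def tokenize(text):
    """Return (kind, text) pairs, deciding '/' from the last non-hidden token."""
    tokens = []
    last_kind = None  # kind of the last non-hidden token seen so far
    pos = 0
    while pos < len(text):
        if text[pos] == "/":
            match = None
            if last_kind in REGEX_ALLOWED_AFTER:
                match = REGEX_LITERAL.match(text, pos)
            if match:
                kind, value, pos = "REGEX", match.group(), match.end()
            else:
                kind, value, pos = "DIV", "/", pos + 1
        else:
            match = MASTER.match(text, pos)
            if match is None:
                raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
            kind, value, pos = match.lastgroup, match.group(), match.end()
            if kind == "WS":
                continue  # hidden token: does not update last_kind
        tokens.append((kind, value))
        last_kind = kind
    return tokens
```

With these rules, the '/' in "int x = 5 / 3;" follows a NUMBER and so
comes out as DIV, while the '/' in "match(/3/, x);" follows an LPAREN
and is lexed as one REGEX token.
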
Alternatively, leave the work to the parser and have your lexer only
recognise very basic structures (e.g. no comments or strings).
Performance will likely suffer, as will grammar readability (since the
parser rules would have to allow for comments everywhere).
Or perhaps you could use two parsers, the first doing token stream
rewriting. Again have a basic lexer, then have a parser that goes
through and pulls out just enough context to figure the ambiguous bits
out and rewrite the token stream for a subsequent parser to handle.
Depending on your language and how little you can get away with
handling, this may or may not help.
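
To make the token stream rewriting idea a bit more concrete, here is a
small Python sketch (again not ANTLR code). It assumes a basic lexer
that emits every '/' as a SLASH token and keeps whitespace as WS
tokens; a simple hand-written pass stands in for the first parser. The
token kinds, the REGEX_ALLOWED_AFTER set and rewrite_regexes are
hypothetical names, and escaped slashes inside a regex are not handled.

```python
# Kinds of the last non-whitespace token after which '/' is taken to open
# a regex. Hypothetical rule set; None means "start of input".
REGEX_ALLOWED_AFTER = {None, "EQUALS", "LPAREN", "COMMA"}


def rewrite_regexes(tokens):
    """Merge SLASH ... SLASH runs into one REGEX token where context allows it."""
    out = []
    last_kind = None  # kind of the last non-whitespace token emitted
    i = 0
    while i < len(tokens):
        kind, text = tokens[i]
        if kind == "SLASH" and last_kind in REGEX_ALLOWED_AFTER:
            # Find the closing SLASH, if there is one.
            close = next(
                (j for j in range(i + 1, len(tokens)) if tokens[j][0] == "SLASH"),
                None,
            )
            if close is not None:
                # Glue the original token texts back together, whitespace included.
                text = "".join(t for _, t in tokens[i:close + 1])
                kind = "REGEX"
                i = close
        out.append((kind, text))
        if kind != "WS":
            last_kind = kind
        i += 1
    return out


# Usage: tokens as the basic lexer might produce them for "match(/a b/, x)".
basic_tokens = [
    ("IDENT", "match"), ("LPAREN", "("), ("SLASH", "/"), ("IDENT", "a"),
    ("WS", " "), ("IDENT", "b"), ("SLASH", "/"), ("COMMA", ","),
    ("WS", " "), ("IDENT", "x"), ("RPAREN", ")"),
]
rewritten = rewrite_regexes(basic_tokens)
```

The second parser then sees a single REGEX token where the regular
expression was, and an ordinary SLASH wherever the '/' is a division.
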
The final option is to investigate other parser tools which don't
separate parser and lexer so much, e.g. Rats. You could still use
ANTLR tree parsers through a tree adaptor to have their power or
integrate with other ANTLR parsers/lexers.

Tom.

