[antlr-interest] [newbie] Lexer Confusion
UW Student
uw.anon at gmail.com
Sat Jul 5 12:10:26 PDT 2008
Johannes Luber wrote:
> UW Student wrote:
> > Johannes Luber wrote:
> > > UW Student wrote:
> > >>>> I would really prefer to have a single token. Is it possible to
> > >>>> modify Johannes' version to handle that?
> > >> >
> > >>> Try this:
> > >>>
> > >>> TERM1: '.' ( ('.')=> '.' {$type = TERM2;} )* ;
> > >>
> > >> Will that ensure that the number of DOTs consumed is even? If I
> > >> understand correctly, it will simply catch any sequence of more than
> > >> one DOT.
> > >>
> > >> -Andrew
> > >>
> > >
> > > No, it won't. Try this:
> > >
> > > TERM1: '.' ( ('.')=> '.' {$type = TERM2;} '..'* ) ;
> > >
> > > But I wonder: Do you really need to create such a rule for a particular
> > > language? Using a regex should be faster there anyway.
> > >
> > > Johannes
> > >
> >
> > Doesn't that have the original problem? If there are three DOTs, then
> > it will fail with a mismatched token exception, won't it?
> >
> > The '...'+ tokens are filler (like whitespace or comments) in the
> > language I'm translating. It would be much easier to look past them if
> > they were lumped together.
> >
> > I agree that a regex would be a good solution for matching this token. I
> > was hoping the Antlr lexer provided that kind of regex support.
>
> If you want to treat '..' as filler, why don't you change the channel of
> the TERM1 and TERM2 tokens? That way the number of tokens is irrelevant
> (beyond some small increase of the memory footprint) and your grammar
> can ignore those tokens at a later stage.
> >
> > Thanks,
> > Andrew
> >
> > p.s. Is this thread starting to clutter the mailing list? At what point
> > is it appropriate to take it offline?
> >
> As long as it is about ANTLR you can use the mailing list.
>
> Johannes
>
UW Student wrote:
> Alright, time to be more explicit about the real problem.
>
> '...' serves as the line continuation character. Usually, that means
> that it can be ignored as long as it also hides the following newline.
> Unfortunately, in this language, it is sometimes significant.
>
> Ex 1
>
> x = 1 + ...
> 2;
>
> In this case, it can be ignored.
>
> Ex 2
>
> [x ...
> y];
>
> In this case, I want to turn it into a comma separating x and y.
>
> Ex 3
>
> [x ...
> ...
> y];
>
> In this case, I want a single comma, not two.
>
> As far as I can tell, to be able to handle this sort of thing I need to
> keep '...' on the default channel (rather than the hidden channel). This
> means that in some cases I will need to account for it in predicates
> that use lookahead. It would be much easier if contiguous filler text
> was treated as a single token so that it would occupy only one lookahead
> position.
>
> Secondarily, if they are lumped together, it is easier to avoid
> inserting duplicate commas.
>
> I'm pretty sure I've solved the parser problem, so I may just switch to
> a separate JFlex lexer. However, I'd prefer to keep everything in Antlr
> if I can.
>
> -Andrew
>
Hi Johannes,
Using the syntactic predicate solved my problem. I had tried that
before, but I didn't realize that I had to apply the predicate to the
subrule calling the lexer fragment rather than to the fragment itself
(sometimes you have to look at the generated code ;)).
I'm still a little surprised that Antlr doesn't handle this sort of
thing automatically, but at least there's a way to do it manually.
Thanks for your help!
-Andrew
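
The thread does not show the final grammar, so the following is not the ANTLR rule that was used. It is only a minimal Python sketch of the behaviour described in Ex 1 to Ex 3, using the regex approach mentioned earlier in the thread: a whole run of '...' continuations is lumped into one token, which is then dropped outside brackets and turned into a single comma inside them. The token names (CONT, ID, NUM, OP) and the functions tokenize and resolve_continuations are made up for illustration, and the tiny token set is an assumption that covers just the three examples.

```python
import re

# One CONT token covers a whole run of "..." continuations, each together
# with the newline it hides, so the run occupies a single lookahead slot.
TOKEN_RE = re.compile(r"""
    (?P<CONT>(?:\.\.\.[ \t]*\n[ \t]*)+)
  | (?P<WS>[ \t\n]+)
  | (?P<ID>[A-Za-z_]\w*)
  | (?P<NUM>\d+)
  | (?P<OP>[=+;\[\],])
""", re.VERBOSE)


def tokenize(text):
    """Split text into (kind, text) pairs, skipping plain whitespace."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}: {text[pos]!r}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def resolve_continuations(tokens):
    """Drop CONT outside brackets; replace it with one comma inside them."""
    out = []
    depth = 0  # current nesting level of [ ... ]
    for kind, text in tokens:
        if text == "[":
            depth += 1
        elif text == "]":
            depth -= 1
        if kind == "CONT":
            if depth > 0:
                out.append(("OP", ","))
            continue
        out.append((kind, text))
    return out


if __name__ == "__main__":
    for src in ("x = 1 + ...\n2;", "[x ...\ny];", "[x ...\n...\ny];"):
        print(" ".join(t for _, t in resolve_continuations(tokenize(src))))
```

Because the run is one token, Ex 3 yields one comma without any extra bookkeeping. The sketch does not handle a continuation that follows an explicit comma inside brackets; that case would need a check on the previous token.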