[antlr-interest] [newbie] Lexer Confusion

Sat Jul 5 12:10:26 PDT 2008

Johannes Luber wrote:
> UW Student schrieb:
>  > Johannes Luber wrote:
>  >  > UW Student schrieb:
>  >  >>>> I would really prefer to have a single token.  Is it possible to
>  >  >>>> modify Johannes' version to handle that?
>  >  >>  >
>  >  >>> Try this:
>  >  >>>
>  >  >>> TERM1: '.' ( ('.')=> '.' {$type = TERM2;} )* ;
>  >  >>
>  >  >> Will that ensure that the number of DOTs consumed is even?  If I
>  >  >> understand correctly, it will simply catch any sequence of more than
>  >  >> one DOT.
>  >  >>
>  >  >> -Andrew
>  >  >>
>  >  >
>  >  > No, it won't. Try this:
>  >  >
>  >  > TERM1: '.' ( ('.')=> '.' {$type = TERM2;} '..'* ) ;
>  >  >
>  >  > But I wonder: Do you really need to create such a rule for a particular
>  >  > language? Doing some regex should be faster there anyway.
>  >  >
>  >  > Johannes
>  >  >
>  >
>  > Doesn't that have the original problem?  If there are three DOTs, then
>  > it will fail with a mismatched token exception, won't it?
>  >
>  > The '...'+ tokens are filler (like whitespace or comments) in the
>  > language I'm translating.  It would be much easier to look past them if
>  > they were lumped together.
>  >
>  > I agree that a regex would be a good solution for matching this token. I
>  > was hoping the Antlr lexer provided that kind of regex support.
> 
> If you want to treat '..' as filler, why don't you change the channel of
> the TERM1 and TERM2 tokens? That way the number of tokens is irrelevant
> (beyond some small increase in the memory footprint) and your grammar
> can ignore those tokens at a later stage.
>  >
>  > Thanks,
>  > Andrew
>  >
>  > p.s. Is this thread starting to clutter the mailing list?  At what point
>  > is it appropriate to take it offline?
>  >
> As long as it is about ANTLR, you can use the mailing list.
> 
> Johannes
> 
UW Student wrote:
 > Alright, time to be more explicit about the real problem.
 >
 > '...' serves as the line continuation character.  Usually, that means
 > that it can be ignored as long as it also hides the following newline.
 > Unfortunately, in this language, it is sometimes significant.
 >
 > Ex 1
 >
 > x = 1 + ...
 > 2;
 >
 > In this case, it can be ignored.
 >
 > Ex 2
 >
 > [x ...
 > y];
 >
 > In this case, I want to turn it into a comma separating x and y.
 >
 > Ex 3
 >
 > [x ...
 > ...
 > y];
 >
 > In this case, I want a single comma, not two.
 >
 > As far as I can tell, to be able to handle this sort of thing I need to
 > keep '...' on the default channel (rather than the hidden channel). This
 > means that in some cases I will need to account for it in predicates
 > that use lookahead.  It would be much easier if contiguous filler text
 > was treated as a single token so that it would occupy only one lookahead
 > position.
 >
 > Secondarily, if they are lumped together, it is easier to avoid
 > inserting duplicate commas.
 >
 > I'm pretty sure I've solved the parser problem, so I may just switch to
 > a separate JFlex lexer.  However, I'd prefer to keep everything in Antlr
 > if I can.
 >
 > -Andrew
 >

Hi Johannes,

Using the syntactic predicate solved my problem.  I had tried that 
before, but I didn't realize that I had to apply the predicate to the 
subrule calling the lexer fragment rather than to the fragment itself 
(sometimes you have to look at the generated code ;)).
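
As a rough illustration of the idea (in Python rather than ANTLR, and not the grammar actually used in this thread), the sketch below lumps contiguous '...' sequences into a single token by checking the lookahead before committing to each loop iteration, which is roughly what a syntactic predicate on the subrule does. The function name `tokenize` and the token type names FILLER, DOT and OTHER are hypothetical, chosen only for this example.

```python
def tokenize(text):
    """Split text into (type, lexeme) pairs.

    Contiguous '...' sequences become one FILLER token; any other
    '.' becomes a DOT token; everything else is lumped into OTHER.
    """
    tokens = []
    i = 0
    while i < len(text):
        if text.startswith("...", i):
            start = i
            # Guarded loop: only consume another '...' if the lookahead
            # confirms that all three characters are really there.
            while text.startswith("...", i):
                i += 3
            tokens.append(("FILLER", text[start:i]))
        elif text[i] == ".":
            # A '.' that does not start a full '...' stays a separate token.
            tokens.append(("DOT", "."))
            i += 1
        else:
            start = i
            while i < len(text) and text[i] != ".":
                i += 1
            tokens.append(("OTHER", text[start:i]))
    return tokens
```

With this guard, six dots in a row give one FILLER token, while four dots give one FILLER followed by a DOT instead of a failed match halfway through a second '...'.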

I'm still a little surprised that Antlr doesn't handle this sort of 
thing automatically, but at least there's a way to do it manually.
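
For the archive, the comma handling from the three examples in my earlier message can be sketched outside of ANTLR as well. This is only a minimal Python sketch under simplifying assumptions: it ignores string literals and comments, it treats any '[' ... ']' nesting as "inside brackets", and it does not check whether a comma is already present. The names `CONTINUATION_RUN` and `join_continuations` are made up for this example.

```python
import re

# A run of one or more '...' continuations, each followed by optional
# blanks and a newline, together with the blanks around the run.
CONTINUATION_RUN = re.compile(r"[ \t]*(?:\.\.\.[ \t]*\r?\n[ \t]*)+")


def join_continuations(source):
    """Replace each run of line continuations with a single separator.

    Inside square brackets the run becomes ', ' (one comma per run);
    everywhere else it becomes a single space.
    """
    out = []
    depth = 0  # current '[' nesting depth
    pos = 0
    while pos < len(source):
        m = CONTINUATION_RUN.match(source, pos)
        if m:
            out.append(", " if depth > 0 else " ")
            pos = m.end()
            continue
        ch = source[pos]
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth = max(depth - 1, 0)
        out.append(ch)
        pos += 1
    return "".join(out)
```

Because the whole run is matched at once, Ex 3 yields a single comma, which is the same reason a single lumped token is convenient in the parser.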

Thanks for your help!

-Andrew