[antlr-interest] Writing (for now) just a lexer

Wed Feb 11 17:50:53 PST 2009

On Thu, Feb 12, 2009 at 10:42 AM, Evan Driscoll <driscoll at cs.wisc.edu> wrote:

> Thanks for the responses before. I mention them below. I have a new
> question though, which is that it doesn't seem to ignore
> hidden/whitespace tokens.
>
> I have this definition:
>    NEWLINE : '\n'   { $channel=HIDDEN; }
>            | '\r\n' { $channel=HIDDEN; }
>            | '\r'   { $channel=HIDDEN; }
>            ;
> but it still gets returned by the nextToken() function.
>
> Does the $channel=HIDDEN only work if you start using it in the context
> of a lexer or something?

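Ignoring off-channel tokens (like those on the HIDDEN channel) is handled by
the TokenStream, not by the lexer: the lexer's nextToken() returns every
token, whatever its channel. Either attach a TokenStream (such as
CommonTokenStream) to the lexer and access tokens through that, or implement
your own handling of them. Or, as you found, use skip().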
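
If you go the "implement your own handling" route, the filtering itself is
small. Below is a minimal, self-contained sketch of the idea only. It does not
use the ANTLR runtime: Tok, ChannelFilterDemo, onChannel and rawTokens are all
hypothetical stand-ins made up for illustration, and the token list is
hard-coded rather than produced by your Tea lexer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a lexer token; NOT ANTLR's Token class.
class Tok {
    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN_CHANNEL = 99; // illustrative value for the hidden channel

    final String type;
    final String text;
    final int channel;

    Tok(String type, String text, int channel) {
        this.type = type;
        this.text = text;
        this.channel = channel;
    }
}

class ChannelFilterDemo {
    // Plays the role of the token stream: keeps only tokens on one channel.
    static List<Tok> onChannel(List<Tok> all, int channel) {
        List<Tok> kept = new ArrayList<>();
        for (Tok t : all) {
            if (t.channel == channel) {
                kept.add(t);
            }
        }
        return kept;
    }

    // What a bare lexer would hand back for "a\nb": every token, NEWLINE included.
    static List<Tok> rawTokens() {
        List<Tok> all = new ArrayList<>();
        all.add(new Tok("ID", "a", Tok.DEFAULT_CHANNEL));
        all.add(new Tok("NEWLINE", "\n", Tok.HIDDEN_CHANNEL));
        all.add(new Tok("ID", "b", Tok.DEFAULT_CHANNEL));
        return all;
    }

    public static void main(String[] args) {
        List<Tok> raw = rawTokens();
        List<Tok> visible = onChannel(raw, Tok.DEFAULT_CHANNEL);
        System.out.println(raw.size() + " raw, " + visible.size() + " visible");
    }
}
```

The point of the sketch is that the lexer still produces the NEWLINE token;
it is only whoever consumes the tokens that decides to step over it.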

>
>
> If you want to see code,
>   Grammar: www.cs.wisc.edu/~cs536-1/projects/pr2/html/Tea.g.html
>   Main fn: www.cs.wisc.edu/~cs536-1/projects/pr2/html/Test.java.html
>
> (While I'm throwing out questions, is there a better way to do the
> tokenNames() function in the main class? There's a getTokenNames()
> method in the generated class, but it doesn't seem to work; my
> impression is that it's generated for /parsers/ to contain the tokens
> they refer to.)
>
>
> Johannes Luber wrote:
> > Furthermore, tokens with similar starting sequences need to be treated
> > specially, as shown on
> > <http://www.antlr.org/wiki/display/ANTLR3/Lexer+grammar+for+floating+point,+dot,+range,+time+specs>.
>
> This is something I don't quite understand, because I can do this:
>    INT_LITERAL
>        : ('0'..'9')+
>        ;
>
>   FLOAT_LITERAL
>        : ('0'..'9')+ '.' ('0'..'9')+
>        ;
> and it seems to just work. I can even use "INT_LITERAL '.' INT_LITERAL"
> as the pattern for the float.
>
> (I'm ignoring exponential numbers.)
>
The problem is not tokens with similar starting sequences per se, but
matching a single token versus a sequence of tokens. When generating
lookahead, ANTLR only considers a single token, so your grammar causes no
problem: both alternatives are single tokens and so are fully considered.
If you add the following rules:
   DOT: '.';
   ID: ('a'..'z')+;
and give it input such as "10.abc", then I think you should see the problem.
At the start of the input ANTLR will look ahead over the initial digits and,
as soon as it sees the '.', decide that the token must be a float. When it
then encounters a letter rather than a digit, it will report an error because
it can't match the second ('0'..'9')+ loop of the float rule, even though the
input could have been lexed as INT_LITERAL DOT ID.
As ANTLR only considers single tokens, it is essentially matching the input
against an implicit rule:
   MTOKENS: (INT_LITERAL|FLOAT_LITERAL|DOT|ID);
You can see this rule in the generated mTokens method. Given this choice,
seeing the dot is enough to predict a float literal (given that the longest
match wins), so ANTLR commits to that decision.
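
To make the "10.abc" case concrete, here is a small hand-written simulation of
that decision. It is only a sketch, not ANTLR's generated code:
FloatLookaheadDemo and lex are hypothetical names, and the checkDigitAfterDot
flag is something I made up to switch between "commit to a float as soon as
the '.' is seen" and "only commit if a digit follows the '.'".

```java
import java.util.ArrayList;
import java.util.List;

class FloatLookaheadDemo {
    static boolean isDigit(char c) { return c >= '0' && c <= '9'; }
    static boolean isLetter(char c) { return c >= 'a' && c <= 'z'; }

    // Lexes the input into "TYPE:text" strings.
    // checkDigitAfterDot == false: a '.' after digits is enough to commit to
    //   FLOAT_LITERAL (the behaviour described above).
    // checkDigitAfterDot == true: a digit must follow the '.' before committing.
    static List<String> lex(String in, boolean checkDigitAfterDot) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            int start = i;
            char c = in.charAt(i);
            if (isDigit(c)) {
                while (i < in.length() && isDigit(in.charAt(i))) i++;
                boolean dot = i < in.length() && in.charAt(i) == '.';
                boolean digitAfter = i + 1 < in.length() && isDigit(in.charAt(i + 1));
                if (dot && (digitAfter || !checkDigitAfterDot)) {
                    i++; // consume the '.'
                    if (i >= in.length() || !isDigit(in.charAt(i))) {
                        // Committed to a float, but the second digit loop cannot match.
                        throw new IllegalStateException("expected digit at index " + i);
                    }
                    while (i < in.length() && isDigit(in.charAt(i))) i++;
                    out.add("FLOAT_LITERAL:" + in.substring(start, i));
                } else {
                    out.add("INT_LITERAL:" + in.substring(start, i));
                }
            } else if (c == '.') {
                i++;
                out.add("DOT:.");
            } else if (isLetter(c)) {
                while (i < in.length() && isLetter(in.charAt(i))) i++;
                out.add("ID:" + in.substring(start, i));
            } else {
                throw new IllegalStateException("unexpected character at index " + i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lex("10.5", false));   // both variants agree on a real float
        System.out.println(lex("10.abc", true));  // extra check: INT_LITERAL, DOT, ID
        try {
            lex("10.abc", false);                 // early commit: fails at the 'a'
        } catch (IllegalStateException e) {
            System.out.println("error: " + e.getMessage());
        }
    }
}
```

With the early commit, "10.abc" fails in the middle of the float; with the
extra one-character check it comes out as an int, a dot and an identifier.
Getting ANTLR to make that extra check is the kind of special treatment
Johannes was referring to.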

Tom.

>
>
> Thomas Brandon wrote:
> > You could emit your own token in a lexer rule in ANTLR which will
> > prevent ANTLR generating its own token. Something like:
> > INT:
> >     '0'..'9'+
> >     {
> >          // $type and $channel are the rule's current values; state.type
> >          // and state.channel are only assigned once the rule has finished.
> >          Token t = new IntLiteralToken(input, $type,
> >                      $channel, state.tokenStartCharIndex,
> >                      getCharIndex()-1);
> >          t.setLine(state.tokenStartLine);
> >          t.setText(state.text);
> >          t.setCharPositionInLine(state.tokenStartCharPositionInLine);
> >          emit(t);
> >     }
> >     ;
>
> This makes sense; thanks.
>
>
> > You can easily isolate it in the parser through a parser rule that
> > takes care of it like (check syntax):
> > int: INT -> INT<IntLiteralNode>[$INT];
> > Where the IntLiteralNode constructor takes care of parsing the text to
> > get an int value. As you can see custom nodes are rather simpler to do
> > than custom tokens. And I think having custom nodes rather than tokens
> > would be the more standard practice as you would more often have most
> > of your functionality in the AST rather than the tokens.
>
> I'll give this some thought too. (Maybe my conceptual model is different
> for not having worked with a combined lexer/parser or something like
> that, or maybe for having done a lot of processing of ASTs that are
> unrelated to actual parsing, but it seems more natural to me to have the
> text->int conversion done outside of the AST constructor.)
>
>
> > The issue of whether int range checking should be in the lexer or
> > parser seems neither here nor there. It doesn't seem like something
> > that should halt further processing or that will introduce any
> > syntactic ambiguity so it doesn't seem like it *needs* to be in the
> > lexer.
>
> Well, it doesn't need to be in the lexer. From a compiler construction
> standpoint, it seems like a toss-up to me. I'm more interested from a
> pedagogical standpoint, and it's only a slight preference even then.
>
> (You could also split up range detection so it's separate from actually
> converting the number like the sample in the example Johannes Luber
> linked to, and check the range in the lexer but actually convert it
> later. But it seems to me the best way to check the range is to try to
> convert it anyway, so this seems the least attractive.)
>
>
> Evan Driscoll