[antlr-interest] Tokenizing question

Amal Khailtash akhailtash at gmail.com
Mon Feb 11 13:20:38 PST 2008


Yes, I think my biggest problem, as you mentioned, is that VALUE,
NUMBER and IDENTIFIER all overlap!  And yes, I get a NUMBER where I
expect a VALUE, or a VALUE where I expect an IDENTIFIER, and so on
in many other combinations.
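
The overlap can be made concrete with a small Python sketch. This is
not ANTLR: the regular expressions are hand transcriptions of the
three lexer rules quoted further down, and the helper name
matching_rules is made up for illustration. It lists every rule that
can match a whole word:

```python
import re

# Regex transcriptions of the three lexer rules in the quoted grammar.
# These are an assumption of this sketch, not something ANTLR generates.
RULES = {
    "VALUE": re.compile(r"[01xXzZ]"),
    "NUMBER": re.compile(r"[0-9]+"),
    "IDENTIFIER": re.compile(r"[!-~]+"),
}

def matching_rules(word):
    """Return the names of all rules that match the entire word."""
    return [name for name, rx in RULES.items() if rx.fullmatch(word)]

for word in ["1", "x", "42", "aae"]:
    print(word, matching_rules(word))
```

A single "1" is matched by all three rules and "x" by two of them, so
the character sequence alone cannot tell the lexer which token type
is meant.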

I completely understand that lexing is done as a separate stage,
before parsing, and that makes it difficult.  Tools like good old lex
have lexer states (start conditions) to do context-sensitive lexing.
ANTLR has no such lexer states.

So what is the recommended solution?  Should I merge all these rules
into one?  Can I not use syntactic predicates in the lexer to resolve
this?
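
To show what I mean by merging the rules into one, here is a rough
Python sketch of the idea (again not ANTLR, and every name in it is
hypothetical).  The "lexer" only produces whitespace-separated words,
and the "parser" rule for scalar_value_change decides from context
that the first character is the VALUE and the rest is the IDENTIFIER.
It assumes the value and identifier are written together, as in
"1aae":

```python
VALUE_CHARS = set("01xXzZ")

def split_scalar_change(word):
    """Split a word such as '1aae' into (value, identifier).

    Raises ValueError if the word cannot be a scalar value change.
    """
    if len(word) < 2 or word[0] not in VALUE_CHARS:
        raise ValueError("not a scalar value change: %r" % word)
    ident = word[1:]
    # IDENTIFIER is ('!'..'~')+ in the grammar: printable, non-space ASCII.
    if not all("!" <= c <= "~" for c in ident):
        raise ValueError("bad identifier in %r" % word)
    return word[0], ident

def parse_value_changes(text):
    """Treat every whitespace-separated word as one scalar value change."""
    return [split_scalar_change(w) for w in text.split()]

print(parse_value_changes("1aae\n0aae"))
```

The classification moves out of the lexer entirely, so the overlap
between the three token types no longer matters at that stage.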
-- Amal

On Feb 11, 2008 7:14 AM, Gavin Lambert <antlr at mirality.co.nz> wrote:

> At 11:33 11/02/2008, Amal Khailtash wrote:
>
> >Each word is separated with whitespace.  Again this is from a
> >Verilog VCD grammar that seems to have many ambiguities.  I
> >rewrote it to make it simple to explain.  Part of the original
> >grammar looks like:
> [...]
> >scalar_value_change
> >   : VALUE IDENTIFIER
> >   ;
> >
> >VALUE
> >   : ('0' | '1' | 'x' | 'X' | 'z' | 'Z')
> >   ;
> >
> >IDENTIFIER
> >   : ('!'..'~')+
> >   ;
> >
> >fragment
> >DIGIT
> >   : '0'..'9'
> >   ;
> >
> >NUMBER
> >   : DIGIT+
> >   ;
>
> You're going to have to be careful with that VALUE rule, since it
> intersects with both IDENTIFIER and NUMBER.  (This isn't
> necessarily an error, it just means you need to realise you might
> end up with a VALUE token when you're expecting one of the
> others.)
>
> >The problem is the scalar_value_change rule.  VALUE and
> >IDENTIFIER can be connected together.
> >
> >A sample scalar_value_change is:
> >
> >1aae
> >0aae
>
> I'm assuming there's also a WS rule with skip() or $channel =
> HIDDEN that you didn't present above.
>
> If both "1 aae" and "1aae" are valid constructs, then what you
> already have should be fine.  Tokens are not required to be
> separated by whitespace; whitespace (or any other skipped or
> hidden token) merely acts as a "break" between character sequences
> that could otherwise have been merged into a single token.
>
> In other words, "1 aae" should produce VALUE WS IDENTIFIER (with
> the WS skipped or ignored), and "1aae" should produce VALUE
> IDENTIFIER.  In both cases it matches the scalar_value_change
> rule.
>
> Now, "11aae" wouldn't -- that would be NUMBER IDENTIFIER.  But "1
> 1aae" would be VALUE WS VALUE IDENTIFIER, again with the WS
> skipped or ignored.  So you can see the whitespace acting as a
> token break here.