[antlr-interest] Tokenizing question
Gavin Lambert
antlr at mirality.co.nz
Mon Feb 11 04:14:51 PST 2008
At 11:33 11/02/2008, Amal Khailtash wrote:
>Each word is separated with whitespace. Again this is from a
>Verilog VCD grammar that seems to have many ambiguities. I
>rewrote it to make it simple to explain. Part of the original
>grammar looks like:
[...]
>scalar_value_change
> : VALUE IDENTIFIER
> ;
>
>VALUE
> : ('0' | '1' | 'x' | 'X' | 'z' | 'Z')
> ;
>
>IDENTIFIER
> : ('!'..'~')+
> ;
>
>fragment
>DIGIT
> : '0'..'9'
> ;
>
>NUMBER
> : DIGIT+
> ;
You're going to have to be careful with that VALUE rule, since it
overlaps with both IDENTIFIER and NUMBER: every character it
matches is also a valid IDENTIFIER, and '0' and '1' are also
valid NUMBERs. (This isn't necessarily an error; it just means
you need to realise you might end up with a VALUE token when
you're expecting one of the others.)
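One way to cope with that overlap is sketched below. It is only an
illustration: the rule name identifier_code is hypothetical and not
part of the original grammar. The idea is to accept any of the
overlapping token types wherever the parser expects an identifier.

```
// Hypothetical parser rule (name is an assumption, not from the
// original grammar): the lexer may return VALUE or NUMBER for text
// that was meant as an identifier, so accept all three here.
identifier_code
 : IDENTIFIER
 | VALUE
 | NUMBER
 ;
```

A rule such as scalar_value_change could then reference
identifier_code instead of IDENTIFIER directly.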
>The problem is the scalar_value_change rule. VALUE and
>IDENTIFIER can be connected together.
>
>A sample scalar_value_change is:
>
>1aae
>0aae
I'm assuming there's also a WS rule with skip() or $channel =
HIDDEN that you didn't present above.
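For reference, such a rule typically looks something like the
following in ANTLR 3 with the Java target. The rule name WS and the
exact set of whitespace characters are assumptions here, since that
part of the original grammar wasn't posted.

```
// Assumed whitespace rule: matched text is kept on the hidden
// channel, so the parser never sees it. Replacing the action with
// { skip(); } discards the token entirely instead.
WS
 : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; }
 ;
```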
If both "1 aae" and "1aae" are valid constructs, then what you
already have should be fine. Tokens are not required to be
separated by whitespace; whitespace (or any other skipped or
hidden token) merely acts as a "break" between character sequences
that could otherwise have been merged into a single token.
In other words, "1 aae" should produce VALUE WS IDENTIFIER (with
the WS skipped or ignored), and "1aae" should produce VALUE
IDENTIFIER. In both cases it matches the scalar_value_change
rule.
Now, "11aae" wouldn't -- that would be NUMBER IDENTIFIER. But "1
1aae" would be VALUE WS VALUE IDENTIFIER, again with the WS
skipped or ignored. So you can see the whitespace acting as a
token break here.