[antlr-interest] Tokenizing question

Mon Feb 11 04:14:51 PST 2008

At 11:33 11/02/2008, Amal Khailtash wrote:

>Each word is separated with whitespace.  Again this is from a 
>Verilog VCD grammar that seems to have many ambiguities.  I 
>rewrote it to make it simple to explain.  Part of the original 
>grammar looks like:
[...]
>scalar_value_change
>   : VALUE IDENTIFIER
>   ;
>
>VALUE
>   : ('0' | '1' | 'x' | 'X' | 'z' | 'Z')
>   ;
>
>IDENTIFIER
>   : ('!'..'~')+
>   ;
>
>fragment
>DIGIT
>   : '0'..'9'
>   ;
>
>NUMBER
>   : DIGIT+
>   ;

You're going to have to be careful with that VALUE rule, since it 
overlaps with both IDENTIFIER and NUMBER: every character VALUE 
accepts is also a valid IDENTIFIER, and '0' and '1' are also valid 
NUMBERs.  (This isn't necessarily an error, it just means you need 
to realise you might end up with a VALUE token when you're 
expecting one of the others.)
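
One common way to cope with that kind of overlap is to accept VALUE 
in the parser wherever the other tokens are legal.  A rough sketch 
only; identifier_code is a hypothetical rule name that isn't in your 
grammar, and whether NUMBER belongs in it depends on whether 
all-digit identifiers are legal in your input:

```antlr
// Hypothetical helper rule (not part of the original grammar):
// accept any token type the lexer might hand back for text that
// is meant to be an identifier.
identifier_code
   : IDENTIFIER
   | VALUE      // single characters such as "1" or "x"
   | NUMBER     // assumption: all-digit identifiers are allowed
   ;

scalar_value_change
   : VALUE identifier_code
   ;
```

The parser then doesn't care which of the overlapping token types 
the lexer picked; you can still get at the text via the token.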

>The problem is the scalar_value_change rule.  VALUE and 
>IDENTIFIER can be connected together.
>
>A sample scalar_value_change is:
>
>1aae
>0aae

I'm assuming there's also a WS rule with skip() or $channel = 
HIDDEN that you didn't present above.
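
Something along these lines is what I have in mind.  This assumes 
ANTLR 3 syntax and that spaces, tabs and line breaks are the only 
whitespace you care about; the rule name WS is my guess:

```antlr
// Assumed whitespace rule; adjust the character set to suit.
WS
   : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; }
   ;
```

Replacing the action with { skip(); } discards the token entirely 
instead of sending it to the hidden channel.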

If both "1 aae" and "1aae" are valid constructs, then what you 
already have should be fine.  Tokens are not required to be 
separated by whitespace; whitespace (or any other skipped or 
hidden token) merely acts as a "break" between character sequences 
that could otherwise have been merged into a single token.

In other words, "1 aae" should produce VALUE WS IDENTIFIER (with 
the WS skipped or ignored), and "1aae" should produce VALUE 
IDENTIFIER.  In both cases it matches the scalar_value_change 
rule.

Now, "11aae" wouldn't -- that would be NUMBER IDENTIFIER.  But "1 
1aae" would be VALUE WS VALUE IDENTIFIER, again with the WS 
skipped or ignored.  So you can see the whitespace acting as a 
token break here.
