[antlr-interest] Tokenizing question

Sun Feb 10 14:33:06 PST 2008

Each word is separated by whitespace.  Again, this is from a Verilog VCD
grammar that seems to have many ambiguities.  I simplified it to make it
easier to explain.  Part of the original grammar looks like this:

value_change_dump_definition
  : declaration_command* enddefinitions simulation_command*
  ;

declaration_command
  : <other_rules_here>
  | timescale
  ;

timescale
  : '$timescale' NUMBER time_unit '$end'
  ;

time_unit
  : 's'
  | 'ms'
  | 'us'
  | 'ns'
  | 'ps'
  | 'fs'
  ;

simulation_command
  : <other_rules_here>
  | value_change
  ;

value_change
  : scalar_value_change
  ;

scalar_value_change
  : VALUE IDENTIFIER
  ;

VALUE
  : ('0' | '1' | 'x' | 'X' | 'z' | 'Z')
  ;

IDENTIFIER
  : ('!'..'~')+
  ;

fragment
DIGIT
  : '0'..'9'
  ;

NUMBER
  : DIGIT+
  ;

The problem is the scalar_value_change rule.  VALUE and IDENTIFIER can be
written together with no whitespace between them.

Sample scalar_value_change entries are:

1aae
0aae
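
To make the intended reading of these samples concrete, here is a minimal
Python sketch (not ANTLR, and not part of the original grammar) that splits one
whitespace-delimited word into its VALUE and IDENTIFIER parts.  It assumes the
first character is always the VALUE and everything after it is the IDENTIFIER;
the function name split_scalar_value_change is hypothetical.

```python
# Characters accepted by the VALUE rule in the grammar above.
VALUE_CHARS = set("01xXzZ")


def split_scalar_value_change(word):
    """Split a word such as '1aae' into (value, identifier).

    Assumption: the first character is the VALUE and the rest of the
    word is the IDENTIFIER, made of printable characters '!'..'~'.
    """
    if len(word) < 2 or word[0] not in VALUE_CHARS:
        raise ValueError("not a scalar value change: %r" % word)
    identifier = word[1:]
    if not all("!" <= ch <= "~" for ch in identifier):
        raise ValueError("bad identifier characters in %r" % word)
    return word[0], identifier


print(split_scalar_value_change("1aae"))  # ('1', 'aae')
print(split_scalar_value_change("0aae"))  # ('0', 'aae')
```

This only works when the parser already knows it is looking at a
scalar_value_change; it says nothing about how the lexer should decide that.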

There are many ambiguities in this grammar, even at the lexer level, that are
giving me a hard time.
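
As a rough illustration of the lexer-level overlap, the Python sketch below
restates the token rules above as regular expressions and reports which of
them match a whole word.  The regular expressions are my own transcription of
the rules, LITERAL stands in for the quoted keywords used in the parser rules,
and matching_rules is a hypothetical helper, so treat it as an approximation
rather than what ANTLR actually does internally.

```python
import re

# Hand transcription of the token rules; an approximation, not ANTLR itself.
TOKEN_RULES = [
    ("VALUE", r"[01xXzZ]"),
    ("IDENTIFIER", r"[!-~]+"),
    ("NUMBER", r"[0-9]+"),
    # Quoted literals that appear in the parser rules.
    ("LITERAL", r"\$timescale|\$end|s|ms|us|ns|ps|fs"),
]


def matching_rules(word):
    """Return the names of all rules that match the entire word."""
    return [name for name, pattern in TOKEN_RULES
            if re.fullmatch(pattern, word)]


print(matching_rules("1"))     # ['VALUE', 'IDENTIFIER', 'NUMBER']
print(matching_rules("ns"))    # ['IDENTIFIER', 'LITERAL']
print(matching_rules("1aae"))  # ['IDENTIFIER']
```

A single character such as 1 is claimed by three rules, the time units are
also valid identifiers, and 1aae as a whole matches only IDENTIFIER, so a
lexer that prefers the longest match would presumably never produce VALUE
followed by IDENTIFIER for it.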

-- Amal

On Feb 10, 2008 4:44 PM, Mark Volkmann <r.mark.volkmann at gmail.com> wrote:

> On Feb 10, 2008 9:17 AM, Amal Khailtash <akhailtash at gmail.com> wrote:
> > In a language where whitespace is ignored, how can one tokenize and parse
> > constructs like this:
> >
> >   word : number identifier ;
> >
> > where 'word' could look like:
> >
> >   10 abc  or  10abc
> >
> > In this case number and identifier could have no whitespace between them
> > or have some.
>
> How can you tell where one "word" ends and the next begins?
> Is each "word" on its own line?
>
> --
> R. Mark Volkmann
> Object Computing, Inc.
>