[antlr-interest] Tokenizing question

Mark Volkmann r.mark.volkmann at gmail.com
Sun Feb 10 17:11:39 PST 2008


On Feb 10, 2008 4:33 PM, Amal Khailtash <akhailtash at gmail.com> wrote:
> Each word is separated with whitespace.

But the parts of each "word" are not. That seems to be the hard part!
For example, the input is "1aae", not "1 aae".
I've tried hard to figure this out and I'm coming up empty.
I hope someone else can offer a clue.
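
To make the target concrete, here is a minimal Python sketch (not ANTLR, and
not from the original thread) of the split a lexer would have to produce for
a scalar value change. It assumes the character sets from the quoted grammar
below: VALUE is one of 0, 1, x, X, z, Z and IDENTIFIER is one or more
characters in '!'..'~'. The names SCALAR_CHANGE and split_scalar_value_change
are hypothetical, chosen only for illustration.

```python
import re

# Assumption: regex translation of the quoted rules
#   VALUE      : '0' | '1' | 'x' | 'X' | 'z' | 'Z'
#   IDENTIFIER : ('!'..'~')+
SCALAR_CHANGE = re.compile(r"([01xXzZ])([!-~]+)")


def split_scalar_value_change(word):
    """Split a word such as '1aae' into (value, identifier)."""
    m = SCALAR_CHANGE.fullmatch(word)
    if m is None:
        raise ValueError("not a scalar value change: %r" % word)
    return m.group(1), m.group(2)


print(split_scalar_value_change("1aae"))  # ('1', 'aae')
print(split_scalar_value_change("0aae"))  # ('0', 'aae')
```

The first character is always taken as the value and everything after it as
the identifier, which is the behavior the parser rule VALUE IDENTIFIER seems
to want from the lexer.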

Is it true that the lexer uses the first lexer rule that matches when
multiple lexer rules match? That's what I thought, but now I'm not
sure.
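
Why the answer matters can be shown with a small Python sketch that applies
two common disambiguation strategies to the three token rules from the quoted
grammar. This does not claim to model what ANTLR's lexer actually does; the
regexes and the function names first_rule_wins and longest_match_wins are
assumptions made for illustration only.

```python
import re

# Token rules in the order they appear in the quoted grammar
# (hypothetical regex translations; the DIGIT fragment is inlined).
RULES = [
    ("VALUE", re.compile(r"[01xXzZ]")),
    ("IDENTIFIER", re.compile(r"[!-~]+")),
    ("NUMBER", re.compile(r"[0-9]+")),
]


def first_rule_wins(text):
    """Return (rule, lexeme) for the first listed rule matching at position 0."""
    for name, pattern in RULES:
        m = pattern.match(text)
        if m:
            return name, m.group(0)
    return None


def longest_match_wins(text):
    """Return (rule, lexeme) for the longest match; ties go to the earlier rule."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text)
        if m and (best is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))
    return best


print(first_rule_wins("1aae"))     # ('VALUE', '1')
print(longest_match_wins("1aae"))  # ('IDENTIFIER', '1aae')
```

In this sketch the two strategies disagree on "1aae", and NUMBER never wins
under either one, because IDENTIFIER's range '!'..'~' also covers the digits.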

> Again this is from a Verilog VCD
> grammar that seems to have many ambiguities.  I rewrote it to make it simple
> to explain.  Part of the original grammar looks like:
>
> value_change_dump_definition
>   : declaration_command* enddefinitions simulation_command*
>   ;
>
> declaration_command
>   : <other_rules_here>
>   | timescale
>   ;
>
> timescale
>   : '$timescale' NUMBER time_unit '$end'
>   ;
>
> time_unit
>   : 's'
>   | 'ms'
>   | 'us'
>   | 'ns'
>   | 'ps'
>   | 'fs'
>   ;
>
> simulation_command
>   : <other_rules_here>
>   | value_change
>   ;
>
> value_change
>   : scalar_value_change
>   ;
>
> scalar_value_change
>   : VALUE IDENTIFIER
>   ;
>
> VALUE
>   : ('0' | '1' | 'x' | 'X' | 'z' | 'Z')
>   ;
>
> IDENTIFIER
>   : ('!'..'~')+
>   ;
>
> fragment
> DIGIT
>   : '0'..'9'
>   ;
>
> NUMBER
>   : DIGIT+
>   ;
>
> The problem is the scalar_value_change rule.  VALUE and IDENTIFIER can be
> connected together.
>
> A sample scalar_value_change is:
>
> 1aae
> 0aae
>
> There are many ambiguities in this grammar, even at the lexer level, and
> they are giving me a hard time.
>
> -- Amal
>
>
>
> On Feb 10, 2008 4:44 PM, Mark Volkmann <r.mark.volkmann at gmail.com> wrote:
> >
> >
> >
> > On Feb 10, 2008 9:17 AM, Amal Khailtash <akhailtash at gmail.com> wrote:
> > > In a language where whitespace is ignored, how can one tokenize and parse
> > > constructs like this:
> > >
> > >   word : number identifier ;
> > >
> > > where 'word' could look like:
> > >
> > >   10 abc  or  10abc
> > >
> > > In this case number and identifier could have no whitespace between
> > > them or have some.
> >
> > How can you tell where one "word" ends and the next begins?
> > Is each "word" on its own line?
> >
> > --
> > R. Mark Volkmann
> > Object Computing, Inc.
> >
>
>



-- 
R. Mark Volkmann
Object Computing, Inc.

