[antlr-interest] Tokenizing question

Mon Feb 11 13:39:34 PST 2008

On Feb 11, 2008 3:20 PM, Amal Khailtash <akhailtash at gmail.com> wrote:
> Yes, I think my biggest problem, as you mentioned, is the fact that
>  VALUE, NUMBER and IDENTIFIER all overlap!  And yes, I get a NUMBER
>  where I expect a VALUE, or a VALUE where I expect an IDENTIFIER, and
>  so on in many other ways.
>
>
> I completely understand that lexing is done in a separate stage, and
>  that makes it difficult.  Tools like good old lex have lexer
>  states for context-sensitive lexing.  ANTLR does not have
>  context-sensitive lexing.
>
>
> So what is the recommended solution?  Should I merge all these rules
>  into one?  Can I not use syntactic predicates in the lexer to resolve
>  this?

I think Shmuel Siegel provided a solution in the thread on "Lexer
ambiguities". The trick is to make the most general of your
conflicting rules be a lexer rule and make the other, more specific
rules be parser rules.

Here's some sample input.

$timescale
  19ms
$end
1Amal

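Here's a grammar that parses it using the trick from Shmuel. VALUE
becomes the parser rule "value", and IDENTIFIER is restricted so that
it cannot start with a digit.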

grammar Verilog2;

value_change_dump_definition
  : declaration_command* simulation_command* EOF
  ; // enddefinitions, which sits between the two lists, is omitted here

declaration_command: timescale; // omitted other alternatives
timescale: '$timescale' NUMBER time_unit '$end';
time_unit: 's' | 'ms' | 'us' | 'ns' | 'ps' | 'fs';
simulation_command: value_change; // omitted other alternatives
value_change: scalar_value_change;
scalar_value_change: value IDENTIFIER;

value: '0' | '1' | 'x' | 'X' | 'z' | 'Z';
NUMBER: DIGIT+;
fragment DIGIT: '0'..'9';

// In this grammar an IDENTIFIER cannot begin with a digit,
// so it never overlaps NUMBER.
IDENTIFIER: ('!'..'/' | ':'..'~') ('!'..'~')*;

WHITESPACE: (' ' | '\t')+ { $channel = HIDDEN; };
NEWLINE: ('\r'? '\n')+ { $channel = HIDDEN; };
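
To see why the sample input tokenizes the way the parser rules expect,
here is a rough Python sketch of the lexer side of that grammar. It is
only an approximation: it assumes a plain longest-match scan in which
ties go to the earlier rule, and it tries the quoted literals from the
parser rules first. ANTLR's generated lexer decides by lookahead
prediction rather than by this loop, so the names here (RULES, tokenize)
are illustrative and not anything ANTLR provides.

```python
import re

# Quoted literals from the parser rules, tried before the named lexer rules.
# Assumption: ties in match length go to whichever rule is listed first.
LITERALS = ["$timescale", "$end", "s", "ms", "us", "ns", "ps", "fs",
            "0", "1", "x", "X", "z", "Z"]

RULES = [(lit, re.compile(re.escape(lit))) for lit in LITERALS] + [
    ("NUMBER", re.compile(r"[0-9]+")),
    # First character is any printable ASCII except a digit.
    ("IDENTIFIER", re.compile(r"[!-/:-~][!-~]*")),
    # WHITESPACE and NEWLINE both go to the hidden channel.
    ("HIDDEN", re.compile(r"[ \t]+|(?:\r?\n)+")),
]


def tokenize(text):
    """Return (token_name, matched_text) pairs, dropping hidden tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "HIDDEN":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


SAMPLE = "$timescale\n  19ms\n$end\n1Amal\n"

if __name__ == "__main__":
    for name, matched in tokenize(SAMPLE):
        print(name, repr(matched))
```

Under these assumptions the sample input comes out as '$timescale',
NUMBER, 'ms', '$end', '1', IDENTIFIER, which is the sequence the
timescale and scalar_value_change rules ask for. Note that in this
sketch a one-character identifier such as x would come out as the
literal 'x' rather than as IDENTIFIER.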

> On Feb 11, 2008 7:14 AM, Gavin Lambert <antlr at mirality.co.nz> wrote:
> >
> > At 11:33 11/02/2008, Amal Khailtash wrote:
> >
> > >Each word is separated with whitespace.  Again this is from a
> > >Verilog VCD grammar that seems to have many ambiguities.  I
> > >rewrote it to make it simple to explain.  Part of the original
> > >grammar looks like:
> > [...]
> >
> > >scalar_value_change
> > >   : VALUE IDENTIFIER
> > >   ;
> > >
> > >VALUE
> > >   : ('0' | '1' | 'x' | 'X' | 'z' | 'Z')
> > >   ;
> > >
> > >IDENTIFIER
> > >   : ('!'..'~')+
> > >   ;
> > >
> > >fragment
> > >DIGIT
> > >   : '0'..'9'
> > >   ;
> > >
> > >NUMBER
> > >   : DIGIT+
> > >   ;
> >
> > You're going to have to be careful with that VALUE rule, since it
> > intersects with both IDENTIFIER and NUMBER.  (This isn't
> > necessarily an error, it just means you need to realise you might
> > end up with a VALUE token when you're expecting one of the
> > others.)
> >
> >
> > >The problem is the scalar_value_change rule.  VALUE and
> > >IDENTIFIER can be connected together.
> > >
> > >A sample scalar_value_change is:
> > >
> > >1aae
> > >0aae
> >
> > I'm assuming there's also a WS rule with skip() or $channel =
> > HIDDEN that you didn't present above.
> >
> > If both "1 aae" and "1aae" are valid constructs, then what you
> > already have should be fine.  Tokens are not required to be
> > separated by whitespace; whitespace (or any other skipped or
> > hidden token) merely acts as a "break" between character sequences
> > that could otherwise have been merged into a single token.
> >
> > In other words, "1 aae" should produce VALUE WS IDENTIFIER (with
> > the WS skipped or ignored), and "1aae" should produce VALUE
> > IDENTIFIER.  In both cases it matches the scalar_value_change
> > rule.
> >
> > Now, "11aae" wouldn't -- that would be NUMBER IDENTIFIER.  But "1
> > 1aae" would be VALUE WS VALUE IDENTIFIER, again with the WS
> > skipped or ignored.  So you can see the whitespace acting as a
> > token break here.

-- 
R. Mark Volkmann
Object Computing, Inc.