[antlr-interest] Tokenizing question

Mon Feb 11 13:44:51 PST 2008

Amal,

While ANTLR doesn't explicitly support lexer states, you can use the 
@lexer::members block to add data members to your lexer, and of course 
you can put whatever code you want in the rule actions. I have used this 
with some success to "state-ify" parts of a lexer. You just have to code 
the state info by hand. :(

My approach dealt with using curly braces to delimit "opaque" versus 
"transparent" blocks. My grammar was such that every "opaque" block was 
preceded by one of a set of known tokens. Any time I saw such a token, I 
set an "opaque-block-coming" flag. The rest is pretty obvious: when the 
opening brace arrives, the brace rule checks the flag to decide how to 
lex the block, then clears it.
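
To make the flag idea concrete, here is a minimal sketch in plain Python 
rather than ANTLR. It is not the original grammar: the trigger keywords, 
the token names, and the tokenize function are all made up for 
illustration, and it assumes an opaque block runs to the matching close 
brace. In an ANTLR lexer the flag would live in @lexer::members and be 
set and tested from rule actions.

```python
import re

# Hypothetical keywords that announce an opaque block (illustration only).
OPAQUE_TRIGGERS = {"action", "native"}

WORD_RE = re.compile(r"[A-Za-z_]\w*")


def tokenize(text):
    """Split text into (type, value) pairs, using a hand-kept state flag."""
    tokens = []
    opaque_coming = False  # the hand-coded lexer "state"
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
            continue
        if ch == "{" and opaque_coming:
            # Opaque block: take everything up to the matching '}' as one token.
            depth, j = 1, i + 1
            while j < n and depth > 0:
                if text[j] == "{":
                    depth += 1
                elif text[j] == "}":
                    depth -= 1
                j += 1
            if depth != 0:
                raise ValueError("unterminated opaque block")
            tokens.append(("OPAQUE", text[i + 1 : j - 1]))
            opaque_coming = False
            i = j
            continue
        if ch in "{}":
            # Transparent block: braces are ordinary tokens.
            tokens.append(("LBRACE" if ch == "{" else "RBRACE", ch))
            opaque_coming = False
            i += 1
            continue
        m = WORD_RE.match(text, i)
        if m:
            word = m.group()
            tokens.append(("WORD", word))
            # Seeing a trigger word sets the flag; any other word clears it.
            opaque_coming = word in OPAQUE_TRIGGERS
            i = m.end()
            continue
        tokens.append(("CHAR", ch))
        opaque_coming = False
        i += 1
    return tokens
```

With this sketch, the input "rule { a } action { x { y } }" lexes the 
first block brace by brace, while the block after "action" comes back as 
a single OPAQUE token holding the raw text between the braces.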

=Austin

Amal Khailtash wrote:
> Yes, I think my biggest problem, as you mentioned, is the fact that
> VALUE, NUMBER and IDENTIFIER all overlap!  And yes, I get a NUMBER
> where I expect a VALUE, or a VALUE where I expect an IDENTIFIER, and
> so on in many other ways.
>
> I completely understand that lexing is done at a separate stage and
> that this makes it difficult.  Tools like good old lex have lexer
> states for context-sensitive lexing.  ANTLR does not have
> context-sensitive lexing.
>
> So what is the recommended solution?  Should I merge all these rules
> into one?  Can I not use syntactic predicates in the lexer to resolve
> this?
>
> -- Amal
>
> On Feb 11, 2008 7:14 AM, Gavin Lambert <antlr at mirality.co.nz> wrote:
>
>     At 11:33 11/02/2008, Amal Khailtash wrote:
>
>     >Each word is separated with whitespace.  Again this is from a
>     >Verilog VCD grammar that seems to have many ambiguities.  I
>     >rewrote it to make it simple to explain.  Part of the original
>     >grammar looks like:
>     [...]
>     >scalar_value_change
>     >   : VALUE IDENTIFIER
>     >   ;
>     >
>     >VALUE
>     >   : ('0' | '1' | 'x' | 'X' | 'z' | 'Z')
>     >   ;
>     >
>     >IDENTIFIER
>     >   : ('!'..'~')+
>     >   ;
>     >
>     >fragment
>     >DIGIT
>     >   : '0'..'9'
>     >   ;
>     >
>     >NUMBER
>     >   : DIGIT+
>     >   ;
>
>     You're going to have to be careful with that VALUE rule, since it
>     intersects with both IDENTIFIER and NUMBER.  (This isn't
>     necessarily an error, it just means you need to realise you might
>     end up with a VALUE token when you're expecting one of the
>     others.)
>
>     >The problem is the scalar_value_change rule.  VALUE and
>     >IDENTIFIER can be connected together.
>     >
>     >A sample scalar_value_change is:
>     >
>     >1aae
>     >0aae
>
>     I'm assuming there's also a WS rule with skip() or $channel =
>     HIDDEN that you didn't present above.
>
>     If both "1 aae" and "1aae" are valid constructs, then what you
>     already have should be fine.  Tokens are not required to be
>     separated by whitespace; whitespace (or any other skipped or
>     hidden token) merely acts as a "break" between character sequences
>     that could otherwise have been merged into a single token.
>
>     In other words, "1 aae" should produce VALUE WS IDENTIFIER (with
>     the WS skipped or ignored), and "1aae" should produce VALUE
>     IDENTIFIER.  In both cases it matches the scalar_value_change
>     rule.
>
>     Now, "11aae" wouldn't -- that would be NUMBER IDENTIFIER.  But "1
>     1aae" would be VALUE WS VALUE IDENTIFIER, again with the WS
>     skipped or ignored.  So you can see the whitespace acting as a
>     token break here.