[antlr-interest] Help with pesky Lexer determinism

Nigel Sheridan-Smith nbsherid at secsme.org.au
Tue Jun 7 03:52:55 PDT 2005


Hey Mark,

Something to be aware of: IPv6 addressing rules are very flexible... I came
across this a few months ago...

From RFC 2373 (http://www.ietf.org/rfc/rfc2373.txt), the following are all
valid addresses:

         1080:0:0:0:8:800:200C:417A  a unicast address
         FF01:0:0:0:0:0:0:101        a multicast address
         0:0:0:0:0:0:0:1             the loopback address
         0:0:0:0:0:0:0:0             the unspecified addresses

shortened versions:
         1080::8:800:200C:417A       a unicast address
         FF01::101                   a multicast address
         ::1                         the loopback address
         ::                          the unspecified addresses

And IPv4 equivalence:

         0:0:0:0:0:0:13.1.68.3

         0:0:0:0:0:FFFF:129.144.52.38

      or in compressed form:

         ::13.1.68.3

         ::FFFF:129.144.52.38


and then there are prefixes and things to contend with as well, depending on
your language.
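
To make the point about the long and shortened forms concrete, here is a small Java sketch (not part of my grammar) that checks whether two textual forms denote the same address. It leans on java.net.InetAddress, which only validates the format when handed a literal address and does not do a DNS lookup. The class and method names (Ipv6Forms, sameAddress) are made up for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

final class Ipv6Forms {
    // Parse a literal address into its raw bytes.
    // Assumption: callers pass literal addresses only, never host names.
    static byte[] bytes(String literal) {
        try {
            return InetAddress.getByName(literal).getAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not an address literal: " + literal, e);
        }
    }

    // Two spellings are equivalent when they parse to the same bytes.
    static boolean sameAddress(String a, String b) {
        return Arrays.equals(bytes(a), bytes(b));
    }

    public static void main(String[] args) {
        System.out.println(sameAddress("1080:0:0:0:8:800:200C:417A", "1080::8:800:200C:417A"));
        System.out.println(sameAddress("0:0:0:0:0:0:0:1", "::1"));
        System.out.println(sameAddress("0:0:0:0:0:0:13.1.68.3", "::13.1.68.3"));
    }
}
```

The takeaway for the lexer is that one address has many spellings, so a rule that only matches the full eight-group form will miss most real input.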

Currently, my language supports IPv4/v6 addressing to some degree, but it's
not finished, and I've chosen to use '#' delimiters to avoid conflicts with
numerical types. To deal with this issue, I just used a generic token
matcher (one that disambiguates IPv4, IPv6, dates and 'hashed' or binary
data, which all use the '#' delimiters), and I'll add some more semantic
checks further down the chain. However, you may not have the same liberty
to use such delimiters (depending on the requirements handed to you),
and so you will hit non-determinisms pretty quickly!
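
As a rough illustration of the "generic token, disambiguate afterwards" idea, here is a Java sketch that takes the text of one '#'-delimited token and decides what kind it is. It is not my actual lexer: the class name, the Kind enum and the regular expressions are all assumptions, and the 'hashed' or binary data case is left out. Note that a time-only value such as 12:30 is made of digits and colons only, so it also fits the loose IPv6 pattern; this sketch tests the date form first for that reason.

```java
import java.util.regex.Pattern;

final class HashTokenClassifier {
    enum Kind { IPV4, IPV6, DATE, UNKNOWN }

    // Deliberately loose patterns: they sort tokens into kinds,
    // they do not prove that the value is well formed.
    private static final Pattern IPV4 =
        Pattern.compile("\\d+(\\.\\d+){3}(/\\d+)?");
    private static final Pattern DATE =
        Pattern.compile("(\\d+/\\d+(/\\d+)?\\s+)?\\d+:\\d+(:\\d+(\\.\\d+)?)?");
    private static final Pattern IPV6 =
        Pattern.compile("[0-9A-Fa-f:.]*:[0-9A-Fa-f:.]*(/\\d+)?");

    static Kind classify(String token) {
        int n = token.length();
        if (n < 2 || token.charAt(0) != '#' || token.charAt(n - 1) != '#') {
            return Kind.UNKNOWN;
        }
        String body = token.substring(1, n - 1);   // strip the '#' delimiters
        if (IPV4.matcher(body).matches()) return Kind.IPV4;
        if (DATE.matcher(body).matches()) return Kind.DATE;   // before IPv6 on purpose
        if (IPV6.matcher(body).matches()) return Kind.IPV6;
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify("#::FFFF:129.144.52.38#"));
        System.out.println(classify("#7/6/2005 10:16#"));
    }
}
```

Without the delimiters there is no cheap way to know where such a token starts and ends, which is exactly where the non-determinism warnings come from.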

The best way to deal with this sort of thing is to start with protected lexer
definitions and combine them into one rule:

   IPADDRHASHDATE
       : (IPV4) => IPV4             {$setType(IPV4);}
       | (IPV6HASH) => IPV6HASH     {$setType(IPV6HASH);}
       | (DATEVALUE) => DATEVALUE   {$setType(DATEVALUE);}
       ;

   protected DATEVALUE
       : '#'!
         ( ( (DIGIT)+ FSLASH ) => (DIGIT)+ FSLASH (DIGIT)+ ( FSLASH (DIGIT)+ )? WS )?
         (DIGIT)+ COLON (DIGIT)+ ( COLON (DIGIT)+ ( DOT (DIGIT)+ )? )?
         '#'!
       ;

   protected IPV4
       : '#'! (DIGIT)+ DOT (DIGIT)+ DOT (DIGIT)+ DOT (DIGIT)+
         ( FSLASH (DIGIT)+ )? '#'!
       ;

   /* Too messy to do IPv6 addresses any other way */
   protected IPV6HASH
       : '#'! ( ':' | '.' | HEX | "#\\"! WS! "#"! )+
         ( FSLASH (DIGIT)+ )? '#'!
       ;


I'm then going to add action code that checks the tokens to ensure that
they are 'valid'.
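
I haven't written that action code yet, but the IPv6 check could look something like the Java sketch below. It follows the text representation rules in RFC 2373: eight groups of one to four hex digits, at most one "::", and an optional dotted-quad tail that counts as two groups. The names (Ipv6Check, isValid and the helpers) are hypothetical, and the sketch assumes any "/prefix" part has already been stripped off.

```java
final class Ipv6Check {
    // One group: 1 to 4 ASCII hex digits.
    static boolean isHexGroup(String s) {
        if (s.isEmpty() || s.length() > 4) return false;
        for (char c : s.toCharArray()) {
            boolean hex = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')
                       || (c >= 'A' && c <= 'F');
            if (!hex) return false;
        }
        return true;
    }

    // Embedded IPv4 tail: four decimal fields, each 0..255.
    static boolean isDottedQuad(String s) {
        String[] parts = s.split("\\.", -1);
        if (parts.length != 4) return false;
        for (String p : parts) {
            if (p.isEmpty() || p.length() > 3) return false;
            for (char c : p.toCharArray()) {
                if (c < '0' || c > '9') return false;
            }
            if (Integer.parseInt(p) > 255) return false;
        }
        return true;
    }

    // Number of 16-bit groups in a colon-separated run, or -1 if malformed.
    static int countGroups(String run, boolean allowV4Tail) {
        if (run.isEmpty()) return 0;
        String[] parts = run.split(":", -1);
        int n = 0;
        for (int i = 0; i < parts.length; i++) {
            boolean last = (i == parts.length - 1);
            if (isHexGroup(parts[i])) {
                n += 1;
            } else if (last && allowV4Tail && isDottedQuad(parts[i])) {
                n += 2;   // a dotted quad fills two 16-bit groups
            } else {
                return -1;
            }
        }
        return n;
    }

    static boolean isValid(String addr) {
        int gap = addr.indexOf("::");
        if (gap < 0) {
            return countGroups(addr, true) == 8;     // uncompressed form
        }
        if (addr.indexOf("::", gap + 1) >= 0) {
            return false;                            // a second "::" or ":::"
        }
        int left = countGroups(addr.substring(0, gap), false);
        int right = countGroups(addr.substring(gap + 2), true);
        // "::" must stand for at least one group of zeros.
        return left >= 0 && right >= 0 && left + right <= 7;
    }

    public static void main(String[] args) {
        System.out.println(isValid("1080::8:800:200C:417A"));
        System.out.println(isValid("1::2::3"));
    }
}
```

Doing this in action code keeps the lexer rule loose and simple, and pushes the fiddly counting into ordinary Java where it is easier to read and test.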

Of course, you might hit performance difficulties if a large number of
tokens pass through your syntactic predicates, but if you put the most
common token type first you will not be so severely affected.

I'll try and get back to you shortly with a more thorough answer/solution...
maybe someone else here has some ideas? Some of the grammars on the ANTLR
site have very long lexer definitions to deal with these sorts of issues. Of
course, that makes them very difficult to understand for everyone but the
author ;)

Nigel

--
Nigel Sheridan-Smith
PhD research student

Faculty of Engineering
University of Technology, Sydney
Phone: 02 9514 7946
Fax: 02 9514 2435
 

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Mark Bednarczyk
> Sent: Tuesday, 7 June 2005 10:16 AM
> To: ANTLR Interest
> Subject: RE: [antlr-interest] Help with pesky Lexer determinism
> 
> Well I have another problem that is a little more involved so
> maybe I can just get a couple of quick pointers. Same issue but
> now with IPv6 addresses, which actually step on the toes of the
> IDENT rule, since an IPv6 address is composed of HEX digits, so
> 'a'..'f' overlap with the IDENT rule's 'a'..'z'.
> 
> BTW: here is the format of IPv6 for those not familiar, (HEX HEX
> COLON (COLON HEX HEX)+) in simple case.
> 
> This is what I'm trying to do, but not really sure how to code
> it.
> 
> 1) Add the IPv6 block to NUM_INT rule with appropriate predicate
> of (NUM_HEX_2DIGIT COLON NUM_HEX_2DIGIT COLON) and I do not get
> any warning from NUM_INT rule.
> 
> 2) Add an appropriate predicate to the IDENT rule for the IPv6
> format (same as #1) and provide an empty condition block, or
> somehow tell the IDENT rule to fail based on the predicate, so it
> will move on and try NUM_INT, which should succeed.
> 
> Somehow I need the IDENT rule to fail on IPv6 address while
> matching on NUM_INT. Almost looks like I need to move both rules
> into a bigger common rule and manually set the token type.
> 
> Errors I'm getting now:
>     [antlr] ANTLR Parser Generator   Version 2.7.5 (20050128) 1989-2005 jGuru.com
>     [antlr] /home/markbe/prjs/jnetutils-0.1.0/src/antlr/npl/npl.g: warning:lexical nondeterminism between rules IDENT and NUM_INT upon
>     [antlr] /home/markbe/prjs/jnetutils-0.1.0/src/antlr/npl/npl.g: k==1:'A'..'F','a'..'f'
>     [antlr] /home/markbe/prjs/jnetutils-0.1.0/src/antlr/npl/npl.g: k==2:<end-of-token>,'0'..'9','A'..'F','L','X','a'..'f','l','x'
>     [antlr] /home/markbe/prjs/jnetutils-0.1.0/src/antlr/npl/npl.g: k==3:<end-of-token>,'0'..'9','A'..'F','L','a'..'f','l'
>     [antlr] /home/markbe/prjs/jnetutils-0.1.0/src/antlr/npl/npl.g: k==4:<end-of-token>,'0'..'9','A'..'F','L','a'..'f','l'
>     [antlr] warning: public lexical rule IDENT is optional (can match "nothing")
> 
> 
> And the relevant portion of the rules, skipping the bottom of
> NUM_INT since it's not the problem and is exactly the same as in
> java.g:
> 
> IDENT
> options {
>     testLiterals=true;
> }
>     :   (NUM_HEX_2DIGIT COLON NUM_HEX_2DIGIT COLON)=>
>         // EMPTY match
>     |   ('a'..'z'|'A'..'Z'|'_'|'$') ('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'$')*
>     ;
> 
> 
> // a numeric literal
> NUM_INT
>     {boolean isDecimal=false; Token t=null;}
>     :   (NUM_3DIGIT '.' NUM_3DIGIT '.' NUM_3DIGIT '.' NUM_3DIGIT)=>
>         (
>             NUM_3DIGIT '.' NUM_3DIGIT '.' NUM_3DIGIT '.' NUM_3DIGIT
>             { _ttype = IP_V4; }
>         )
>     |   (NUM_HEX_2DIGIT COLON NUM_HEX_2DIGIT COLON)=>
>         (
>             NUM_HEX_2DIGIT (COLON NUM_HEX_2DIGIT)+
>             { _ttype = IP_V6; }
>         )
>  < T R U N K A T E D>
> 
> protected NUM_HEX_2DIGIT: HEX_DIGIT (HEX_DIGIT)? ;
> 
> 
> 



