[antlr-interest] Help with pesky Lexer determinism

Fri Jun 10 07:09:57 PDT 2005

Nigel, thanks for the reply.
	Yes, I am aware of the various formatting possibilities for IPv6,
and I've worked out how to support all the forms except the IPv4
equivalence.

I merged the IPv4, IPv6 and IDENT rules into the NUM_INT rule and,
using predicates, am thus able to differentiate them without
warnings. The semantic part will be left for a higher-level
object to check and to reject invalid IPv4 or IPv6 formats, which is
probably a better place to raise an error message anyway,
since you can say specifically what type of address is
incorrectly formatted.
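
As a rough illustration of that higher-level check, here is a minimal
sketch in Java. The class and method names (AddressCheck, checkIPv4,
checkIPv6) are hypothetical and not part of the grammar or of ANTLR; the
sketch assumes the lexer hands over the token text unchanged, and, like
the rule below, it does not handle the IPv4-equivalence forms.

```java
// Hypothetical post-lexer validator; nothing here comes from ANTLR itself.
final class AddressCheck {

    /** Returns null when the text is acceptable, otherwise an error message. */
    static String checkIPv4(String text) {
        String[] octets = text.split("\\.", -1);
        if (octets.length != 4) {
            return "invalid IPv4 address '" + text + "': expected 4 octets";
        }
        for (String octet : octets) {
            // Each octet must be 1 to 3 decimal digits with a value of at most 255.
            if (!octet.matches("[0-9]{1,3}") || Integer.parseInt(octet) > 255) {
                return "invalid IPv4 address '" + text + "': bad octet '" + octet + "'";
            }
        }
        return null;
    }

    /** Counts colon-separated hex groups; returns -1 if any group is malformed. */
    private static int countGroups(String part) {
        if (part.isEmpty()) {
            return 0;
        }
        String[] groups = part.split(":", -1);
        for (String group : groups) {
            if (!group.matches("[0-9A-Fa-f]{1,4}")) {
                return -1;
            }
        }
        return groups.length;
    }

    /** Returns null when the text is acceptable, otherwise an error message. */
    static String checkIPv6(String text) {
        int gap = text.indexOf("::");
        if (gap != text.lastIndexOf("::")) {
            return "invalid IPv6 address '" + text + "': more than one '::'";
        }
        int head = (gap < 0) ? countGroups(text) : countGroups(text.substring(0, gap));
        int tail = (gap < 0) ? 0 : countGroups(text.substring(gap + 2));
        if (head < 0 || tail < 0) {
            return "invalid IPv6 address '" + text + "': bad hex group";
        }
        int total = head + tail;
        // Without "::" all 8 groups must be present; with "::" at most 7 may be.
        boolean ok = (gap < 0) ? total == 8 : total <= 7;
        return ok ? null : "invalid IPv6 address '" + text + "': wrong number of groups";
    }

    public static void main(String[] args) {
        System.out.println(checkIPv4("300.1.1.1"));
        System.out.println(checkIPv6("1:2:3"));
    }
}
```

Inputs such as 300.1.1.1 or 1:2:3 have the right shape for the lexer rule
but are not valid addresses, so this is where a specific message can be
produced.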

I don't want to go overboard with all the various forms and this
should be good enough:

protected NUM_3DIGIT: ('1'..'9') (('0'..'9') ('0'..'9')?)?
    ;

protected NUM_HEX_4DIGIT: HEX_DIGIT ((HEX_DIGIT) ((HEX_DIGIT) (HEX_DIGIT)?)?)?
    ;

NUM_IPADDR_IDENT_COLON
options {
    testLiterals = true;
}
    {boolean isDecimal=false; Token t=null; }

    /* IPv4 RULE */
    :   (NUM_3DIGIT '.' NUM_3DIGIT '.')=>
        (
            NUM_3DIGIT '.' NUM_3DIGIT '.' NUM_3DIGIT '.' NUM_3DIGIT
            { _ttype = IP_V4; }
        )

    /* IPv6 RULE */
    |
        (NUM_HEX_4DIGIT ':')=>
        (options { greedy = true; } :

            ((NUM_HEX_4DIGIT ':')+ ':')=>
            (NUM_HEX_4DIGIT ':')+ ':'
            (NUM_HEX_4DIGIT (':' NUM_HEX_4DIGIT)*)?

        |   NUM_HEX_4DIGIT (':' NUM_HEX_4DIGIT)+
        )
            { _ttype = IP_V6; }

        |   (':' ':' NUM_HEX_4DIGIT)=>
            (options { greedy = true; } :
            ':' ':'
            NUM_HEX_4DIGIT (':' NUM_HEX_4DIGIT)*
            { _ttype = IP_V6; }
            )

        |   ':' ':'
            { _ttype = IP_V6; }

        |   ':'
            { _ttype = COLON; }
        /* IDENT rule */
        |   ('a'..'z'|'A'..'Z'|'_'|'$')
            ('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'$')*
            { _ttype = IDENT; }

    /* Number beginning with '.' rule */
    |   '.' {_ttype = DOT;}
            (   ('0'..'9')+ (EXPONENT)? (f1:FLOAT_SUFFIX {t=f1;})?
                {
                if (t != null && t.getText().toUpperCase().indexOf('F')>=0) {
                    _ttype = NUM_FLOAT;
                }
                else {
                    _ttype = NUM_DOUBLE; // assume double
                }
                }
            )?

    /* Number beginning with a 0 rule */
    |   (   '0' {isDecimal = true;} // special case for just '0'
            (   ('x'|'X')
                (   // hex
                    // the 'e'|'E' and float suffix stuff look
                    // like hex digits, hence the (...)+ doesn't
                    // know when to stop: ambig.  ANTLR resolves
                    // it correctly by matching immediately.  It
                    // is therefore ok to hush warning.
                    options {
                        warnWhenFollowAmbig=false;
                    }
                :   HEX_DIGIT
                )+

            |   //float or double with leading zero
                (('0'..'9')+ ('.'|EXPONENT|FLOAT_SUFFIX)) => ('0'..'9')+

            |   ('0'..'7')+     // octal
            )?

        /* A regular number non-zero starting rule */
        |   ('1'..'9') ('0'..'9')*  {isDecimal=true;}   // non-zero decimal
        )
        (   ('l'|'L') { _ttype = NUM_LONG; }
        // only check to see if it's a float if looks like decimal so far
        |   {isDecimal}?
            (   '.' ('0'..'9')* (EXPONENT)? (f2:FLOAT_SUFFIX {t=f2;})?
            |   EXPONENT (f3:FLOAT_SUFFIX {t=f3;})?
            |   f4:FLOAT_SUFFIX {t=f4;}
            )
            {
            if (t != null && t.getText().toUpperCase().indexOf('F') >= 0) {
                _ttype = NUM_FLOAT;
            }
            else {
                _ttype = NUM_DOUBLE; // assume double
            }
            }
        )?
    ;
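
To make the order of the alternatives easier to follow, here is a rough
Java approximation of the address, colon and identifier branches using
regular expressions. It is only a sketch: TokenKindSketch and classify are
hypothetical names, HEX_DIGIT is assumed to be 0-9, A-F and a-f, whole
strings are matched rather than a prefix of the input, and the numeric
branches and testLiterals are left out.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the merged lexer rule; not generated by ANTLR.
final class TokenKindSketch {
    enum Kind { IP_V4, IP_V6, COLON, IDENT, OTHER }

    // NUM_3DIGIT: first digit '1'..'9', then up to two more digits.
    private static final String D3 = "[1-9][0-9]{0,2}";
    // NUM_HEX_4DIGIT: one to four hex digits (HEX_DIGIT definition assumed).
    private static final String H4 = "[0-9A-Fa-f]{1,4}";

    private static final Pattern IPV4 =
        Pattern.compile(D3 + "(\\." + D3 + "){3}");

    private static final Pattern IPV6 = Pattern.compile(
          "(" + H4 + ":)+:(" + H4 + "(:" + H4 + ")*)?"  // groups, "::", optional tail
        + "|" + H4 + "(:" + H4 + ")+"                   // no "::" at all
        + "|::" + H4 + "(:" + H4 + ")*"                 // leading "::" plus groups
        + "|::");                                       // bare "::"

    private static final Pattern IDENT =
        Pattern.compile("[A-Za-z_$][A-Za-z_0-9$]*");

    // Alternatives are tried in the same order as in the grammar rule.
    static Kind classify(String s) {
        if (IPV4.matcher(s).matches()) return Kind.IP_V4;
        if (IPV6.matcher(s).matches()) return Kind.IP_V6;
        if (s.equals(":")) return Kind.COLON;
        if (IDENT.matcher(s).matches()) return Kind.IDENT;
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        String[] samples = { "129.144.52.38", "FF01::101", "::", ":", "face", "face:b00c" };
        for (String sample : samples) {
            System.out.println(sample + " -> " + classify(sample));
        }
    }
}
```

Trying the address alternatives before IDENT is what lets face:b00c come
out as IP_V6 while face alone stays an IDENT. Note that NUM_3DIGIT as
written needs a leading '1'..'9', so an octet of 0 (as in 10.0.0.1) does
not take the IPv4 branch.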

>-----Original Message-----
>From: antlr-interest-bounces at antlr.org
>[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Nigel Sheridan-Smith
>Sent: Tuesday, June 07, 2005 6:53 AM
>To: 'ANTLR Interest'
>Subject: RE: [antlr-interest] Help with pesky Lexer determinism
>
>
>
>Hey Mark,
>
>Something to be aware of: IPv6 addressing rules are very flexible... I came
>across this a few months ago...
>
>From RFC 2373 (http://www.ietf.org/rfc/rfc2373.txt), the following are all
>valid addresses:
>
>         1080:0:0:0:8:800:200C:417A  a unicast address
>         FF01:0:0:0:0:0:0:101        a multicast address
>         0:0:0:0:0:0:0:1             the loopback address
>         0:0:0:0:0:0:0:0             the unspecified addresses
>
>shortened versions:
>        1080::8:800:200C:417A       a unicast address
>         FF01::101                   a multicast address
>         ::1                         the loopback address
>         ::                          the unspecified addresses
>
>And IPv4 equivalence:
>
>	   0:0:0:0:0:0:13.1.68.3
>
>         0:0:0:0:0:FFFF:129.144.52.38
>
>      or in compressed form:
>
>         ::13.1.68.3
>
>         ::FFFF:129.144.52.38
>
>
>and then there are prefixes and things to contend with as well, depending on
>your language.
>
>Currently, my language supports IPv4/v6 addressing to some degree, but it's
>not finished and I've chosen to use '#' delimiters to avoid conflicts with
>numerical types. To deal with this issue, I just used a generic token
>matcher (that disambiguates IPv4, IPv6, dates and 'hashed' or binary data,
>which all use the '#' delimiters), and then I'll add some more semantic
>checks further down the chain. However, you may not be at the same liberty
>in having such delimiters (depending on the requirements handed to you),
>and so you will hit non-determinisms pretty quickly!
>
>The best way to deal with this sort of thing is to start with protected lexer
>definitions, and combine them into one rule:
>
>   IPADDRHASHDATE: (IPV4) => IPV4           {$setType(IPV4);}
>                 | (IPV6HASH) => IPV6HASH   {$setType(IPV6HASH);}
>                 | (DATEVALUE) => DATEVALUE {$setType(DATEVALUE);};
>
>   protected DATEVALUE: '#'! ( ( (DIGIT)+ FSLASH ) => (DIGIT)+ FSLASH (DIGIT)+
>                               ( FSLASH (DIGIT)+ )? WS )?
>                             (DIGIT)+ COLON (DIGIT)+
>                             (COLON (DIGIT)+ (DOT (DIGIT)+ )? )?
>                             '#'!;
>
>   protected IPV4: '#'! (DIGIT)+ DOT (DIGIT)+ DOT (DIGIT)+ DOT (DIGIT)+
>                   (FSLASH (DIGIT)+ )? '#'!;
>   /* Too messy to do IPv6 addresses any other way */
>   protected IPV6HASH: '#'! (':' | '.' | HEX | "#\\"! WS! "#"! )+
>                       (FSLASH (DIGIT)+ )? '#'!;
>
>
>I'm going to add action code then, which checks the tokens to ensure that
>they are 'valid'.
>
>Of course, you might hit performance difficulties if a large number of
>tokens pass through your syntactic predicate, but if you put the most common
>token type first you will not be so severely affected.
>
>I'll try and get back to you shortly with a more thorough answer/solution...
>maybe someone else here has some ideas? Some of the grammars on the ANTLR
>site have very long lexer definitions to deal with these sorts of issues. Of
>course, that makes them very difficult to understand for everyone but the
>author ;)
>
>Nigel
>
>--
>Nigel Sheridan-Smith
>PhD research student
>
>Faculty of Engineering
>University of Technology, Sydney
>Phone: 02 9514 7946
>Fax: 02 9514 2435
>
>
>> -----Original Message-----
>> From: antlr-interest-bounces at antlr.org
>> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Mark Bednarczyk
>> Sent: Tuesday, 7 June 2005 10:16 AM
>> To: ANTLR Interest
>> Subject: RE: [antlr-interest] Help with pesky Lexer determinism
>>
>> Well I have another problem that is a little more involved so
>> maybe I can just get a couple of quick pointers. Same issue but
>> now with an IPv6 address that actually steps on the toes of the
>> IDENT rule, since an IPv6 address is comprised of HEX digits so
>> 'a'..'f' overlap with the IDENT rule of 'a'..'z'.
>>
>> BTW: here is the format of IPv6 for those not familiar, (HEX HEX
>> COLON (COLON HEX HEX)+) in simple case.
>>
>> This is what I'm trying to do, but not really sure how to code
>> it.
>>
>> 1) Add the IPv6 block to NUM_INT rule with appropriate predicate
>> of (NUM_HEX_2DIGIT COLON NUM_HEX_2DIGIT COLON) and I do not get
>> any warning from NUM_INT rule.
>>
>> 2) Add appropriate predicate to IDENT rule for IPv6 format (same
>> as #1) and provide an empty condition block or tell somehow
>> based on the predicate to fail the IDENT rule so it will move on
>> and try NUM_INT which should succeed.
>>
>> Somehow I need the IDENT rule to fail on IPv6 address while
>> matching on NUM_INT. Almost looks like I need to move both rules
>> into a bigger common rule and manually set the token type.
>>
>> Errors I'm getting now:
>>     [antlr] ANTLR Parser Generator   Version 2.7.5 (20050128) 1989-2005 jGuru.com
>>     [antlr] /home/markbe/prjs/jnetutils-0.1.0/src/antlr/npl/npl.g: warning:lexical nondeterminism between rules IDENT and NUM_INT upon
>>     [antlr] /home/markbe/prjs/jnetutils-0.1.0/src/antlr/npl/npl.g: k==1:'A'..'F','a'..'f'
>>     [antlr] /home/markbe/prjs/jnetutils-0.1.0/src/antlr/npl/npl.g: k==2:<end-of-token>,'0'..'9','A'..'F','L','X','a'..'f','l','x'
>>     [antlr] /home/markbe/prjs/jnetutils-0.1.0/src/antlr/npl/npl.g: k==3:<end-of-token>,'0'..'9','A'..'F','L','a'..'f','l'
>>     [antlr] /home/markbe/prjs/jnetutils-0.1.0/src/antlr/npl/npl.g: k==4:<end-of-token>,'0'..'9','A'..'F','L','a'..'f','l'
>>     [antlr] warning: public lexical rule IDENT is optional (can match "nothing")
>>
>>
>> And the relevant portion of NUM_INT, skipping the bottom since
>> it's not the problem and exactly the same as in java.g
>>
>> IDENT
>> options {
>>     testLiterals=true;
>> }
>>     :   (NUM_HEX_2DIGIT COLON NUM_HEX_2DIGIT COLON)=>
>>         // EMPTY match
>>     |   ('a'..'z'|'A'..'Z'|'_'|'$')
>>         ('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'$')*
>>     ;
>>
>>
>> // a numeric literal
>> NUM_INT
>>     {boolean isDecimal=false; Token t=null;}
>>     :   (NUM_3DIGIT '.' NUM_3DIGIT '.' NUM_3DIGIT '.' NUM_3DIGIT)=>
>>         (
>>             NUM_3DIGIT '.' NUM_3DIGIT '.' NUM_3DIGIT '.' NUM_3DIGIT
>>             { _ttype = IP_V4; }
>>         )
>>     |   (NUM_HEX_2DIGIT COLON NUM_HEX_2DIGIT COLON)=>
>>         (
>>             NUM_HEX_2DIGIT (COLON NUM_HEX_2DIGIT)+
>>             { _ttype = IP_V6; }
>>         )
>>  < T R U N K A T E D>
>>
>> protected NUM_HEX_2DIGIT: HEX_DIGIT (HEX_DIGIT)?
>>
>>
>>
>
>
>