[antlr-interest] Lexer not as smart as it should be?

Gerrit E.G. Hobbelt Ger.Hobbelt at bermuda-holding.com
Thu Oct 20 17:50:46 PDT 2005


Take as an example the input

4

which is a single INT.
To which token should the lexer resolve this one? It matches the rule (INT)+
as it is 1(one) occurrence of INT, so it's an INTEGER, but OTOH, it also
matches INT (INT (INT)?)? as it is a single INT, thus it's an IPSEG.

same for an input of two INTs and an input of three INTs: there's no way
this set of rules uniquely identifies the token to pick. So antlr yaks about
it by yelling nondeterminism ;-)

A bit of a hack (imho) is to define an INTEGER token to equal an IPSEG (one
to 3 digits) or a collection of 4 or more digits. You might try something
like this:

INTEGER: IPSEG (INT)* ;

but then you should consider how and where you wish to validate the 'IPSEG'
range of acceptable values (0 to 255 inclusive for IPv4), so, when you
choose to validate the IP number using a bit of code you might ultimately
reduce the ruleset to

IP :
    INTEGER '.' INTEGER '.' INTEGER '.' INTEGER '/' INTEGER
  ;

which in itself would still be valid when you upgrade your implementation to
IPv6 support ;-)


Anyway, when you like to strictly differentiate between IPSEG and INTEGER in
your rule-set, you'll have to define a complementing definition for INTEGER
to give the lexer a shot at uniquely identfying the token at hand, so if you
used

IPSEG
  :
    INT
  | (INT INT)
  | ('0'..'2' '0'..'5' INT)
  ;

it'd require this extension/complement for an INTEGER ==>

INTEGER:
    IPSEG // up to 3 digits: '25x'
    | ('0'..'2' '6'..'9' INT) // the other 3-digit combo's
    | ('3'..'9' INT INT)
    | INT INT INT INT+ ; // 4 or more digits


It's rather late now, so I might have erred. Hope it helps anyway,

Ger



----- Original Message -----
From: Jon Schewe
To: ANTLR Interest
Sent: Friday, 21 October 2005 01:03
Subject: [antlr-interest] Lexer not as smart as it should be?


I'm trying to handle IP addresses in my lexer.  Below are the important
rules.


INTEGER
  :
    ( INT )+
  ;

protected
INT
  :
    '0'..'9'
  ;

protected
IPSEG
  :
    INT (INT (INT)? )?
  ;

IP :
    IPSEG '.' IPSEG '.' IPSEG '.' IPSEG '/' INT (INT)?
  ;


This yields these errors:
ANTLR Parser Generator   Version 2.7.4   1989-2004 jGuru.com
/net/users/jschewe/projects/argus/working-dir/java/src/scyllarus/verter/snor
t/snort_rule.g: warning:lexical nondeterminism between rules INTEGER and IP
upon
/net/users/jschewe/projects/argus/working-dir/java/src/scyllarus/verter/snor
t/snort_rule.g:     k==1:'0'..'9'
/net/users/jschewe/projects/argus/working-dir/java/src/scyllarus/verter/snor
t/snort_rule.g:     k==2:'0'..'9'
/net/users/jschewe/projects/argus/working-dir/java/src/scyllarus/verter/snor
t/snort_rule.g:     k==3:'0'..'9'
/net/users/jschewe/projects/argus/working-dir/java/src/scyllarus/verter/snor
t/snort_rule.g:     k==4:'0'..'9'

Why does the lexer even consider IP after the fourth number?  Since IPSEG
can only be a maximum of 3 integers?

I even tried using this rule for IPSEG and it still can't figure it out.
protected
IPSEG
  :
    INT
  | (INT INT)
  | ('0'..'2' '0'..'5' INT)
  ;





Jon Schewe | http://www.mtu.net/~jpschewe
If you see a signature.asc file attached to the message this is my digital
signature.
See http://www.gnupg.org for more information.

For I am convinced that neither death nor life, neither angels
nor demons, neither the present nor the future, nor any
powers, neither height nor depth, nor anything else in all
creation, will be able to separate us from the love of God that
is in Christ Jesus our Lord. - Romans 8:38-39



More information about the antlr-interest mailing list