[antlr-interest] lexical nondeterminism Problem

Ric Klaren klaren at cs.utwente.nl
Mon Mar 1 07:12:18 PST 2004


On Mon, Mar 01, 2004 at 02:52:42PM -0000, andreaszielke21 wrote:
> I'm trying to build a lexer that is able to differentiate between 
> numbers and dates.
> I'm not proficient using antlr-style regular expressions so I'll give 
> you a short description of the two tokens.
> The lexer should return a date token for anything that matches 
> dd?.mm?.yyyy i.e. 1.1.1900 or 05.12.2004.
> The lexer should return a number token for anything that matches
> x+(,x+)? i.e. 3,14 or 42.
> 
> My input file for antlr is this:
> ----------snip-------------------
> class TestLexer extends Lexer;
> options {
>   k=3;
> }
> 
> DATE  :	'0'..'9'('0'..'9')? '.' '0'..'9' ('0'..'9')? '.' 
> ('0'..'9''0'..'9''0'..'9''0'..'9')
>       { System.out.println("Found DATE:" + getText()); }
>       ;
> 
> NUM   : ('0'..'9')+(','('0'..'9')+)?
>       { System.out.println("Found NUM :" + getText()); }
>       ;
> ----------snip-------------------
> 
> I can understand that a lexer with a lookahead of one or two would 
> not be able to differentiate between the two token-types, but even 
> with a lookahead of three the antlr-Tool gives me the following error 
> message:
> antlr:
>     [antlr] ANTLR Parser Generator   Version 2.7.2   1989-2003 
> jGuru.com
>     [antlr] C:\PROGRA~1\eclipse\WORKSP~1\PLANKO~2
> \src\java\ch\forumedia\rpkc\blah.g: warning:lexical nondeterminism 
> between rules DATE and NUM upon
>     [antlr] C:\PROGRA~1\eclipse\WORKSP~1\PLANKO~2
> \src\java\ch\forumedia\rpkc\blah.g:     k==1:'0'..'9'
>     [antlr] C:\PROGRA~1\eclipse\WORKSP~1\PLANKO~2
> \src\java\ch\forumedia\rpkc\blah.g:     k==2:'0'..'9'
>     [antlr] C:\PROGRA~1\eclipse\WORKSP~1\PLANKO~2
> \src\java\ch\forumedia\rpkc\blah.g:     k==3:'0'..'9'
> 
> Why is the inputfile nondeterministic? If the third character is a 
> digit, it cannot be a DATE-Token!?!

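For a lexer, all the non-protected rules are combined into one rule/method
(nextToken), so the different non-protected rules should not share the same
lookahead. ANTLR 2 compares the lookahead sets depth by depth rather than as
whole three-character sequences, and both DATE and NUM can see a digit at
k==1, k==2 and k==3, hence the warning. You can solve this by left-factoring: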

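DATE_OR_NUM :
	( '0'..'9' )+ { $setType(NUM); }	// shared digit prefix; assume NUM
	(	// at most two digits so far and a '.' follows: this is a DATE
		{ $getText.length() <= 2 }? '.' '0'..'9' ('0'..'9')? '.'
		('0'..'9' '0'..'9' '0'..'9' '0'..'9')
		{ $setType(DATE); System.out.println("Found DATE:" + getText()); }
	|	// otherwise an optional fractional part: still a NUM
		(',' ('0'..'9')+ )?
		{ System.out.println("Found NUM :" + getText()); }
	)
	;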
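
To make the left-factoring idea concrete, here is a rough hand-written Java
equivalent of the decision the rule above makes: consume the shared digit
prefix first, then branch on what follows. This is only a sketch, not what
ANTLR generates; the class name DateOrNum and the helpers classify and
isDigit are made up for illustration, and it classifies one complete lexeme
instead of reading from a character stream.

```java
class DateOrNum {
    enum Type { DATE, NUM }

    // Mirrors the grammar's '0'..'9' range (ASCII digits only).
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Hypothetical helper: returns DATE, NUM, or null if the lexeme is neither.
    static Type classify(String s) {
        int i = 0;
        // Shared prefix: one or more digits (the left-factored part).
        while (i < s.length() && isDigit(s.charAt(i))) i++;
        int prefixLen = i;
        if (prefixLen == 0) return null;

        // Branch 1: at most two digits followed by '.', so commit to DATE.
        if (prefixLen <= 2 && i < s.length() && s.charAt(i) == '.') {
            i++;
            int monthStart = i;
            while (i < s.length() && i - monthStart < 2 && isDigit(s.charAt(i))) i++;
            if (i == monthStart) return null;              // month needs a digit
            if (i >= s.length() || s.charAt(i) != '.') return null;
            i++;
            int yearStart = i;
            while (i < s.length() && isDigit(s.charAt(i))) i++;
            // Year must be exactly four digits and end the lexeme.
            return (i - yearStart == 4 && i == s.length()) ? Type.DATE : null;
        }

        // Branch 2: optional ',' followed by one or more digits, so NUM.
        if (i < s.length() && s.charAt(i) == ',') {
            i++;
            int fracStart = i;
            while (i < s.length() && isDigit(s.charAt(i))) i++;
            if (i == fracStart) return null;               // ',' needs digits
        }
        return i == s.length() ? Type.NUM : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1.1.1900", "05.12.2004", "3,14", "42"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

The point is that no decision is made until the common prefix has been read,
so the two token types never compete on the same lookahead.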

I did not test this, so it might contain an error or two. I am also not sure
whether the $getText length trick works inside a semantic predicate; the check
might have to be done in a normal action instead.

An alternative solution is to use a syntactic predicate (it performs worse
but is more readable):

DATE_OR_NUM :
	(	// predicate: one or two digits followed by '.' means DATE
		('0'..'9' ('0'..'9')? '.') =>
		'0'..'9' ('0'..'9')? '.' '0'..'9' ('0'..'9')? '.'
		('0'..'9' '0'..'9' '0'..'9' '0'..'9')
		{ $setType(DATE); System.out.println("Found DATE:" + getText()); }
	|	// everything else that starts with a digit is a NUM
		( '0'..'9' )+
		(',' ('0'..'9')+ )?
		{ $setType(NUM); System.out.println("Found NUM :" + getText()); }
	)
	;
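
As a rough illustration of what the predicate does, the Java sketch below
first scans ahead without consuming anything and only then picks an
alternative. It is not ANTLR's generated code; the names PredicateSketch,
looksLikeDate and tokenType are invented here, and plain regular expressions
stand in for the real rule bodies.

```java
class PredicateSketch {
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Stands in for the predicate ('0'..'9' ('0'..'9')? '.') =>
    // It only inspects the input from pos; nothing is consumed.
    static boolean looksLikeDate(String s, int pos) {
        int i = pos;
        if (i >= s.length() || !isDigit(s.charAt(i))) return false;
        i++;
        if (i < s.length() && isDigit(s.charAt(i))) i++;   // optional 2nd digit
        return i < s.length() && s.charAt(i) == '.';
    }

    // Hypothetical helper: classifies one complete lexeme.
    static String tokenType(String s) {
        if (looksLikeDate(s, 0)) {
            // Predicate succeeded: committed to the DATE alternative, so a
            // later mismatch is an error and does not fall back to NUM.
            return s.matches("[0-9][0-9]?\\.[0-9][0-9]?\\.[0-9]{4}") ? "DATE" : "ERROR";
        }
        return s.matches("[0-9]+(,[0-9]+)?") ? "NUM" : "ERROR";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1.1.1900", "3,14", "42", "12.x"}) {
            System.out.println(s + " -> " + tokenType(s));
        }
    }
}
```

Note that the date prefix is scanned twice, once by the lookahead check and
once by the real match, which is where the extra cost of the predicate
version comes from.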

Check the FAQs on jGuru.com for more on this topic.

HTH,

Ric
-- 
-----+++++*****************************************************+++++++++-------
    ---- Ric Klaren ----- j.klaren at utwente.nl ----- +31 53 4893722  ----
-----+++++*****************************************************+++++++++-------
  "Good judgement comes from experience.
     Experience comes from bad judgement." --- Unknown



 