[antlr-interest] lexical nondeterminism Problem
Ric Klaren
klaren at cs.utwente.nl
Mon Mar 1 07:12:18 PST 2004
On Mon, Mar 01, 2004 at 02:52:42PM -0000, andreaszielke21 wrote:
> I'm trying to build a lexer that is able to differentiate between
> numbers and dates.
> I'm not proficient using antlr-style regular expressions so I'll give
> you a short description of the two tokens.
> The lexer should return a date token for anything that matches
> dd?.mm?.yyyy, e.g. 1.1.1900 or 05.12.2004.
> The lexer should return a number token for anything that matches
> x+(,x+)?, e.g. 3,14 or 42.
>
> My input file for antlr is this:
> ----------snip-------------------
> class TestLexer extends Lexer;
> options {
> k=3;
> }
>
> DATE : '0'..'9'('0'..'9')? '.' '0'..'9' ('0'..'9')? '.'
> ('0'..'9''0'..'9''0'..'9''0'..'9')
> { System.out.println("Found DATE:" + getText()); }
> ;
>
> NUM : ('0'..'9')+(','('0'..'9')+)?
> { System.out.println("Found NUM :" + getText()); }
> ;
> ----------snip-------------------
>
> I can understand that a lexer with a lookahead of one or two would
> not be able to differentiate between the two token-types, but even
> with a lookahead of three the antlr-Tool gives me the following error
> message:
> antlr:
> [antlr] ANTLR Parser Generator Version 2.7.2 1989-2003
> jGuru.com
> [antlr] C:\PROGRA~1\eclipse\WORKSP~1\PLANKO~2
> \src\java\ch\forumedia\rpkc\blah.g: warning:lexical nondeterminism
> between rules DATE and NUM upon
> [antlr] C:\PROGRA~1\eclipse\WORKSP~1\PLANKO~2
> \src\java\ch\forumedia\rpkc\blah.g: k==1:'0'..'9'
> [antlr] C:\PROGRA~1\eclipse\WORKSP~1\PLANKO~2
> \src\java\ch\forumedia\rpkc\blah.g: k==2:'0'..'9'
> [antlr] C:\PROGRA~1\eclipse\WORKSP~1\PLANKO~2
> \src\java\ch\forumedia\rpkc\blah.g: k==3:'0'..'9'
>
> Why is the inputfile nondeterministic? If the third character is a
> digit, it cannot be a DATE-Token!?!
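For a lexer, all the non-protected rules are combined into one rule/method
(nextToken), so the different non-protected rules should not share the same
lookahead. Note that "1.1.1900" does have a digit as its third character, and
that ANTLR 2 compares the set of possible characters at each lookahead depth
separately rather than whole three-character sequences, so DATE and NUM
overlap on '0'..'9' at k==1, k==2 and k==3. You can solve this by
left-factoring: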
DATE_OR_NUM :
    ( '0'..'9' )+ { $setType(NUM); }
    (   { $getText.length() <= 2 }? '.' '0'..'9' ('0'..'9')? '.'
        ('0'..'9' '0'..'9' '0'..'9' '0'..'9')
        { $setType(DATE); System.out.println("Found DATE:" + getText()); }
    |
        (',' ('0'..'9')+)?
        { System.out.println("Found NUM :" + getText()); }
    )
    ;
I did not test this, so it might contain an error or two. I am also not sure
whether the length check on $getText works inside a predicate; you might have
to do that check in a normal action instead.
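To make the left-factoring idea concrete, here is a rough hand-written Java sketch of the decision the combined rule has to make: consume the shared digit prefix first, then branch on the character that follows. This is not what ANTLR generates; the names DateOrNumScanner, Kind, Token and scan are made up for illustration, and falling back to NUM on a malformed date is a simplification of mine (the generated lexer would report an error there).

```java
// Hypothetical hand-written equivalent of the left-factored DATE_OR_NUM rule.
// All names here (DateOrNumScanner, Kind, Token, scan) are illustrative only.
final class DateOrNumScanner {
    enum Kind { NUM, DATE }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    private static boolean isDigit(String s, int i) {
        return i < s.length() && s.charAt(i) >= '0' && s.charAt(i) <= '9';
    }

    // Consume up to `max` digits starting at `i` and return the new position.
    private static int digits(String s, int i, int max) {
        int n = 0;
        while (n < max && isDigit(s, i)) { i++; n++; }
        return i;
    }

    /** Scans one token starting at `start`; returns null if no digit is there. */
    static Token scan(String s, int start) {
        int i = digits(s, start, Integer.MAX_VALUE);   // shared prefix: ('0'..'9')+
        if (i == start) return null;
        int prefixLen = i - start;

        // DATE branch: only viable after one or two digits followed by '.'
        if (prefixLen <= 2 && i < s.length() && s.charAt(i) == '.') {
            int m = digits(s, i + 1, 2);                // month: one or two digits
            if (m > i + 1 && m < s.length() && s.charAt(m) == '.') {
                int y = digits(s, m + 1, 4);            // year: exactly four digits
                if (y == m + 5) return new Token(Kind.DATE, s.substring(start, y));
            }
            // Simplification (assumption): a malformed date falls back to the
            // digit prefix as a NUM instead of raising a lexer error.
            return new Token(Kind.NUM, s.substring(start, i));
        }

        // NUM branch: optional ',' followed by at least one digit
        if (i < s.length() && s.charAt(i) == ',' && isDigit(s, i + 1)) {
            i = digits(s, i + 1, Integer.MAX_VALUE);
        }
        return new Token(Kind.NUM, s.substring(start, i));
    }
}
```

The point is that the choice between the two token types is postponed until after the common prefix, so no lookahead is needed to tell the rules apart at the start of the token.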
An alternative solution is to use a syntactic predicate (it performs worse but
is more readable):
DATE_OR_NUM :
    (   ('0'..'9' ('0'..'9')? '.') =>
        '0'..'9' ('0'..'9')? '.' '0'..'9' ('0'..'9')? '.'
        ('0'..'9' '0'..'9' '0'..'9' '0'..'9')
        { $setType(DATE); System.out.println("Found DATE:" + getText()); }
    |   ( '0'..'9' )+
        (',' ('0'..'9')+)?
        { $setType(NUM); System.out.println("Found NUM :" + getText()); }
    )
    ;
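As a rough picture of what the predicate version does, the following Java sketch checks the prefix speculatively without consuming anything and only then commits to an alternative. It is a hand-written illustration, not ANTLR output; PredicateSketch, looksLikeDate and choose are hypothetical names.

```java
// Hypothetical sketch of the syntactic-predicate approach: peek at the
// input without consuming it, then pick the alternative to match for real.
final class PredicateSketch {
    private static boolean isDigit(String s, int i) {
        return i < s.length() && s.charAt(i) >= '0' && s.charAt(i) <= '9';
    }

    /**
     * Mirrors the predicate ('0'..'9' ('0'..'9')? '.') => : true if the input
     * at `pos` starts with one or two digits followed by '.'.
     * Works on a local index only, so the caller's position is untouched.
     */
    static boolean looksLikeDate(String s, int pos) {
        int i = pos;
        if (!isDigit(s, i)) return false;
        i++;
        if (isDigit(s, i)) i++;                 // optional second digit
        return i < s.length() && s.charAt(i) == '.';
    }

    /** Returns which alternative the predicate selects for the input at `pos`. */
    static String choose(String s, int pos) {
        return looksLikeDate(s, pos) ? "DATE" : "NUM";
    }
}
```

The performance cost comes from the prefix being examined twice: once by the speculative check and once more when the chosen alternative is matched.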
Check the FAQs on jGuru.com for more on this topic.
HTH,
Ric
--
-----+++++*****************************************************+++++++++-------
---- Ric Klaren ----- j.klaren at utwente.nl ----- +31 53 4893722 ----
-----+++++*****************************************************+++++++++-------
"Good judgement comes from experience.
Experience comes from bad judgement." --- Unknown