[antlr-interest] Ambiguity error in lexer generation

Alex Kinneer kinneera at hotmail.com
Thu Sep 20 09:51:31 PDT 2007

> It is likely that they ARE consistently reported, but they are not consistently making it to the screen in the NetBeans IDE output – why don’t you try running it from the command line? Also, are you sure you are not making any changes at all to the lexer?

I'm editing the grammar in a text editor and running antlr from a command line (Linux/bash). I was never using NetBeans. So if the output just isn't getting flushed somewhere, it is definitely a problem somewhere in the output handling of antlr itself.

> Many times, you may feel that you have disambiguated with a synpred, but you haven’t – this is especially the case if your lexer rule has only one alternative – antlr will decide that, as there is only one alternative, there is no point in using the synpred, and will ignore it. You need to combine the things you are trying to disambiguate into one rule with a common lead-in, and then you probably won’t need the synpred anyway.

The rule for UNQUOTED_STRING has a semantic predicate, not a syntactic predicate. I'll explain a bit more below.

> It sounds to me like your rule for unquoted string is the same as the one for identifier, but unless you publish it for us, we can’t really help beyond that.
>
> Jim

Below is the grammar (I have mangled some of the names in the grammar rules, but I have confirmed that it produces the same warnings/errors). It is only a fragment of the intended final grammar, but it's enough to illustrate where I got stuck. The UNQUOTED_STRING rule is only there to handle legacy input. The idea is that certain tokens bump up the counter to permit a certain number of subsequent tokens (usually only one) to be read in as UNQUOTED_STRING. Really, wherever this is done, the language should have just required quoted strings, but it needs to support existing files (at least for now), which may not have used that convention. Everything else is pretty routine (at least to my understanding): some keywords, which I thought antlr would ensure take precedence over JAVA_IDs (borrowed from the Java 1.5 grammar).
 
grammar TestLang;

@lexer::members {
    private int anyStringMatchCnt = 0;
}

///////////////////////////////////////////////////////////////////////////////
// Parser

parse
    : kanseicnwltHnsdt+ ;

kanseicnwltHnsdt
    : ('+' | '!') jwtntDecl
    ;

jwtntDecl
    : nyg_saaevd__decl
    | cnsiwnvhw_saaevd__decl
    | cnsiwnvhw_ansucj__decl
    | gwo_sxnslw__decl
    | pbh_sxnslw__decl
    | gwo_nsucn__decl
    | pbh_nsucn__decl
    | cnsiwnvhwbn_vuyw__decl
    | sxnslw_vuyw__decl
    | veiqncb_vuyw__decl
    | iiwnvcxbw_vuyw__decl
    | veiqncb_pwjncs_izsns__decl
    | veiqncb_pwjncs_anse__decl
    | sxnslw_pwjncs_izsns__decl
    | sxnslw_pwjncs_anse__decl
    | musnvbs_runslkw__decl
    | musnvbs_lakwent__decl
    | musnvbs_flr_paisndg__decl
    | musnvbs_paisndg__decl
    | tlshn__decl
    | ccjed__decl
    | sxnslw_yshf_izsns__decl
    ;

nyg_saaevd__decl
    : 'nyg_saaevd'
    ;

cnsiwnvhw_saaevd__decl
    : 'cnsiwnvhw_xlsjke'
    ;

cnsiwnvhw_ansucj__decl
    : 'cnsiwnvhw_ansucj'
    ;

gwo_sxnslw__decl
    : 'gwo_sxnslw'
    ;

pbh_sxnslw__decl
    : 'pbh_sxnslw'
    ;

gwo_nsucn__decl
    : 'gwo_nsucn'
    ;

pbh_nsucn__decl
    : 'pbh_nsucn'
    ;

cnsiwnvhwbn_vuyw__decl
    : 'cnsiwnvhwbn_vuyw'
    ;

sxnslw_vuyw__decl
    : 'sxnslw_vuyw'
    ;

veiqncb_vuyw__decl
    : 'veiqncb_vuyw'
    ;

iiwnvcxbw_vuyw__decl
    : 'iiwnvcxbw_vuyw'
    ;

veiqncb_pwjncs_izsns__decl
    : 'veiqncb_pwjncs_izsns'
    ;

veiqncb_pwjncs_anse__decl
    : 'veiqncb_pwjncs_anse'
    ;

sxnslw_pwjncs_izsns__decl
    : 'sxnslw_pwjncs_izsns'
    ;

sxnslw_pwjncs_anse__decl
    : 'sxnslw_pwjncs_anse'
    ;

musnvbs_runslkw__decl
    : 'musnvbs_runslkw'
    ;

musnvbs_lakwent__decl
    : 'musnvbs_lakwent'
    ;

musnvbs_flr_paisndg__decl
    : 'musnvbs_flr_paisndg'
    ;

musnvbs_paisndg__decl
    : 'musnvbs_paisndg'
    ;

tlshn__decl
    : 'tlshn'
    ;

ccjed__decl
    : 'ccjed'
    ;

sxnslw_yshf_izsns__decl
    : 'sxnslw_yshf_izsns'
    ;


///////////////////////////////////////////////////////////////////////////////
// Lexer

WS  : WS_CHARS {$channel=HIDDEN;} ;

fragment
WS_CHARS : (' ' | '\t' | '\u000C' | '\n') ;

LINE_COMMENT
    : '//' ~('\n' | '\r')* '\r'? '\n' {$channel=HIDDEN;} ;

fragment
LETTER
    : '\u0024' |
      '\u0041'..'\u005a' |
      '\u005f' |
      '\u0061'..'\u007a' |
      '\u00c0'..'\u00d6' |
      '\u00d8'..'\u00f6' |
      '\u00f8'..'\u00ff' |
      '\u0100'..'\u1fff' |
      '\u3040'..'\u318f' |
      '\u3300'..'\u337f' |
      '\u3400'..'\u3d2d' |
      '\u4e00'..'\u9fff' |
      '\uf900'..'\ufaff'
    ;

fragment
JAVA_ID_DIGIT
    : '\u0030'..'\u0039' |
      '\u0660'..'\u0669' |
      '\u06f0'..'\u06f9' |
      '\u0966'..'\u096f' |
      '\u09e6'..'\u09ef' |
      '\u0a66'..'\u0a6f' |
      '\u0ae6'..'\u0aef' |
      '\u0b66'..'\u0b6f' |
      '\u0be7'..'\u0bef' |
      '\u0c66'..'\u0c6f' |
      '\u0ce6'..'\u0cef' |
      '\u0d66'..'\u0d6f' |
      '\u0e50'..'\u0e59' |
      '\u0ed0'..'\u0ed9' |
      '\u1040'..'\u1049'
    ;

UNQUOTED_STRING
    : {TestLangLexer.this.anyStringMatchCnt > 0}?
      (~('"' | WS_CHARS))+
      {TestLangLexer.this.anyStringMatchCnt--;}
    ;

JAVA_ID
    : LETTER (LETTER | JAVA_ID_DIGIT)*
    ;
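
To make the intent of the counter-gated UNQUOTED_STRING rule concrete, here is a minimal hand-written sketch in plain Java. It is not ANTLR-generated code and says nothing about how ANTLR analyzes the grammar. The class name GatedLexerSketch, the choice of 'tlshn' as the token that opens the gate, the token labels, and the decision that an open gate wins over keywords and identifiers are all assumptions made for illustration; the fragment above does not show where the counter is incremented. Identifier matching is approximated with Character.isJavaIdentifierStart/Part rather than the LETTER and JAVA_ID_DIGIT ranges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class GatedLexerSketch {
    // Hypothetical keyword table: two of the mangled keywords from the grammar.
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("tlshn", "ccjed"));
    // Hypothetical: keywords after which one unquoted string is permitted.
    private static final Set<String> TRIGGERS =
            new HashSet<>(Arrays.asList("tlshn"));

    /** Splits on whitespace and labels each word, honoring the gate counter. */
    public static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int anyStringMatchCnt = 0; // plays the role of the lexer member
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input yields one empty word
            }
            if (anyStringMatchCnt > 0 && word.indexOf('"') < 0) {
                // Gate open: any run without '"' or whitespace is accepted.
                anyStringMatchCnt--;
                out.add("UNQUOTED_STRING(" + word + ")");
            } else if (KEYWORDS.contains(word)) {
                out.add("KEYWORD(" + word + ")");
                if (TRIGGERS.contains(word)) {
                    anyStringMatchCnt = 1; // allow exactly one following string
                }
            } else if (isJavaId(word)) {
                out.add("JAVA_ID(" + word + ")");
            } else {
                out.add("ERROR(" + word + ")");
            }
        }
        return out;
    }

    // Approximation of JAVA_ID using the JDK's identifier character classes.
    private static boolean isJavaId(String s) {
        if (!Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("tlshn a-b.c ccjed foo"));
    }
}
```

With the input in main, 'a-b.c' is accepted only because the preceding 'tlshn' opened the gate; the same text with the gate closed is labeled ERROR, since it is neither a keyword nor an identifier.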
 
 
The full error output I get (sometimes) is the following:
 
ANTLR Parser Generator Version 3.0.1 (August 13, 2007) 1989-2007
warning(205): TestLang.g:1:8: ANTLR could not analyze this decision in rule Tokens; often this is because of recursive rule references visible from the left edge of alternatives. ANTLR will re-analyze the decision with a fixed lookahead of k=1. Consider using "options {k=1;}" for that decision and possibly adding a syntactic predicate.
warning(209): TestLang.g:20:1: Multiple token rules can match input such as "'v'": T22, T24, T25, UNQUOTED_STRING, JAVA_ID
As a result, tokens(s) JAVA_ID,UNQUOTED_STRING,T24,T25 were disabled for that input
warning(209): TestLang.g:13:1: Multiple token rules can match input such as "'g'": T16, T18, UNQUOTED_STRING, JAVA_ID
As a result, tokens(s) JAVA_ID,UNQUOTED_STRING,T18 were disabled for that input
warning(209): TestLang.g:28:1: Multiple token rules can match input such as "'c'": T14, T15, T20, T33, UNQUOTED_STRING, JAVA_ID
As a result, tokens(s) JAVA_ID,T20,UNQUOTED_STRING,T15,T33 were disabled for that input
warning(209): TestLang.g:22:1: Multiple token rules can match input such as "'s'": T21, T26, T27, T34, UNQUOTED_STRING, JAVA_ID
As a result, tokens(s) JAVA_ID,UNQUOTED_STRING,T26,T34,T27 were disabled for that input
warning(209): TestLang.g:178:1: Multiple token rules can match input such as "'t'": T32, UNQUOTED_STRING, JAVA_ID
As a result, tokens(s) JAVA_ID,UNQUOTED_STRING were disabled for that input
warning(209): TestLang.g:178:1: Multiple token rules can match input such as "'n'": T13, UNQUOTED_STRING, JAVA_ID
As a result, tokens(s) JAVA_ID,UNQUOTED_STRING were disabled for that input
warning(209): TestLang.g:178:1: Multiple token rules can match input such as "'i'": T23, UNQUOTED_STRING, JAVA_ID
As a result, tokens(s) JAVA_ID,UNQUOTED_STRING were disabled for that input
warning(209): TestLang.g:14:1: Multiple token rules can match input such as "'p'": T17, T19, UNQUOTED_STRING, JAVA_ID
As a result, tokens(s) JAVA_ID,UNQUOTED_STRING,T19 were disabled for that input
warning(209): TestLang.g:26:1: Multiple token rules can match input such as "'m'": T28, T29, T30, T31, UNQUOTED_STRING, JAVA_ID
As a result, tokens(s) JAVA_ID,UNQUOTED_STRING,T30,T29,T31 were disabled for that input
warning(208): TestLang.g:29:1: The following token definitions are unreachable: T15,T18,T19,T20,T24,T25,T26,T27,T29,T30,T31,T33,T34
 
Even more suspicious is that if I start removing the grammar productions that contain just keywords, below some number of remaining productions, it doesn't seem to generate the warnings anymore. There's no apparent correlation with the rules themselves -- it just seems to matter how many of them there are (and only some of the time). 
 

 


From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Alex Kinneer
Sent: 19 September 2007 18:55
To: Loring Craymer; antlr-interest at antlr.org
Subject: Re: [antlr-interest] Ambiguity error in lexer generation
 
I understand the warnings (at least I'm pretty sure I do), but I'm still not clear on why they are being reported, and more importantly why they are not being reported consistently. What I am trying to emphasize is that if I run antlr on the same grammar file multiple times, I sometimes get these warnings, and sometimes don't. And that seems like a bug to me. Either the lexer rules are ambiguous, or they aren't, right? So why would it sometimes say they are and sometimes not, when I'm just running antlr on the exact same grammar?

More importantly, I don't think the lexer rules even are ambiguous, except to the extent that antlr advertises it can resolve automatically without warning. For example, as best I can tell, the grammar doesn't specify anything more ambiguous than what the Java 1.5 grammar for antlr 3.0 does (e.g. how does antlr distinguish the keyword 'class' from an Identifier in that grammar), yet the Java 1.5 grammar doesn't seem to produce any warnings (at least in that regard).

-Alex
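
For comparison, one common hand-written way to settle the keyword-versus-identifier overlap raised above is to match an identifier-shaped lexeme once and then reclassify it through a keyword table. The sketch below shows only that technique; it is not a description of what ANTLR 3 does internally or of how the Java 1.5 grammar is analyzed. The class name KeywordOrIdSketch and the token type names are hypothetical, and the two keywords are taken from the mangled grammar purely as examples.

```java
import java.util.HashMap;
import java.util.Map;

final class KeywordOrIdSketch {
    // Hypothetical table mapping keyword text to a token type name.
    private static final Map<String, String> KEYWORD_TYPES = new HashMap<>();
    static {
        KEYWORD_TYPES.put("gwo_sxnslw", "GWO_SXNSLW");
        KEYWORD_TYPES.put("pbh_sxnslw", "PBH_SXNSLW");
    }

    /**
     * Classifies a lexeme that has already been matched in full as an
     * identifier: an exact keyword hit wins, anything else is JAVA_ID.
     */
    public static String classify(String lexeme) {
        return KEYWORD_TYPES.getOrDefault(lexeme, "JAVA_ID");
    }

    public static void main(String[] args) {
        System.out.println(classify("gwo_sxnslw"));  // exact keyword
        System.out.println(classify("gwo_sxnslwx")); // longer lexeme, not a keyword
    }
}
```

Because the whole lexeme is matched before the lookup, a keyword that is merely a prefix of a longer name never shadows the identifier, and no lookahead decision between the keyword rules and the identifier rule is needed.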
 