[antlr-interest] Using long literal definitions in ANTLR

Thu Nov 23 16:35:04 PST 2006

I'm using an older version of ANTLR (2.7.2) to build a relatively simple
forms parser, but I've come across an issue that has stumped me for a
couple of days now.  I'm no expert on parsers, so I'm probably missing
something very obvious, and any help would be gratefully received.  The
issue stems from a set of long, similar literals in the form definition.

The grammar has a number of definitions for 'form' styles, including
"DOUBLE_THIN", "DOUBLE_THICK", "DOUBLE" and "DOUBLEUNDER".  To give
ANTLR enough lookahead to distinguish between these long literals I've
raised k to 11, but I'm still getting nondeterminism between the lexer
rules FRAMESTYLETYPE and STYLETYPE.  When the parser is run, I get an
exception in the lexer's generated nextToken() method.

For example, the parser may be parsing a STYLE line and expect one of
the STYLETYPE literals to follow, but because of the order of the rules
in the grammar, nextToken() tries FRAMESTYLETYPE first, matches the
DOUBLE prefix (as in DOUBLE_THIN) and throws an exception.  Changing the
rule order isn't an option, because it would throw an exception on the
FRAMESTYLE line instead.

I'm sure this is a problem because I've increased k, but I'm at a loss
for alternative strategies.

Cheers,

Andy

---------------------------------------------------------

Below is a simplified version of the grammar...

options
{
      language = "CSharp";
}

class TestParser extends Parser;

form:             (formbody)+ ;

formbody:         BEGIN formcontent END ;

formcontent:      (formentry)+ ;

formentry:        styleline
                  | framestyleline ;

styleline:        STYLE style1:STYLETYPE ;

framestyleline:   FRAMESTYLE style:FRAMESTYLETYPE ;

class TestLexer extends Lexer;

options
{
      k = 11;
      charVocabulary = '\3'..'\377';
      charVocabulary = '\u0000'..'\uFFFE';
}

BEGIN:            "BEGIN" ;

END:              "END" ;

STYLE:            "STYLE" ;

FRAMESTYLE:       "FRAMESTYLE" ;

FRAMESTYLETYPE:   "SINGLE_THIN" | "DOUBLE_THIN" | "SINGLE_THICK" |
                  "DOUBLE_THICK" | "DOTTED" ;

STYLETYPE:        "NORMAL" | "BOLD" | "ITALIC" | "UNDER" |
                  "DOUBLEUNDER" | "DOUBLE" | "TRIPLE" | "QUADRUPLE" |
                  "STRIKETHROUGH" | "ROTATE90" | "ROTATE270" |
                  "UPSIDEDOWN" | "PROPORTIONAL" | "DOUBLEHIGH" |
                  "TRIPLEHIGH" | "QUADRUPLEHIGH" | "CONDENSED" |
                  "SUPERSCRIPT" | "OVERSCORE" | "LETTERQUALITY" |
                  "NEARLETTERQUALITY" | "DOUBLESTRIKE" | "OPAQUE" ;

WS:               (' ' | '\t')+  { $setType(Token.SKIP); }
                  ;

NEWLINE
    :   '\r' '\n' { newline(); $setType(Token.SKIP); }
    |   '\n'      { newline(); $setType(Token.SKIP); }
    |   '\r'      { newline(); $setType(Token.SKIP); }
    ;
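
For comparison, one alternative that is commonly used in ANTLR 2 grammars
with many similar keywords is to lex a single generic word token and let
the literals table decide its type, so that k no longer has to cover the
longest literal.  The following is only a minimal, untested sketch under
some assumptions: the keywords move into the parser as string literals,
the rule names WORD, styletype and framestyletype are hypothetical and do
not appear in the grammar above, and the parser and lexer share one token
vocabulary (which should be the default when both are in the same file).

```
class TestParser extends Parser;

formbody:         "BEGIN" formcontent "END" ;

styleline:        "STYLE" styletype ;

framestyleline:   "FRAMESTYLE" framestyletype ;

// String literals used in parser rules are entered into the literals table.
styletype:        "NORMAL" | "BOLD" | "DOUBLE" | "DOUBLEUNDER"
                  /* ... remaining style names ... */ ;

framestyletype:   "SINGLE_THIN" | "DOUBLE_THIN" | "SINGLE_THICK" |
                  "DOUBLE_THICK" | "DOTTED" ;

class TestLexer extends Lexer;

options
{
      k = 2;                          // assumed sufficient: '\r' '\n' vs '\r'
      charVocabulary = '\3'..'\377';
}

// Hypothetical catch-all rule: match the whole word first, then look it
// up in the literals table to set the token type.
WORD
options { testLiterals = true; }
    :   ('A'..'Z' | '0'..'9' | '_')+
    ;
```

With this arrangement the lexer always consumes the complete word before
its type is chosen, so "DOUBLE", "DOUBLE_THIN" and "DOUBLEUNDER" should no
longer compete on a shared prefix; whether a given literal is legal at that
point is left to the parser.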

------------------------------------------------

      public new Token nextToken()              //throws TokenStreamException
      {
            ...
                  default:
                        if ((LA(1)=='D'||LA(1)=='S') && (LA(2)=='I'||LA(2)=='O') &&
                            (LA(3)=='N'||LA(3)=='T'||LA(3)=='U') &&
                            (LA(4)=='B'||LA(4)=='G'||LA(4)=='T') &&
                            (LA(5)=='E'||LA(5)=='L') && (LA(6)=='D'||LA(6)=='E') &&
                            (true) && (true) && (true) && (true) && (true))
                        {
                              mFRAMESTYLETYPE(true);
                              theRetToken = returnToken_;
                        }
                        else if ((tokenSet_0_.member(LA(1))) && (tokenSet_1_.member(LA(2))) &&
                                 (tokenSet_2_.member(LA(3))) && (tokenSet_3_.member(LA(4))) &&
                                 (true) && (true) && (true) && (true) && (true) && (true) && (true)) {
                              mSTYLETYPE(true);
                              theRetToken = returnToken_;
                        }
                        else if ((LA(1)=='B') && (LA(2)=='E') && (LA(3)=='G')) {
                              mBEGIN(true);
                              theRetToken = returnToken_;
                        }
                        else if ((LA(1)=='S') && (LA(2)=='T') && (LA(3)=='Y')) {
                              mSTYLE(true);
                              theRetToken = returnToken_;
