[antlr-interest] Using long literal definitions in ANTLR
Andrew Monaghan
andrew.monaghan at asktheobvious.com
Thu Nov 23 16:35:04 PST 2006
I'm using an older version of ANTLR (2.7.2) to build a relatively simple
forms parser, but I've come across an issue that has stumped me for a
couple of days now. I'm no expert on parsers, so I'm probably missing
something very obvious, and any help would be gratefully received. The
issue stems from a set of long, similar literals in the form definition.
The grammar has a number of definitions for 'form' style, including
"DOUBLE_THIN", "DOUBLE_THICK", "DOUBLE" and "DOUBLEUNDER". To
accommodate the long literals and allow ANTLR to distinguish
between them, I've upped k to 11, but I'm still getting
nondeterminism between rules FRAMESTYLETYPE and STYLETYPE, and when the
parser is run I get an exception in the generated nextToken() method
in the lexer.
For example, the parser may be parsing a STYLE line and expect one of
the STYLETYPE literals to follow, but because of the order of the rules
in the grammar, nextToken() first matches DOUBLE against FRAMESTYLETYPE
(as the start of DOUBLE_THIN) and throws an exception. Changing the rule
order isn't an option, because the exception would then be thrown on the
FRAMESTYLE line instead.
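To make the mismatch concrete, here is a minimal sketch in Python rather than ANTLR. It assumes that the first test in the generated nextToken() (shown further down) amounts to an independent per-position character check built from the FRAMESTYLETYPE literals. That is only my reading of the generated code, and the helper names are hypothetical.

```python
# Literals copied from the FRAMESTYLETYPE rule in the grammar.
FRAMESTYLETYPE = ["SINGLE_THIN", "DOUBLE_THIN", "SINGLE_THICK",
                  "DOUBLE_THICK", "DOTTED"]


def position_sets(literals, depth):
    """For each position 1..depth, collect the characters any literal has there."""
    return [{lit[i] for lit in literals if i < len(lit)} for i in range(depth)]


def passes_prediction(text, sets):
    """Hypothetical stand-in for the LA(1)..LA(6) test: every position is
    checked on its own, so characters from different literals can be mixed."""
    return len(text) >= len(sets) and all(
        ch in allowed for ch, allowed in zip(text, sets)
    )


# Six positions, matching the six non-trivial LA(n) checks in the generated code.
SETS = position_sets(FRAMESTYLETYPE, 6)

# "DOUBLE" is a STYLETYPE literal, yet it satisfies the FRAMESTYLETYPE test,
# even though no FRAMESTYLETYPE literal can actually be completed from it.
double_passes = passes_prediction("DOUBLE", SETS)
matches_literal = any("DOUBLE\n".startswith(lit) for lit in FRAMESTYLETYPE)
print(double_passes, matches_literal)
```

If that reading is right, the input DOUBLE passes the FRAMESTYLETYPE test whatever the value of k, because the sixth position already accepts 'E' (from DOUBLE_THIN) and every deeper position is (true).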
I'm sure this is a problem because I've increased k, but I'm at a loss
for any alternative strategies.
Cheers,
Andy
---------------------------------------------------------
Below is a simplified version of the grammar...
options
{
    language = "CSharp";
}

class TestParser extends Parser;

form: (formbody)+ ;
formbody: BEGIN formcontent END ;
formcontent: (formentry)+ ;
formentry : styleline
          | framestyleline;
styleline: STYLE style1:STYLETYPE;
framestyleline: FRAMESTYLE style:FRAMESTYLETYPE;

class TestLexer extends Lexer;
options
{
    k = 11;
    charVocabulary = '\3'..'\377';
    charVocabulary = '\u0000'..'\uFFFE';
}

BEGIN: "BEGIN" ;
END: "END" ;
STYLE: "STYLE" ;
FRAMESTYLE: "FRAMESTYLE" ;

FRAMESTYLETYPE: "SINGLE_THIN" | "DOUBLE_THIN" | "SINGLE_THICK" |
                "DOUBLE_THICK" | "DOTTED" ;

STYLETYPE: "NORMAL" | "BOLD" | "ITALIC" | "UNDER" |
           "DOUBLEUNDER" | "DOUBLE" | "TRIPLE" | "QUADRUPLE" |
           "STRIKETHROUGH" | "ROTATE90" | "ROTATE270" |
           "UPSIDEDOWN" | "PROPORTIONAL" | "DOUBLEHIGH" |
           "TRIPLEHIGH" | "QUADRUPLEHIGH" | "CONDENSED" |
           "SUPERSCRIPT" | "OVERSCORE" | "LETTERQUALITY" |
           "NEARLETTERQUALITY" | "DOUBLESTRIKE" | "OPAQUE"
    ;

WS : (' ' | '\t')+ { $setType(Token.SKIP); }
    ;

NEWLINE
    : '\r' '\n' { newline(); $setType(Token.SKIP);}
    | '\n' { newline(); $setType(Token.SKIP);}
    | '\r' { newline(); $setType(Token.SKIP);}
    ;
------------------------------------------------
public new Token nextToken() //throws TokenStreamException
{
    ...
    default:
        if ((LA(1)=='D'||LA(1)=='S') && (LA(2)=='I'||LA(2)=='O') &&
            (LA(3)=='N'||LA(3)=='T'||LA(3)=='U') &&
            (LA(4)=='B'||LA(4)=='G'||LA(4)=='T') && (LA(5)=='E'||LA(5)=='L') &&
            (LA(6)=='D'||LA(6)=='E') && (true) && (true) && (true) && (true) &&
            (true))
        {
            mFRAMESTYLETYPE(true);
            theRetToken = returnToken_;
        }
        else if ((tokenSet_0_.member(LA(1))) && (tokenSet_1_.member(LA(2))) &&
                 (tokenSet_2_.member(LA(3))) && (tokenSet_3_.member(LA(4))) && (true) &&
                 (true) && (true) && (true) && (true) && (true) && (true)) {
            mSTYLETYPE(true);
            theRetToken = returnToken_;
        }
        else if ((LA(1)=='B') && (LA(2)=='E') && (LA(3)=='G')) {
            mBEGIN(true);
            theRetToken = returnToken_;
        }
        else if ((LA(1)=='S') && (LA(2)=='T') && (LA(3)=='Y')) {
            mSTYLE(true);
            theRetToken = returnToken_;