[antlr-interest] First post, possible ANTLR 2.7.6 bug
Foolish Ewe
foolishewe at hotmail.com
Thu Oct 19 15:00:57 PDT 2006
Hello All:
I've just done the Antlr mailing list registration, I hope the moderator
will contact me
if for some reason this message is not approved.
Tihs message seeks to address a few questions about a project I'm working
on.
1) I think I may have encountered a bug in Antlr 2.7.6 (same version as in
the original posting)
(it may be just be that there is some code generation setting or lexer
option settting
that I'm overlooking, that would be a great help). Please consider the
following example
*******************BEGIN example.g********************************
//My play area for diagnosing strange ANTLR symptoms
//Version History: 1.0 WAM created
// WAM - Need to add some boilerplate to let Antlr generated files know that
they are part of the ZTestParser package
header{
package testing;
}
class P extends Parser;
// Parser options
options{
k = 2; // Token stream lookahead, remember ANTLR uses LL(k) parsing
}
// Antlr requires Terminals have names beginning with uppercase letters,
Nonterminals should use lowercase I guess
startRule
:
strval:NONEMPTYALPHASTRING nl:NEWLINE // breaks if the user types in "dun\n"
where \n is newline
;
/*
optionala1:optA b:B
|
optionala2:optA c:C
;
*/
optA
:
A
|
//empty
;
class L extends Lexer;
// Lexer options
options{
k=7; // lookahead (need 2 for new line)
charVocabulary='\u0000'..'\u007F'; // Only support printable ASCII
characters, don't try fancy unicode stuff
// case sensitivitity turned off
caseSensitiveLiterals=false;
caseSensitive=false;
}
NEWLINE
: '\r' '\n' {newline();} // DOS
| '\r' {newline();} // Macintosh
| '\n' {newline();} // UNIX
;
WHITESPACE : ' ' {$setType(Token.SKIP);} // space character
| '\t' {System.out.println("Found a tab"); tab();
$setType(Token.SKIP);};
MONTH: ("jan" | "feb" | "mar" | "apr" | "may" | "jun" | // MMM (month)
"jul" | "aug" | "sep" | "oct" | "nov" | "dec");
NONEMPTYALPHASTRING: ('a'..'z')+; // Probably needs testliteral check
********************END example.g********************************
For ease of replication, I have attached a Java source driver:
*******************Begin Main.java*********************************
package testing;
import java.io.*;
public class Main {
/**
* @param args
*/
public static void main(String[] args) {
try{
System.out.println("Enter a string for the test parser (note this is for
simple ANTLR test cases)");
L lexer = new L(new DataInputStream(System.in));
System.out.println("After lexer instantiated before Parser instantiation");
P parser = new P(lexer);
System.out.println("After Parser instantiation before StartRule");
parser.startRule();
System.out.println("After startRule: Done?");
} catch(Exception e) {
System.err.println("exception: "+e);
}
}
}
************************end
Main.java******************************************
Running the example gives the following trace (all tracing was enabled):
************************begin
trace*******************************************
Enter a string for the test parser (note this is for simple ANTLR test
cases)
After lexer instantiated before Parser instantiation
After Parser instantiation before StartRule
>startRule; dun
>lexer mMONTH; c==d
< lexer mMONTH; c==u
exception: line 1:2: expecting 'e', found 'u'
**********************end
trace**********************************************
I think this happens because of the code generation, which appears (on the
surface) to
check to see if LA(i) matches the ith letter of any MONTH alternative, 1 <=
i <= 3 via bitmaps.
However, this is just a heuristic, and not a sure test that the lexeme is
indeed a MONTH
alternative, so "d" matches the prefix of dec, while "un" matches the suffix
of "jun".
Please correct me if I'm wrong here, are there any suggested work arounds?
I'm reluctant to make separate rules for each alternative, because I've seen
the lexer
follow wrong alternatives in grammars when I've tried adding "helper" rules,
so I try
to stick to literals on the RHS of lexer grammars. If there is someway to
tell the lexer
that a lexer rule is not associated with a legitimate terminal, but just a
subrule to make
writing the scanner easier, I'd like to hear it.
2) On a side note a colleague and I discovered (over a nice cup of coffiee)
that left factoring
does solve a problem I unsuccessfully tried to submit to the list
ie--mail, so using a rule like:
startRule: opta:optA ( b:B {System.out.println("Got optA B");} | c:C
{System.out.println("Got optA C");});
gets around this (I'm still not clear on why a larger lookahead didn't
allow recognition of the
original grammar, is it inherent in LL(k) parsing theory or is it an
implementation limitation?).
If anyone could point me to a reference or give a quick explanation, I'd
appreciate it.
Regards:
Bill M. (Sorry but my last name is too unique and generates a lot of spam)
These opinions are my own and do not reflect in any way on my employer
From: "Foolish Ewe" <foolishewe at hotmail.com>
To: antlr-interest at antlr.org
CC: FoolishEwe at hotmail.com
Subject: Handling optional tokens as prefixes in productions
Date: Mon, 16 Oct 2006 21:31:19 +0000
Hello All:
I've tried Antlr for a small project over the last few days, and so far it
has been good.
However, I'm having a problem with an optional prefix that is specified in
the language
I'm parsing, and I don't know whether:
1) LL(k) parsers can handle such a construct (although I hope so, so that I
can use ANTLR)
2) If they can handle such a construct, what is the right way to pose the
grammar for them
Consider the following (simple) example using ANTLR pseudocode, if you need
the
complete ANTLR source for the example let me know and I'll try to post it.
.// Lexer rules, use case insensitive lexemes
A: 'A' | 'a';
B: 'B' | 'b';
C: 'C' | 'c';
// Usual White space and new line stuff were in the code but are omitted
here
Now my parser has the rules:
// Uses the optional prefix of A or nothing
startRule // Gets nondeterminism warning from ANTLR, even with k=7 parser
look ahead
:
opta1:optA b:B
|
opta2:optA c:C
;
// This is the parser's rule for the optional prefix
optA
:
a:A
|
.// empty
;
I'm using the windows version from the Source Forge Eclipse plugin at
http://sourceforge.net/project/showfiles.php?group_id=55477 (Antlr 2.5.76).
Is there a way to refactor the grammar to fix this?
Thanks for your help:
Bill M. (sorry my last name is unusual and generates a lot of spam)
Disclaimer: These opinions and questions are my own and are not those of my
employer.
__
_________________________________________________________________
Stay in touch with old friends and meet new ones with Windows Live Spaces
http://clk.atdmt.com/MSN/go/msnnkwsp0070000001msn/direct/01/?href=http://spaces.live.com/spacesapi.aspx?wx_action=create&wx_url=/friends.aspx&mkt=en-us
More information about the antlr-interest
mailing list