[antlr-interest] Lookahead problems - Bug in C++ Runtime?

Martin Probst mail at martin-probst.com
Fri Sep 17 04:03:25 PDT 2004


Hi,
I've done some work to track this down, and it seems to be a bug in the
C++ runtime. I created a very simple sample grammar and ran it through
ANTLR twice, once with language="Cpp" and once with language="Java".
In Java mode everything works fine and the parser behaves as LL(1). In
C++ mode, two tokens are somehow read from the lexer in advance.
See these traces. First Java, which is fine:
=== snip ===
martin@perseus Parser $ echo {{{foo}}} | java -classpath /usr/share/antlr/lib/antlr.jar:. TestMain
 > expr;  > lexer mLCURLY; c=={
 < lexer mLCURLY; c=={
LA(1)=={
  > enclosedExpr; LA(1)=={
   > expr;  > lexer mLCURLY; c=={
[...]
=== snap ===
Then C++:
=== snip ===
martin@perseus Parser $ echo {{{foo}}} | ./TestMain
 > expr; LA(1)== > lexer mLCURLY; c==123
 < lexer mLCURLY; c==123
 > lexer mLCURLY; c==123
 < lexer mLCURLY; c==123
{
  > enclosedExpr; LA(1)=={
   > expr; LA(1)== > lexer mLCURLY; c==123
 < lexer mLCURLY; c==102
{
=== snap ===
You can see that it's reading one more token than it should. Usually
this is no problem, but when you want to (read: have to) change the
state of the lexer inside particular grammar rules, things break because
the next token has already been lexed in the old, wrong state. As it
stands I can't use ANTLR :-/
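
To illustrate why one extra token of lookahead matters once the lexer has
states, here is a minimal sketch. It is not code from the ANTLR runtime:
ModalLexer, TokenBuffer and tokenAfterSwitch are hypothetical names, and the
"extra" parameter is only an assumption that models the behaviour seen in the
C++ trace (one token fetched beyond what LA(1) needs). Each input character is
one token, tagged with the lexer mode that was active when it was lexed.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical two-mode lexer: one character per token, tagged with the
// mode that was active at the moment the token was produced.
struct ModalLexer {
    std::string input;
    std::size_t pos = 0;
    char mode = 'A';  // the parser switches this from a grammar action

    explicit ModalLexer(std::string s) : input(std::move(s)) {}

    std::string next() {
        if (pos >= input.size()) return "EOF";
        return std::string(1, input[pos++]) + "/" + mode;
    }
};

// Hypothetical k=1 token buffer. With extra == 0 it fetches a token only
// when LA1() needs it; with extra == 1 it always keeps one more token
// queued, which is the assumed model of the C++ trace above.
struct TokenBuffer {
    ModalLexer& lexer;
    std::size_t extra;
    std::deque<std::string> queue;

    TokenBuffer(ModalLexer& l, std::size_t e) : lexer(l), extra(e) {}

    const std::string& LA1() {
        while (queue.size() < 1 + extra) queue.push_back(lexer.next());
        return queue.front();
    }

    void consume() {
        LA1();  // make sure there is a token to drop
        queue.pop_front();
    }
};

// Match "}", switch the lexer to mode 'B' right after it (as a grammar
// action placed directly behind RCURLY would), then report the next token.
std::string tokenAfterSwitch(std::size_t extra) {
    ModalLexer lexer("}x");
    TokenBuffer buf(lexer, extra);
    assert(buf.LA1()[0] == '}');  // the parser sees the closing brace
    buf.consume();
    lexer.mode = 'B';             // state switch comes "too late" if extra > 0
    return buf.LA1();
}
```

With these definitions, tokenAfterSwitch(0) yields "x/B" (the token after the
brace is lexed in the new mode), while tokenAfterSwitch(1) yields "x/A": the
token was already sitting in the buffer, lexed in the old mode, before the
action could run. That is the failure I am seeing, under the assumption that
the C++ runtime buffers one token more than the Java one.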

I'm attaching the grammar, a TestMain.java and a TestMain.cpp source.
Everything should be really straightforward.

Best regards,
Martin

On Thu, 16.09.2004 at 13:29, Martin Probst wrote:
> Hello,
> I have a lookahead problem with my grammar. I have a parser which has k=1
> but it actually seems to be looking ahead further than it should. See this
> output of ANTLR with -traceParser -traceLexer:
> 
> In the state before these steps my parser has recognized a "dirAttribute".
> It looks ahead, finds a "=" and a '"' and then descends into a
> dirAttributeValue. That's expected and good.
> 
> === snip ===
> > dirAttributeValue; LA(1)== > lexer mNEXT; c==104
>   > lexer mQUOT_ATTR_CONTENT; c==104
>   < lexer mQUOT_ATTR_CONTENT; c==123
>  < lexer mNEXT; c==123
> "
>  > lexer mNEXT; c==123
>   > lexer mLCURLY; c==123
>   < lexer mLCURLY; c==32
>  < lexer mNEXT; c==32
>  > quotAttrValueContent; LA(1)==http://www.w3
>  < quotAttrValueContent; LA(1)== > lexer mNEXT; c==32
>   > lexer mWS; c==32
>   < lexer mWS; c==34
>  < lexer mNEXT; c==34
>  > lexer mNEXT; c==34
>   > lexer mSTRING_LITERAL; c==34
>    > lexer mQUOT; c==34
>    < lexer mQUOT; c==46
>    > lexer mQUOT; c==34
>    < lexer mQUOT; c==32
>   < lexer mSTRING_LITERAL; c==32
>  < lexer mNEXT; c==32
> {
>  > quotAttrValueContent; LA(1)=={
>   > attrCommonContent; LA(1)=={
>    > expr; LA(1)== > lexer mNEXT; c==32
>   > lexer mWS; c==32
>   < lexer mWS; c==125
>  < lexer mNEXT; c==125
>  > lexer mNEXT; c==125
>   > lexer mRCURLY; c==125
>   < lexer mRCURLY; c==32
>  < lexer mNEXT; c==32
> .org
> [ ca. 15 grammatical steps removed ]
>     > literal; LA(1)==.org
>      > stringLiteral; LA(1)==.org
>      < stringLiteral; LA(1)== > lexer mNEXT; c==32
>   > lexer mWS; c==32
>   < lexer mWS; c==47
>  < lexer mNEXT; c==47
>  > lexer mNEXT; c==47
>   > lexer mSLASH; c==47
>   < lexer mSLASH; c==49
>  < lexer mNEXT; c==49
> }
>     < literal; LA(1)==}
> 
> === snap ===
> 
> Now the rule for "attrCommonContent" states:
> attrCommonContent:
>   /* some more alts */
>   | LCURLY expr RCURLY
> The lookahead of RCURLY should therefore be sufficient to exit the
> attrCommonContent rule. So why does the parser require more lookahead from
> the lexer when exiting stringLiteral?
> 
> The problem is that within dirAttributeValue the lexer has to
> produce tokens differently than within the following expr rules.
> This means I have to switch the lexer to a different state (done with
> actions within {} in the grammar). I can't switch the state before the
> parser leaves the attrCommonContent section (that is, the statement has
> to come directly after the RCURLY in that rule). But at that point the
> parser has obviously already fetched more tokens beyond the RCURLY, which
> leads to errors.
> 
> My lexer has k=2 and everything uses C++ with the runtime and
> generator from antlr-2.7.4. Can anyone help me with this? Am I
> misunderstanding ANTLR's behaviour in general, or is this a bug?
> 
> Thanks,
> Martin
> 
> 
>  


 
Yahoo! Groups Links

<*> To visit your group on the web, go to:
    http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
    antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
    http://docs.yahoo.com/info/terms/
 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: TestMain.cpp
Type: text/x-c++src
Size: 670 bytes
Desc: not available
Url : http://www.antlr.org/pipermail/antlr-interest/attachments/20040917/720897ea/TestMain.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: TestMain.java
Type: text/x-java
Size: 280 bytes
Desc: not available
Url : http://www.antlr.org/pipermail/antlr-interest/attachments/20040917/720897ea/TestMain-0001.bin
-------------- next part --------------
options {
	language = "Cpp";
}

class TestParser extends Parser;
options {
	k=1;
}

expr:
	(primaryExpr | enclosedExpr)*;
	
enclosedExpr:
	LCURLY expr RCURLY;

primaryExpr:
	FOO; 

class TestLexer extends Lexer;
options { k=1; }

LCURLY: "{";

RCURLY: "}";

FOO: "foo";

