[antlr-interest] ANTLR4 synpred combination with (..)+ to greedy?

Tue Oct 30 06:24:45 PDT 2012

For now, you can work around this by moving the predicates in ID1 and ID2 to the right side instead of the left side of the [a-zA-Z] set. The predicates' text can stay the same.
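
As an untested sketch of that workaround, the two fragment rules from the grammar quoted below would look roughly like this, with each predicate moved to the right of the character set and its text left unchanged. The remaining rules are copied from the original grammar as-is; whether this behaves the same on later ANTLR 4 releases is an assumption.

```
lexer grammar MyLexer;

WORD1 : ID1+ ;
WORD2 : ID2+ ;

// Predicates now follow the [a-zA-Z] set instead of preceding it.
fragment ID1 : [a-zA-Z] {getCharPositionInLine()<2}? ;
fragment ID2 : [a-zA-Z] {getCharPositionInLine()>=2}? ;

WS : [ \t\r\n]+ -> skip ;
```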

-----Original Message-----
From: Sam Harwell 
Sent: Tuesday, October 30, 2012 8:22 AM
To: 'cd.barth at t-online.de'; antlr-interest at antlr.org
Subject: RE: [antlr-interest] ANTLR4 synpred combination with (..)+ to greedy?

For left*most* edge predicates (evaluated before any character of the token is matched), the input index will be located where you expect it. For all other predicates in the lexer, the input index will be one character to the left of where you expect it, because consume() is not called before the predicate is evaluated.

This behavior may change in the future, but that certainly explains the behavior you're seeing.
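
To make the off-by-one concrete, here is a toy model in plain Java. It does not use ANTLR at all and is not how the lexer is implemented; the class PredicateOffsetModel, the method matchWord1 and the staleIndex flag are hypothetical names invented for this illustration. It only mimics the rule WORD1 : ID1+ with the predicate on the left side, assuming that every predicate except the leftmost one sees a column one less than the true column.

```java
final class PredicateOffsetModel {

    // Hypothetical helper: true for the characters matched by [a-zA-Z].
    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    /**
     * Toy model of WORD1 : ID1+ where ID1 is {column < 2}? [a-zA-Z].
     * Returns the text matched at the start of the line.
     * With staleIndex set, every predicate after the leftmost one
     * sees a column that is one less than the true column.
     */
    static String matchWord1(String line, boolean staleIndex) {
        int column = 0; // true column of the next character to match
        while (column < line.length() && isAsciiLetter(line.charAt(column))) {
            int seen = (staleIndex && column > 0) ? column - 1 : column;
            if (!(seen < 2)) {
                break; // predicate fails, ID1+ stops here
            }
            column++;
        }
        return line.substring(0, column);
    }

    public static void main(String[] args) {
        System.out.println(matchWord1("abcde", true));  // abc
        System.out.println(matchWord1("abcde", false)); // ab
        System.out.println(matchWord1("a cde", true));  // a
    }
}
```

Under this model the stale index reproduces the WORD1 token 'abc' reported for the second input line, while an index that is always current would give 'ab'.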

-----Original Message-----
From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of cd.barth at t-online.de
Sent: Tuesday, October 30, 2012 4:16 AM
To: antlr-interest at antlr.org
Subject: [antlr-interest] ANTLR4 synpred combination with (..)+ to greedy?

Using the following grammar

lexer grammar MyLexer;

WORD1 : ID1+ ;
WORD2 : ID2+ ;

fragment ID1 : {getCharPositionInLine()<2}? [a-zA-Z] ;
fragment ID2 : {getCharPositionInLine()>=2}? [a-zA-Z] ;

WS : [ \t\r\n]+ -> skip ;

and looking at lexer tokens with 

for (Token token : lexer.getAllTokens()) {
    // Look up the symbolic token name from the token type.
    int idx = token.getType();
    String tokenName = lexer.getTokenNames()[idx];
    System.out.format(" %-12s", tokenName);
    System.out.println(token);
}

for these two input lines

a cde

abcde

I get the following results:

WORD1       [@-1,0:0='a',<1>,1:0]
WORD2       [@-1,2:4='cde',<2>,1:2]
WORD1       [@-1,7:9='abc',<1>,2:0]
WORD2       [@-1,10:11='de',<2>,2:3]

And now my question:

Why is the letter c in the first line "a cde" part of WORD2, but in the second line "abcde" part of WORD1?

My sneaking suspicion is that, for the second line, the ()+ construct in ID1+ is too greedy and consumes one character too many.

Claus-Dieter

List: http://www.antlr.org/mailman/listinfo/antlr-interest