[antlr-interest] ANTLR4 synpred combination with (..)+ too greedy?

Tue Oct 30 07:36:36 PDT 2012

Thank you for the hint, Sam.

I moved the predicates
from the left side of the [a-zA-Z] set
   fragment ID1 : {getCharPositionInLine()<2}?   [a-zA-Z];
   fragment ID2 : {getCharPositionInLine()>=2}?  [a-zA-Z];
to the right side
   fragment ID1 : [a-zA-Z] {getCharPositionInLine()<2}?  ;
   fragment ID2 : [a-zA-Z] {getCharPositionInLine()>=2}? ;

and the printed tokens for the input 'abcde' changed from
WORD1       [@-1,7:9='abc',<1>,2:0]
WORD2       [@-1,10:11='de',<2>,2:3]
to
WORD1       [@-1,7:8='ab',<1>,2:0]
WORD2       [@-1,9:11='cde',<2>,2:2]

Now the letter 'c' is part of WORD2, which is what I expected.

Claus-Dieter

-----Original Message-----
From: Sam Harwell [mailto:sam at tunnelvisionlabs.com] 
Sent: Tuesday, October 30, 2012 14:25
To: cd.barth at t-online.de; antlr-interest at antlr.org
Subject: RE: [antlr-interest] ANTLR4 synpred combination with (..)+ too
greedy?

For now, you can work around this by moving the predicates in ID1 and ID2 to
the right side instead of the left side of the [a-zA-Z] set. The predicates'
text can stay the same.

-----Original Message-----
From: Sam Harwell 
Sent: Tuesday, October 30, 2012 8:22 AM
To: 'cd.barth at t-online.de'; antlr-interest at antlr.org
Subject: RE: [antlr-interest] ANTLR4 synpred combination with (..)+ too
greedy?

For left*most* edge predicates (evaluated before any character of the token
is matched), the input index will be located where you expect it. For all
other predicates in the lexer, the input index will be located one character
to the left of where you expect it, because consume() is not called before
the predicate is evaluated.

This behavior may change in the future, but that certainly explains the
behavior you're seeing.
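
The effect described above can be reproduced with a small toy model. This is
only a sketch under stated assumptions, not ANTLR's actual lexer simulator:
the class PredicateOffsetModel and its methods are hypothetical names, the
model handles a single input line (so character index equals column), and it
assumes that a non-leftmost predicate observes the column of the letter that
was matched last but not yet consumed. The two predicates are the ones from
the WORD1/WORD2 grammar in this thread (column < 2 and column >= 2).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class PredicateOffsetModel {
    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Column a predicate observes for the letter at index i (assumed model):
    // - predicate on the left: the leftmost one sees the true column of the
    //   letter about to be matched; later ones see one column less, because
    //   the previous letter has not been consumed yet.
    // - predicate on the right: sees the column of the letter just matched.
    static int observedColumn(int tokenStart, int i, boolean predicateOnLeft) {
        if (predicateOnLeft) {
            return i == tokenStart ? i : i - 1;
        }
        return i;
    }

    // Length of the longest run of letters starting at 'start' for which
    // every predicate evaluation succeeds.
    static int matchLength(String line, int start, IntPredicate pred,
                           boolean predicateOnLeft) {
        int i = start;
        while (i < line.length()
                && isAsciiLetter(line.charAt(i))
                && pred.test(observedColumn(start, i, predicateOnLeft))) {
            i++;
        }
        return i - start;
    }

    // Splits one line into WORD1/WORD2 lexemes, preferring the longer match.
    static List<String> tokenize(String line, boolean predicateOnLeft) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < line.length()) {
            int word1 = matchLength(line, pos, col -> col < 2, predicateOnLeft);
            int word2 = matchLength(line, pos, col -> col >= 2, predicateOnLeft);
            int len = Math.max(word1, word2);
            if (len == 0) {
                pos++; // whitespace or anything unmatched is skipped
                continue;
            }
            String rule = word1 >= word2 ? "WORD1:" : "WORD2:";
            tokens.add(rule + line.substring(pos, pos + len));
            pos += len;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("abcde", true));  // [WORD1:abc, WORD2:de]
        System.out.println(tokenize("abcde", false)); // [WORD1:ab, WORD2:cde]
        System.out.println(tokenize("a cde", true));  // [WORD1:a, WORD2:cde]
    }
}
```

With the predicate on the left, the check that guards 'c' still observes
column 1, so 'c' ends up in WORD1. With the predicate on the right, the check
after 'c' observes column 2 and fails, so WORD1 stops at 'ab'.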

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of cd.barth at t-online.de
Sent: Tuesday, October 30, 2012 4:16 AM
To: antlr-interest at antlr.org
Subject: [antlr-interest] ANTLR4 synpred combination with (..)+ too greedy?

Using the following grammar

lexer grammar MyLexer;
WORD1                : ID1+;
WORD2                : ID2+;   

fragment ID1 : {getCharPositionInLine()<2}?   [a-zA-Z];
fragment ID2 : {getCharPositionInLine()>=2}? [a-zA-Z];

WS : [ \t\r\n]+ -> skip ;

and printing the lexer tokens with 
for (Token token : lexer.getAllTokens()) {
    // Look up the rule name that belongs to this token's type.
    String tokenName = lexer.getTokenNames()[token.getType()];
    System.out.format(" %-12s", tokenName);
    System.out.println(token);
}

for these two input lines
a cde
abcde

prints the following results
WORD1       [@-1,0:0='a',<1>,1:0]
WORD2       [@-1,2:4='cde',<2>,1:2]

WORD1       [@-1,7:9='abc',<1>,2:0]
WORD2       [@-1,10:11='de',<2>,2:3]
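
For readers unfamiliar with the printed token form, the following sketch
splits such a line into named fields. It assumes the layout is
[@index,start:stop='text',<type>,line:column], which is a reading inferred
from the output above; the class TokenLineDecoder and its method describe are
hypothetical helpers and not part of ANTLR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenLineDecoder {
    // Assumed layout: [@index,start:stop='text',<type>,line:column]
    private static final Pattern TOKEN = Pattern.compile(
        "\\[@(-?\\d+),(\\d+):(\\d+)='(.*)',<(\\d+)>,(\\d+):(\\d+)\\]");

    // Returns a labelled description of one printed token.
    static String describe(String printed) {
        Matcher m = TOKEN.matcher(printed.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("unrecognized token text: " + printed);
        }
        return String.format("text='%s' chars=%s..%s type=%s line=%s column=%s",
            m.group(4), m.group(2), m.group(3), m.group(5), m.group(6), m.group(7));
    }

    public static void main(String[] args) {
        // prints: text='abc' chars=7..9 type=1 line=2 column=0
        System.out.println(describe("[@-1,7:9='abc',<1>,2:0]"));
        // prints: text='de' chars=10..11 type=2 line=2 column=3
        System.out.println(describe("[@-1,10:11='de',<2>,2:3]"));
    }
}
```

Read this way, the last pair in each token is the line and the column at
which the token starts, which is the value the predicates compare against 2.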

And now my question:
Why is letter c from the first line "a cde" part of WORD2
and in the next line               "abcde"  part of WORD1?

My sneaking suspicion is that, for the second line, the ()+ construct in
ID1+ is too greedy and consumes one character too many.

Claus-Dieter