[antlr-interest] Greedy Issues

Sun Jun 22 16:14:44 PDT 2008

Good day everyone,

I am having issues with the greedy option (I think).

My input is:

+t pt +t foo !

The +t marker indicates the writing system (in this case transliteration).
The alternative is +s, which is hieroglyphic.  I would like to group together
all characters within each writing system, so my preferred output is:

(SEQUENCE (LINE (PHRASE TRANSLITERATION p t) (PHRASE TRANSLITERATION f o o)))

If I use the following grammar, I get the expected result:

options {
  language = Java;
  output = AST;
  ASTLabelType = CommonTree;
  backtrack = true;
  memoize = true;
}

mdc
  : line* EOF
    -> ^(SEQUENCE line*)
  ;

line
  : languageMarker seperator* endOfLine
    -> ^(LINE)
  | languageMarker span (languageMarker span)* endOfLine
    -> ^(LINE span+)
  ;

languageMarker
  : '+' 's' seperator { writingSystem = WritingSystem.HIEROGLYPHS; }
  | '+' 't' seperator { writingSystem = WritingSystem.TRANSLITERATION; }
  ;

endOfLine
  : '!' (seperator+ | EOF)
  | seperator* EOF
  ;

span
  : { writingSystem.isAlphabetic() }?=> text* seperator
    -> ^(PHRASE PHRASE[writingSystem.name()] text*)
  | { !writingSystem.isAlphabetic() }?=> //word //(seperator! word)*
seperator!
  ;

text options { greedy=false; }
  : (SPACE '+')=>{false}?=>.
  | .
  ;

seperator
  : SPACE
  | UNDERSCORE
  ;

affixSeperator
  : EQUAL
  //| DASH
  ;

UNDERSCORE: '_';
EQUAL: '=';
DASH: '@';
SPACE: (' ' | '\t' | '\r'? '\n');
CHAR: 'A'..'Z' | 'a'..'z';
DIGIT: '0'..'9';

If I restore the line:

//| DASH

Then my output gets turned into:

(SEQUENCE (LINE (PHRASE TRANSLITERATION p t   + t   f o o)))

It seems as if the greedy option no longer takes effect when this completely
unrelated (and currently unused) alternative is enabled.

I am using ANTLR 3.1b1.  I cannot use ANTLR 3.0 because use of the greedy
option causes sporadic null pointer exceptions during generation.

To give you some perspective, this is a transcription language called MdC,
used to transcribe ancient Egyptian texts.  I have spent all day
(approximately 6 hours) trying to figure out how to parse a sequence of
characters without actually consuming the end tokens.  Effectively, what I
want is a negative lookahead expression, which is what I am trying to do in
"text".

For example:

+s dwa r:a-ra:Z1 +t dwa ra +s nfr !

Needs to be grouped as:

(SEQUENCE (LINE (WORD dwa r:a-ra:Z1) (PHRASE TRANSLITERATION dwa ra) (WORD nfr)))
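
To make the intended grouping concrete, here is a minimal plain-Java sketch
of it.  This is not ANTLR and not part of my grammar: the class name, the
method name and the "marker:body" output format are made up purely for
illustration, and it splits on whitespace only (it ignores '_' as a
separator).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: groups the tokens of one MdC line
// by the writing-system marker (+s or +t) that precedes them.
class MdcGroupingSketch {

    // Returns one "marker:body" string per span, in input order.
    static List<String> group(String line) {
        List<String> spans = new ArrayList<>();
        String system = null;                       // marker currently in effect
        StringBuilder body = new StringBuilder();   // text collected for that marker

        for (String token : line.trim().split("\\s+")) {
            boolean isMarker = token.equals("+s") || token.equals("+t");
            boolean isEnd = token.equals("!");
            if (isMarker || isEnd) {
                // A marker or the end of line closes the span in progress.
                if (system != null && body.length() > 0) {
                    spans.add(system + ":" + body);
                }
                body.setLength(0);
                system = isMarker ? token : null;
            } else {
                if (body.length() > 0) {
                    body.append(' ');
                }
                body.append(token);
            }
        }
        // Close a trailing span if the line had no '!' terminator.
        if (system != null && body.length() > 0) {
            spans.add(system + ":" + body);
        }
        return spans;
    }

    public static void main(String[] args) {
        System.out.println(group("+s dwa r:a-ra:Z1 +t dwa ra +s nfr !"));
    }
}
```

Each element of the returned list corresponds to one WORD or PHRASE child of
LINE in the tree above.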

As you can see, I have disabled all of the WORD parsing in order to simplify
the example.  I have no trouble with the WORDs, since it is very clear
what composes a word.  In the case of transliteration, I ultimately need to
suck up all characters until I reach

' !' | ' +'

(each preceded by a space), without actually consuming those tokens (they
need to be handled higher up the parse tree).
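
Outside of ANTLR, the behaviour I am after would look roughly like the
hand-written scan below.  It is only a sketch of the negative lookahead, with
hypothetical names (SpanScanSketch, endOfSpan), and it assumes the input is a
plain String rather than a token stream.

```java
// Hypothetical illustration only: finds where a transliteration span ends
// without consuming the terminator.
class SpanScanSketch {

    // Returns the index of the first space that is immediately followed by
    // '!' or '+', searching from 'start'.  If there is no such space, returns
    // input.length().  The terminator itself is left for the caller.
    static int endOfSpan(String input, int start) {
        for (int i = start; i < input.length(); i++) {
            if (input.charAt(i) == ' ' && i + 1 < input.length()) {
                char next = input.charAt(i + 1);   // one character of lookahead
                if (next == '!' || next == '+') {
                    return i;
                }
            }
        }
        return input.length();
    }

    public static void main(String[] args) {
        String input = "pt +t foo !";
        int end = endOfSpan(input, 0);
        System.out.println("[" + input.substring(0, end) + "]");
    }
}
```

The point is that the scan stops in front of the space, so ' +' or ' !' is
still available to whatever handles the next marker or the end of the line.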

Thank you for your help,

Ted
