[antlr-interest] Greedy Issues
Ted Young
tyoung1 at tx.rr.com
Sun Jun 22 16:14:44 PDT 2008
Good day everyone,
I am having issues with the greedy option (I think).
My input is:
+t pt +t foo !
The +t marker indicates the writing system (in this case transliteration).
The alternative is +s, which is hieroglyphic. I would like to group together
all characters within each writing system. So my preferred output is:
(SEQUENCE (LINE (PHRASE TRANSLITERATION p t) (PHRASE TRANSLITERATION f o o)))
If I use the following grammar I get the expected result:
options {
language = Java;
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
memoize = true;
}
mdc
: line* EOF
-> ^(SEQUENCE line*)
;
line
: languageMarker seperator* endOfLine
-> ^(LINE)
| languageMarker span (languageMarker span)* endOfLine
-> ^(LINE span+)
;
languageMarker
: '+' 's' seperator { writingSystem = WritingSystem.HIEROGLYPHS; }
| '+' 't' seperator { writingSystem = WritingSystem.TRANSLITERATION; }
;
endOfLine
: '!' (seperator+ | EOF)
| seperator* EOF
;
span
: { writingSystem.isAlphabetic() }?=> text* seperator
-> ^(PHRASE PHRASE[writingSystem.name()] text*)
| { !writingSystem.isAlphabetic() }?=> //word //(seperator! word)*
seperator!
;
text options { greedy=false; }
: (SPACE '+')=>{false}?=>.
| .
;
seperator
: SPACE
| UNDERSCORE
;
affixSeperator
: EQUAL
//| DASH
;
UNDERSCORE: '_';
EQUAL: '=';
DASH: '@';
SPACE: (' ' | '\t' | '\r'? '\n');
CHAR: 'A'..'Z' | 'a'..'z';
DIGIT: '0'..'9';
If I restore the line:
//| DASH
Then my output gets turned into:
(SEQUENCE (LINE (PHRASE TRANSLITERATION p t + t f o o)))
It seems as if the greedy option stops taking effect when this completely
unrelated (and currently unused) alternative is enabled.
I am using ANTLR 3.1b1. I cannot use ANTLR 3.0 because using the greedy
option there causes sporadic null pointer exceptions during generation.
To give you some perspective, this is a transcription language called MdC,
used to transcribe ancient Egyptian texts. I have spent all day
(approximately 6 hours) trying to figure out how to parse a sequence of
characters without consuming the tokens that end it. Effectively, what I
want is a negative lookahead expression, which is what I am trying to do
in "text".
For example:
+s dwa r:a-ra:Z1 +t dwa ra +s nfr !
needs to be grouped as:
(SEQUENCE (LINE (WORD dwa r:a-ra:Z1) (PHRASE TRANSLITERATION dwa ra) (WORD nfr)))
As you can see, I have disabled all of the WORD parsing in order to simplify
the example. I have no trouble with the WORDs, since it is very clear what
composes a word. In the case of transliteration, I ultimately need to
consume all characters until I reach
' !' | ' +'
(each preceded by a space), without actually consuming those tokens (they
need to be handled higher up the parse tree).
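To make the intended stopping rule concrete, here is a minimal sketch in plain Java, outside of ANTLR. It only illustrates the behaviour I am after (read up to, but not including, a space followed by '+' or '!'); it is not a grammar fix. The names LookaheadSketch, findTerminator and readSpan are made up for this example, and the only separator it assumes is a single space character.

```java
public class LookaheadSketch {

    // Returns the index of the first space that is immediately followed by
    // '+' or '!', searching from 'start'. Returns input.length() if no such
    // terminator exists. Nothing is consumed; this is pure lookahead.
    public static int findTerminator(String input, int start) {
        for (int i = start; i + 1 < input.length(); i++) {
            char next = input.charAt(i + 1);
            if (input.charAt(i) == ' ' && (next == '+' || next == '!')) {
                return i;
            }
        }
        return input.length();
    }

    // Reads the text span beginning at 'start' and stops just before the
    // terminator, leaving " +" or " !" for the caller to handle.
    public static String readSpan(String input, int start) {
        return input.substring(start, findTerminator(input, start));
    }

    public static void main(String[] args) {
        String input = "+t pt +t foo !";
        // Index 3 is the first character after the leading "+t " marker.
        System.out.println(readSpan(input, 3));
        // Index 9 is the first character after the second "+t " marker.
        System.out.println(readSpan(input, 9));
    }
}
```

With the sample input from the top of this message, the two spans would be "pt" and "foo", and the " +" and " !" that end them remain unread.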
Thank you for your help,
Ted