[antlr-interest] Using a Parser as a TokenFilter
Chris Black
chris at lotuscat.com
Thu May 12 07:52:46 PDT 2005
Ric Klaren wrote:
>Hi,
>
>On 5/11/05, Chris Black <chris at lotuscat.com> wrote:
>
>
> [stuff deleted]
>
>>line:
>> (NEWLINE) => emptyLine
>> | ((FIELD | DELIM)+ NEWLINE) => contentLine
>> ;
>>
>>
>
>I'd get rid of these predicates. They serve no purpose. My rule of
>thumb: when you got a rule with every alternative guarded with a
>syntactic predicate then you're probably doing something wrong (you
>get rules that might not consume input and that's usually good for
>strange stuff).
>
>
>
Good point. I have changed it to just use | alternatives.
>The lookahead you're looking at is not ambiguous and in the case of
>erroneous input the rule might not consume anything. Also you're not
>handling EOF... e.g. in case of EOF you'll get a RecognitionException
>that gets eaten by nextToken (silently as well) (try adding a few
>println's to the bits that eat exceptions, for a filter I wrote a
>while back some of them needed extra handling).
>
>
>
Added some printlns in the nextToken exception handlers; they never
printed anything. Probably not a bad idea to have them in there, though.
>The following rule is a lot simpler. Try to differentiate between the
>things you want with some attributes in this filter (e.g. when you
>call the line rule from nextToken set an attribute that you started a
>line and reset a counter for the fields then update the counter in the
>closure below. You might also want to set a flag in the (NEWLINE|EOF)
>bit. So you can detect better how the line rule ended (and when the
>next line will start!). I also miss the code that inserts the marker
>at the start of the line.
>
>line:
> (FIELD | DELIM)* ( NEWLINE | EOF )
> ;
>
>Or something like:
>
>line:
>  FIELD (DELIM FIELD)* (DELIM)* eol
>| (DELIM)* eol
>;
>
>eol:( NEWLINE | EOF );
>
>
>
I switched to using the eol rule; this also turned out to stop the
spurious "unexpected null token" errors I was seeing from the downstream
parser.
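
For anyone following along, the other half of Ric's suggestion (counting fields per line and recording whether a line ended on NEWLINE or EOF) could be sketched in plain Java roughly as below. This is not part of my grammar: the class name LineTracker and all of its methods are hypothetical, and the calls would have to be wired into the grammar actions by hand (startLine() before calling line(), sawField() in the FIELD alternative, endLine() in eol).

```java
// Hypothetical helper; none of these names exist in ANTLR or in the grammar in this post.
final class LineTracker {
    enum End { NONE, NEWLINE, EOF }

    private int fieldCount;
    private End lastEnd = End.NONE;

    // Call just before invoking the line rule from nextToken().
    void startLine() {
        fieldCount = 0;
        lastEnd = End.NONE;
    }

    // Call from the action attached to each FIELD match.
    void sawField() {
        fieldCount++;
    }

    // Call from the eol rule; atEof is true when EOF was matched instead of NEWLINE.
    void endLine(boolean atEof) {
        lastEnd = atEof ? End.EOF : End.NEWLINE;
    }

    int fieldCount() {
        return fieldCount;
    }

    End lastEnd() {
        return lastEnd;
    }

    // True once a line has ended on EOF, so nextToken() knows not to call line() again.
    boolean inputExhausted() {
        return lastEnd == End.EOF;
    }

    public static void main(String[] args) {
        LineTracker t = new LineTracker();
        t.startLine();
        t.sawField();
        t.sawField();
        t.endLine(false);
        System.out.println(t.fieldCount() + " fields, ended on " + t.lastEnd());
    }
}
```

With something like this, nextToken() could check inputExhausted() instead of relying on an exception to find out that the input has run dry.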
>Also first get things to work with the marking of the stanzas, then
>add the comma eating. When a filter starts eating input it might eat
>all the input and that takes some extra handling if I recall right.
>It's probably a good idea to let the line rule finish in nextToken
>then check the tail of the queue for trailing commas and nuke them
>from the queue.
>
>
>
Probably a better approach. Before getting your reply I gave up on the
trailing DELIM eating for a bit and switched focus to adding imaginary
tokens, which solved my downstream problems on its own. I may still do
further work to kill the trailing delims in the future, to clean up the
downstream parser.
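
If I do get back to it, the "let the line rule finish, then trim the tail of the queue" idea might look something like the sketch below. It is plain Java so it can be tried without ANTLR: the queue holds bare token type codes rather than Token objects, the numeric values of FIELD, DELIM and NEWLINE are made up for the example (the real ones come from the imported CSV vocabulary), and TrailingDelimTrimmer is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical sketch; not taken from the grammar in this post.
final class TrailingDelimTrimmer {
    // Assumed token type codes, for illustration only.
    static final int FIELD = 1, DELIM = 2, NEWLINE = 3;

    // Removes the DELIM tokens that sit directly before the end-of-line token
    // (or at the very tail, if the line ended on EOF and no NEWLINE was queued).
    static void trimTrailingDelims(Deque<Integer> queue) {
        if (queue.isEmpty()) {
            return;
        }
        boolean hasEol = queue.peekLast() == NEWLINE;
        if (hasEol) {
            queue.removeLast(); // set the NEWLINE aside for a moment
        }
        while (!queue.isEmpty() && queue.peekLast() == DELIM) {
            queue.removeLast(); // nuke trailing delimiters
        }
        if (hasEol) {
            queue.addLast(NEWLINE); // put the NEWLINE back
        }
    }

    public static void main(String[] args) {
        Deque<Integer> queue =
            new ArrayDeque<>(Arrays.asList(FIELD, DELIM, FIELD, DELIM, DELIM, NEWLINE));
        trimTrailingDelims(queue);
        System.out.println(queue); // prints [1, 2, 1, 3]
    }
}
```

One thing to watch: a line made up only of delimiters collapses to a bare NEWLINE, so the downstream parser would see it as an empty line.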
>Tip: Read the code generated for the line rule and get a feel for how
>it interacts with your consume & nextToken method. In this case it is
>also feasible to handcode the filter since it's not that complex
>parsing wise.
>
>I'm afraid I might not be too coherent/clear in this post but there
>should be some tips in it that might get you going again. I'll look
>again at it when I'm at home again.
>
>Cheers,
>
>Ric
>
>
I appreciate all the help! For the curious, here is what I have now:
header {
    package mypackage;
    import antlr.*;
}

class StanzaParser extends Parser;
options {
    importVocab=CSV;
    k=2;
}
tokens {
    STANZASEPARATOR;
}
{
    MyTokenQueue queue = new MyTokenQueue(8);

    public void consume() {
        try {
            queue.append(LT(1));
        } catch(TokenStreamException e) {
            System.err.println("error in consume");
            System.err.println(e);
            e.printStackTrace();
        }
        super.consume();
    }

    public Token nextToken() throws TokenStreamException {
        Token ret;
        if(queue.length() <= 0) {
            try {
                line();
            } catch(RecognitionException e) {
                System.err.println("recog exception in nextToken");
                System.err.println(e);
                e.printStackTrace();
            }
            catch(TokenStreamException e) {
                System.err.println("tokenstream exception in nextToken");
                System.err.println(e);
                e.printStackTrace();
            }
        }
        if(queue.length() > 0) {
            ret = queue.elementAt(0);
            queue.removeFirst();
            return ret;
        }
        System.out.println("no more queue, returning EOF");
        return new Token(Token.EOF_TYPE,"end of file");
    }
}

line: (emptyLine | contentLine | delim1stLine) ;

emptyLine: eol ;

delim1stLine: DELIM (FIELD | DELIM)+ eol ;

contentLine: firstTok:FIELD
    {
        String firstText = firstTok.getText();
        if(firstText.startsWith("Data Type") ||
           firstText.startsWith("DataType")
           || firstText.equals("Count") || firstText.equals("Result")) {
            queue.append(new Token(STANZASEPARATOR,"stanza sep"));
        }
    }
    (FIELD | DELIM)* eol
    ;

eol: (NEWLINE | EOF) ;
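
MyTokenQueue itself is not included above. The grammar only relies on four operations (append, length, elementAt and removeFirst), so any FIFO with those would do. A minimal stand-in, under that assumption, might look like the following; SimpleTokenQueue is a hypothetical name, it is generic only so it can be tried without ANTLR on the classpath, and it is not the actual class I am using.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the queue used by the filter; T would be antlr.Token in practice.
final class SimpleTokenQueue<T> {
    private final List<T> items;

    SimpleTokenQueue(int initialCapacity) {
        items = new ArrayList<>(initialCapacity);
    }

    // Adds a token at the tail of the queue.
    void append(T token) {
        items.add(token);
    }

    // Number of tokens currently queued.
    int length() {
        return items.size();
    }

    // Peeks at the token at the given position; 0 is the head.
    T elementAt(int index) {
        return items.get(index);
    }

    // Drops the head token. This is O(n) on an ArrayList, which is fine for
    // line-sized queues; a ring buffer would avoid the shifting.
    void removeFirst() {
        items.remove(0);
    }

    public static void main(String[] args) {
        SimpleTokenQueue<String> queue = new SimpleTokenQueue<>(8);
        queue.append("stanza sep");
        queue.append("Data Type");
        System.out.println(queue.elementAt(0)); // prints stanza sep
        queue.removeFirst();
        System.out.println(queue.length()); // prints 1
    }
}
```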