[antlr-interest] Patch for filter mode

Wincent Colaiuta win at wincent.com
Sat Jun 9 18:16:31 PDT 2007


Terence has previously written
(<http://www.antlr.org/pipermail/antlr-interest/2007-May/020942.html>):

> filter=true only works in the lexer. :) You should not really have  
> a parser in this case because you cannot really apply a grammatical  
> structure to the incomplete stream of tokens emanating from a lexer  
> that filters most stuff out.

And indeed the ANTLR book on page 119 documents the filter option as
being "lexer only"... I've found what appear to be two bugs which
relate to this:

1. Even though this is documented as a lexer-only option, it takes
effect in the parser as well; the effects include automatically
turning on backtracking in the parser and preventing all parser  
actions from running (they appear in the generated code but as far as  
I can tell no codepath ever reaches them)

2. "filter = true" doesn't work for lexer grammars which are declared  
in a separate file as "lexer grammars"

The following simple patch to src/org/Antlr/codegen/CodeGenerator.java
fixes the first problem. The second one isn't
really of concern to me because I haven't yet had a need to use  
anything other than a combined lexer/parser grammar in a single file:

301c301,302
<     grammar.getOption("filter").equals("true");
---
 >     grammar.getOption("filter").equals("true") &&
 >     ( grammar.type==Grammar.LEXER );

(Tabs eaten by my email client when I pasted in the diff, it seems).
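
To make the intent of the patched condition concrete, here is a minimal
standalone sketch of the same check. It is not the CodeGenerator code:
FilterGuardSketch, GrammarType and filterMode are hypothetical names
made up for illustration, and the enum merely stands in for the grammar
type constants (only LEXER appears in the patch itself).

```java
class FilterGuardSketch {
    // Hypothetical stand-in for the grammar type constants;
    // only LEXER is taken from the patch above.
    enum GrammarType { LEXER, PARSER, COMBINED }

    // Mirrors the patched condition: the filter option only counts
    // when the grammar being generated is a lexer.
    static boolean filterMode(String filterOption, GrammarType type) {
        // "true".equals(...) is null-safe; that is a choice made for
        // this sketch, not something the patch changes.
        return "true".equals(filterOption) && type == GrammarType.LEXER;
    }

    public static void main(String[] args) {
        System.out.println(filterMode("true", GrammarType.LEXER));  // lexer: option honoured
        System.out.println(filterMode("true", GrammarType.PARSER)); // parser: option ignored
    }
}
```

With the extra clause, setting filter=true can no longer change what is
generated for the parser.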

Before signing off, some words of explanation as to why I want to be  
able to use a filtering lexer in combination with a parser:

I'm aware that the filter mode was intended to enable the creation of  
"fuzzy" lexers but I've also found it very useful for parsing things  
like wikitext or templating languages (PHP or any like it) where you  
have a large amount of free-form text (no special markup) studded  
with meaningful chunks of a more formal language (wikitext  
directives, PHP code sections etc). In this case you don't want to  
filter out and throw away the dross; you want to keep it.

Without filtering mode it is very hard to write a lexer for this kind  
of input, yet with filtering mode as it currently is implemented you  
can't really use a parser either. For one thing, backtracking gets  
turned on in the parser whether you want it or not, and much more  
crucially any actions which you define or @after blocks which you set
up will never be executed (although @init blocks will be); there may
be other issues as well but those are the ones I'm aware of.

Conversely, trying to write the lexer without filtering mode turned  
on is fiendishly difficult. You might have lexer rules like this:

FOO: 'foo';
BAR: 'bar';
DEFAULT: .;

The DEFAULT rule is intended to serve as a catch-all for everything  
which doesn't get tokenized by the other rules, but because ANTLR  
builds a predictive lexer, input such as "fob" will cause an  
exception to be thrown even though you might want it to be recognized  
as a run of DEFAULT tokens... syntactic predicates don't help in this
situation, as they really only help to select alternatives, and in the
case of a rule like FOO there are no alternatives and you'll still  
get messages like "mismatched character 'b' expecting 'o'" emitted  
during lexing:

FOO: ('foo')=> 'foo' ;

So basically I wanted filter=true in the lexer and the parser to be  
normal. So I tried working around the problem by splitting my lexer  
and parser into two separate files, a lexer grammar and a parser  
grammar, with "filter=true" set only in the lexer grammar. It was in  
this way that I discovered the second problem mentioned above (that  
filter=true is broken for lexer grammars and only works in combined  
grammars).

I also explored trying to mimic the behaviour of filter=true in the  
lexer without actually turning filtering on, but there are some  
special things that filter=true does that cannot be emulated just by  
playing with lexer rules (strict ordering of rules, special  
backtracking behaviour in which exceptions are never thrown etc).
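
To illustrate the two behaviours just mentioned (rules tried strictly in
the order they are declared, and a failed attempt rewinding instead of
throwing), here is a small hand-written Java sketch. It is only a
simulation under simplifying assumptions, not the code ANTLR generates:
FilterLexerSketch and tokenize are hypothetical names, the rules are
plain string literals, and the final branch stands in for the
"DEFAULT: .;" catch-all.

```java
import java.util.ArrayList;
import java.util.List;

class FilterLexerSketch {
    // Literal rules in declaration order; the first rule that matches wins.
    static final String[][] RULES = { { "FOO", "foo" }, { "BAR", "bar" } };

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String emitted = null;
            for (String[] rule : RULES) {
                // A failed attempt consumes nothing and throws nothing;
                // the next rule is simply tried from the same position.
                if (input.startsWith(rule[1], pos)) {
                    emitted = rule[0];
                    pos += rule[1].length();
                    break;
                }
            }
            if (emitted == null) {
                // Catch-all, standing in for "DEFAULT: .;": take one character.
                emitted = "DEFAULT(" + input.charAt(pos) + ")";
                pos++;
            }
            tokens.add(emitted);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [DEFAULT(f), DEFAULT(o), DEFAULT(b), DEFAULT( ), FOO]
        System.out.println(tokenize("fob foo"));
    }
}
```

In this sketch "fob" comes out as three DEFAULT tokens rather than an
error, which is the result I was after for the free-form text.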

So anyway, I'm hoping you can see the justification for this use
case, and that you'll accept my small fix.

Cheers,
Wincent







