[antlr-interest] Patch for filter mode
Wincent Colaiuta
win at wincent.com
Sat Jun 9 18:16:31 PDT 2007
Terence has previously written
(<http://www.antlr.org/pipermail/antlr-interest/2007-May/020942.html>):
> filter=true only works in the lexer. :) You should not really have
> a parser in this case because you cannot really apply a grammatical
> structure to the incomplete stream of tokens emanating from a lexer
> that filters most stuff out.
And indeed the ANTLR book on page 119 documents the filter option as
being "lexer only"... I've found what appear to be two bugs which
relate to this:
1. Even though this is documented as a lexer-only option, it has
effect in the parser as well; the effects include automatically
turning on backtracking in the parser and preventing all parser
actions from running (they appear in the generated code but as far as
I can tell no codepath ever reaches them)
2. "filter = true" doesn't work for lexer grammars which are declared
in a separate file as "lexer grammars"
The following simple patch to src/org/Antlr/codegen/CodeGenerator.java
fixes the first problem. The second one isn't
really of concern to me because I haven't yet had a need to use
anything other than a combined lexer/parser grammar in a single file:
301c301,302
< grammar.getOption("filter").equals("true");
---
> grammar.getOption("filter").equals("true") &&
> ( grammar.type==Grammar.LEXER );
(Tabs eaten by my email client when I pasted in the diff, it seems).
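To make the intent of the change easier to see in isolation, here is a minimal stand-alone sketch of the patched condition. The GrammarType enum and the filterMode helper are hypothetical stand-ins invented for this illustration; they are not ANTLR's actual classes, and the real check lives inside CodeGenerator.java as shown in the diff above.

```java
// Stand-alone sketch of the patched condition; NOT ANTLR's real code.
// GrammarType and filterMode are hypothetical names used only here.
class FilterModeCheck {
    enum GrammarType { LEXER, PARSER, COMBINED }

    // Filter mode applies only when the option is "true" AND the
    // grammar being generated is a lexer.
    static boolean filterMode(String filterOption, GrammarType type) {
        return "true".equals(filterOption) && type == GrammarType.LEXER;
    }

    public static void main(String[] args) {
        System.out.println(filterMode("true", GrammarType.LEXER));
        System.out.println(filterMode("true", GrammarType.PARSER));
    }
}
```

With the extra type test, setting filter=true no longer changes how a non-lexer grammar is treated in this sketch, which is the behaviour the patch is after.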
Before signing off, some words of explanation as to why I want to be
able to use a filtering lexer in combination with a parser:
I'm aware that the filter mode was intended to enable the creation of
"fuzzy" lexers but I've also found it very useful for parsing things
like wikitext or templating languages (PHP or any like it) where you
have a large amount of free-form text (no special markup) studded
with meaningful chunks of a more formal language (wikitext
directives, PHP code sections etc). In this case you don't want to
filter out and throw away the dross; you want to keep it.
Without filtering mode it is very hard to write a lexer for this kind
of input, yet with filtering mode as it currently is implemented you
can't really use a parser either. For one thing, backtracking gets
turned on in the parser whether you want it or not, and much more
crucially any actions which you define or @after blocks which you set
up will never be executed (although @init blocks will be); there may
be other issues as well but those are the ones I'm aware of.
Conversely, trying to write the lexer without filtering mode turned
on is fiendishly difficult. You might have lexer rules like this:
FOO: 'foo';
BAR: 'bar';
DEFAULT: .;
The DEFAULT rule is intended to serve as a catch-all for everything
which doesn't get tokenized by the other rules, but because ANTLR
builds a predictive lexer, input such as "fob" will cause an
exception to be thrown even though you might want it to be recognized
as a run of DEFAULT tokens... syntactic predicates don't help in this
situation, as they really only help to select alternatives, and in the
case of a rule like FOO there are no alternatives and you'll still
get messages like "mismatched character 'b' expecting 'o'" emitted
during lexing:
FOO: ('foo')=> 'foo' ;
So basically I wanted filter=true in the lexer and the parser to be
normal. So I tried working around the problem by splitting my lexer
and parser into two separate files, a lexer grammar and a parser
grammar, with "filter=true" set only in the lexer grammar. It was in
this way that I discovered the second problem mentioned above (that
filter=true is broken for lexer grammars and only works in combined
grammars).
I also explored trying to mimic the behaviour of filter=true in the
lexer without actually turning filtering on, but there are some
special things that filter=true does that cannot be emulated just by
playing with lexer rules (strict ordering of rules, special
backtracking behaviour in which exceptions are never thrown etc).
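As a rough illustration of the behaviour I'm after (rules tried in strict order, a failed attempt rewinds instead of throwing, and leftover characters are kept rather than discarded), here is a toy hand-written tokenizer. It is only a sketch of the idea under those assumptions: the ToyFilterLexer class and its token naming are made up for this example, and it is not the code that ANTLR generates.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of "try each rule in order, never throw"; NOT ANTLR output.
class ToyFilterLexer {
    // Literal rules in priority order (LinkedHashMap keeps insertion order).
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("FOO", "foo");
        RULES.put("BAR", "bar");
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String matched = null;
            // Try every rule at the current position, in order.
            for (Map.Entry<String, String> rule : RULES.entrySet()) {
                if (input.startsWith(rule.getValue(), pos)) {
                    matched = rule.getKey();
                    pos += rule.getValue().length();
                    break;
                }
            }
            if (matched == null) {
                // No rule matched: keep one character as a DEFAULT token
                // instead of raising a mismatched-character error.
                matched = "DEFAULT(" + input.charAt(pos) + ")";
                pos++;
            }
            tokens.add(matched);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("fob"));
        System.out.println(tokenize("foo bar"));
    }
}
```

Here "fob" comes out as three DEFAULT tokens rather than an error part-way through FOO, which is what the DEFAULT rule above was meant to achieve.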
So anyway, I'm hoping you can see the justification for this use
case, and that you'll accept my small fix.
Cheers,
Wincent