[antlr-interest] howto ignore unknown tokenstreams/recordsets

Mon Jan 17 11:52:53 PST 2005

On Mon, 17 Jan 2005 19:44:13 +0100, Oliver Kowalke
<oliver.kowalke at gmx.de> wrote:
> On Monday, 17 January 2005 15:59, Ric Klaren wrote:
> >> I'm writing a parser which parses a document (datasets separated by
> >> semicolons).
> >> How can I ignore unknown datasets (in my example recordsets X and Y)?
> >
> >It depends a bit on what defines an unknown recordset. If you know
> >those start with X and Y then you can just skip them in the lexer like
> >whitespace. Or you can use tokenstream filtering between lexer and
> >parser. Although you should take care that there's no (or very
> >controlled) feedback from parser to lexer (when using tokenstream
> >filtering). Another approach might be to make some custom error
> >handlers in your parser that skip the unrecognized bits, though that
> >might interfere with normal error handling.
> 
> For an unknown recordset the leading key (X or Y in my example, in general
> <something else>) is not known. The structure of the recordsets to be
> ignored is <some letters> ( ~(";") )+ ";".
> Because I don't know <some letters> or the tokens that follow up to the ";",
> I cannot skip them in the lexer. (Right?)

Yup. 

But I get the impression that you can tokenize the unknown records? If
that's the case, then make a rule in the parser that works as a
catch-all. You'll get ambiguity warnings, but the first matching
alternative is the one that gets matched, so things should be OK.
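
A rough sketch of what such a catch-all could look like in ANTLR 2
syntax. The names GenericParser, document, record, recordA, recordB,
unknownRecord, INT and SEMI are placeholders of mine, not taken from
your grammar, and I'm assuming the lexer hands back some generic token
(called ID here) for a leading key it does not recognize:

```
class GenericParser extends Parser;

document
    : ( record )* EOF
    ;

// Known records come first, the catch-all comes last.
record
    : recordA
    | recordB
    | unknownRecord
    ;

// Placeholder bodies; substitute the real structure of your records.
recordA : A INT SEMI ;
recordB : B INT SEMI ;

// Unknown record: <some letters> ( ~(";") )+ ";"
// Consumes any tokens up to and including the terminating semicolon.
unknownRecord
    : ID ( ~SEMI )+ SEMI
    ;
```

If the catch-all overlaps with one of the known alternatives, that is
where the ambiguity warnings would come from, which is why it is listed
last.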

OK, I had a closer look at the lexer/parser you posted earlier. It looks
to me like the lexer is not 100% functional. At least I don't see how
it could match the A-D tokens without a rule that has testLiterals = true.

Try adding a lexer rule:

ID options { testLiterals = true; } : ( 'A' .. 'Z' )+ ;

This one matches everything that consists only of uppercase letters.
The testLiterals option makes sure the items added in the tokens
section get recognized as such (before returning from the ID rule,
ANTLR checks the matched text against the entries in the literals
table). I.e. the known keys get passed to the parser as A .. D and the
unknown ones get passed as ID. You could use that to make the
catch-all rule. At least that should be the general idea, I think.
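
To make that a bit more concrete, here is a minimal lexer sketch in
ANTLR 2 syntax. I'm guessing at the details: the class name
GenericLexer, the SEMI rule, the class-level testLiterals = false and
the exact literal strings in the tokens section are my assumptions and
may not match what you posted:

```
class GenericLexer extends Lexer;

options {
    // Assumption: literal testing is switched off globally and
    // enabled only for the ID rule below.
    testLiterals = false;
}

// Known leading keys, entered into the literals table.
tokens {
    A = "A";
    B = "B";
    C = "C";
    D = "D";
}

// Matches any run of uppercase letters. Because of testLiterals, the
// matched text is looked up in the literals table before the token is
// returned: "A" comes out as token type A, anything else as ID.
ID options { testLiterals = true; }
    : ( 'A' .. 'Z' )+
    ;

SEMI : ';' ;
```

With something like this, a leading key such as X or Y is not in the
literals table, so it reaches the parser as a plain ID.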

Cheers,

Ric