[antlr-interest] How to specify ‘any non-control symbol’?

Tue Oct 28 06:02:31 PDT 2008

Hendrik Maryns wrote:
> Hi,
> 
> I want to define a LABEL lexer rule which should match almost anything.
>  Let’s say any non-control Unicode symbol.  Antlr wouldn’t accept .* or
> .+.  I probably don’t want a closing brace in there since it is a
> lisp-like grammar, but even space would be fine (although it probably
> won’t occur), so I did ~(')')+ but that feels like a hack.  Can I use
> POSIX regex classes such as p{alphnum} or something of the like?
> 
> H.

Currently ANTLR doesn't support Unicode classes. The only workaround
is to list all the code points by hand (in practice semi-automatically,
using some existing table as a starting point). Be aware that ANTLR
doesn't accept code points above \uffff, so you'd have to translate
anything beyond that range from UTF-32 into UTF-16 surrogate pairs.
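
As a rough sketch of the semi-automatic approach, the small Java program
below uses the JDK's own character tables as the "existing table". It
collects runs of acceptable code points in the BMP and prints them as
range alternatives for a lexer fragment. The class name, the rule name
LABEL_CHAR, and the reading of "non-control" (assigned, and not control,
format, surrogate, or private-use) are assumptions made for illustration.
The ranges also still contain ')', which the grammar would have to
exclude separately.

```java
import java.util.ArrayList;
import java.util.List;

final class NonControlRanges {

    // Assumed reading of "non-control": any assigned code point that is not a
    // control, format, surrogate, or private-use character.
    static boolean isLabelChar(int cp) {
        switch (Character.getType(cp)) {
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.SURROGATE:
            case Character.PRIVATE_USE:
            case Character.UNASSIGNED:
                return false;
            default:
                return true;
        }
    }

    // Collects maximal runs {lo, hi} of accepted code points in the BMP.
    static List<int[]> bmpRanges() {
        List<int[]> ranges = new ArrayList<>();
        int start = -1;
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (isLabelChar(cp)) {
                if (start < 0) start = cp;
            } else if (start >= 0) {
                ranges.add(new int[] {start, cp - 1});
                start = -1;
            }
        }
        if (start >= 0) ranges.add(new int[] {start, 0xFFFF});
        return ranges;
    }

    // Formats one run as a lexer alternative: a single escaped character,
    // or two escaped characters joined by the range operator.
    static String toAntlr(int[] r) {
        if (r[0] == r[1]) return String.format("'\\u%04X'", r[0]);
        return String.format("'\\u%04X'..'\\u%04X'", r[0], r[1]);
    }

    // Splits a supplementary code point (above the BMP) into the two
    // UTF-16 code units that a 16-bit lexer would have to match in sequence.
    static String surrogates(int cp) {
        if (!Character.isSupplementaryCodePoint(cp)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        char[] units = Character.toChars(cp);
        return String.format("'\\u%04X' '\\u%04X'", (int) units[0], (int) units[1]);
    }

    public static void main(String[] args) {
        StringBuilder rule = new StringBuilder("fragment LABEL_CHAR\n    : ");
        List<int[]> ranges = bmpRanges();
        for (int i = 0; i < ranges.size(); i++) {
            if (i > 0) rule.append("\n    | ");
            rule.append(toAntlr(ranges.get(i)));
        }
        rule.append("\n    ;");
        System.out.println(rule);
        System.out.println("// U+1D11E as a surrogate pair: " + surrogates(0x1D11E));
    }
}
```

The exact list of ranges depends on the Unicode version of the JDK that
runs the generator, so the output should be regenerated rather than
copied between setups.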

BTW, while at first it seems to be a good idea to do this kind of
discrimination in the lexer, you get far better error messages if you
push the error checking into the parser. Doing so merely requires
making the lexer distinguish the potential classes in the most minimal
way. If you like, I can send you a lexer of mine that uses this strategy
for comparison purposes.
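
To illustrate the strategy in miniature (this is not the lexer offered
above, just a hand-written toy with hypothetical names), the Java sketch
below lets the lexer treat anything that is not a parenthesis or
whitespace as a LABEL. A later pass, standing in for the parser, then
checks each label and reports a precise message.

```java
import java.util.ArrayList;
import java.util.List;

final class PermissiveLabels {

    // A token with its kind, its text, and its start offset in the input.
    static final class Token {
        final String kind;
        final String text;
        final int offset;

        Token(String kind, String text, int offset) {
            this.kind = kind;
            this.text = text;
            this.offset = offset;
        }
    }

    // Permissive lexer: parentheses and whitespace delimit tokens, and
    // every other run of characters becomes a LABEL without validation.
    static List<Token> lex(String in) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c == '(') {
                out.add(new Token("LPAREN", "(", i++));
            } else if (c == ')') {
                out.add(new Token("RPAREN", ")", i++));
            } else if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                i++;
            } else {
                int start = i;
                while (i < in.length() && "() \t\r\n".indexOf(in.charAt(i)) < 0) i++;
                out.add(new Token("LABEL", in.substring(start, i), start));
            }
        }
        return out;
    }

    // Later check: reports the first control character found in each label,
    // with its offset, instead of a generic "no viable alternative" error.
    static List<String> check(List<Token> tokens) {
        List<String> errors = new ArrayList<>();
        for (Token t : tokens) {
            if (!t.kind.equals("LABEL")) continue;
            for (int i = 0; i < t.text.length(); ) {
                int cp = t.text.codePointAt(i);
                if (Character.isISOControl(cp)) {
                    errors.add(String.format(
                        "offset %d: control character U+%04X in label", t.offset + i, cp));
                    break;
                }
                i += Character.charCount(cp);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        String input = "(foo ba" + (char) 7 + "r)";
        for (String e : check(lex(input))) System.out.println(e);
    }
}
```

The lexer never fails on an odd character here; the decision about what
counts as a valid label lives in one place, where the message can name
the offending character and its position.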

Johannes