[antlr-interest] How to specify ‘any non-control symbol’?

Tue Oct 28 07:18:38 PDT 2008

Hendrik Maryns wrote:
> Johannes Luber wrote:
>> Hendrik Maryns wrote:
>>> Hi,
>>>
>>> I want to define a LABEL lexer rule which should match almost anything,
>>> let’s say any non-control Unicode symbol.  ANTLR wouldn’t accept .* or
>>> .+.  I probably don’t want a closing parenthesis in there since it is a
>>> lisp-like grammar, but even space would be fine (although it probably
>>> won’t occur), so I did ~(')')+ but that feels like a hack.  Can I use
>>> POSIX regex classes such as \p{Alnum} or something like that?
>> Currently ANTLR doesn't support Unicode classes. The only workaround
>> would be to define all code points manually (manually meaning
>> semi-automatically, using some existing table as a starting point). You
>> should be aware that ANTLR doesn't accept code points above \uffff, so
>> you'd have to translate anything above that into UTF-16 surrogate pairs.
> 
> This is what it already seems to do internally, see the attached image
> ANTLRWorks produced.

It looks to me as if the code handles only UCS-2 and not UTF-16.
Without seeing at least the rule you used as input I can't be entirely
sure, though.
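
To make the "semi-automatic" route and the surrogate issue a bit more
concrete, here is a rough Java sketch. It assumes that "non-control" simply
means "not an ISO control character" as reported by Character.isISOControl,
which is only one possible reading (it does not exclude unassigned code
points or format characters). The class and method names are made up for
illustration, and the output is only meant as a starting point to paste
into a lexer rule, not something ANTLR provides itself.

```java
import java.util.ArrayList;
import java.util.List;

class NonControlRanges {

    /** Formats a UTF-16 code unit as a backslash-u escape with four hex digits. */
    static String escape(int codeUnit) {
        return String.format("\\u%04X", codeUnit);
    }

    /**
     * Collects maximal runs of BMP code units that are not ISO control
     * characters, formatted as quoted character ranges joined by "..".
     * Surrogate code units are skipped so they can be handled separately.
     */
    static List<String> nonControlBmpRanges() {
        List<String> ranges = new ArrayList<>();
        int start = -1;
        // Run one step past 0xFFFF so that the final open range gets closed.
        for (int c = 0; c <= 0x10000; c++) {
            boolean wanted = c <= 0xFFFF
                    && !Character.isISOControl(c)
                    && !Character.isSurrogate((char) c);
            if (wanted && start < 0) {
                start = c;
            } else if (!wanted && start >= 0) {
                ranges.add("'" + escape(start) + "'..'" + escape(c - 1) + "'");
                start = -1;
            }
        }
        return ranges;
    }

    /** Splits a supplementary code point into its two UTF-16 surrogates. */
    static String surrogatePair(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not above the BMP: " + codePoint);
        }
        char[] units = Character.toChars(codePoint);
        return escape(units[0]) + escape(units[1]);
    }
}
```

Joining the strings from nonControlBmpRanges() with " | " gives the body of
a fragment rule. A code point above the BMP has to be written as the two
code units that surrogatePair() returns, which is exactly the part a
UCS-2-only lexer would not treat as a single character.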
> 
>> BTW, while at first it seems to be a good idea to do this kind of
>> discrimination in the lexer, you get far better error messages if you
>> push the error checking into the parser. Doing so merely requires that
>> the lexer distinguish the potential character classes in the most
>> minimal way. If you like, I can send you a lexer of mine that uses this
>> strategy for comparison purposes.
> 
> I don’t understand this.  What do you mean by ‘this kind of
> discrimination’,

I mean checking the input in such a way that no illegal character can ever
become part of a token, which makes the lexer bail out immediately on bad
input. Your error messages then can't say much about the context.

> and in which way am I putting it in the lexer, and how could I
> push it into the parser?  I am afraid I am too new to this area to
> follow you here.

Let's assume that identifiers may not start with an uppercase letter. Then
the above-mentioned method would be to define the rule as:

ident: LOWERCASE (LOWERCASE | UPPERCASE)*;

My proposal is to use:

ident: (LOWERCASE | UPPERCASE)+;

Then the parser can tell you that "identifier PascalCase starts with an
uppercase character".
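
As a minimal sketch of where that message could come from: once the lenient
rule has accepted the text, a small check can run on it afterwards, for
instance from a parser action. The class IdentCheck and its method are
hypothetical names of my own, not part of ANTLR.

```java
import java.util.Optional;

class IdentCheck {

    /**
     * Hypothetical post-lexing check. The lexer has already accepted any
     * mix of letters, so the rule "must not start with an uppercase letter"
     * is enforced here, where a specific message can be produced.
     * Returns the error message, or an empty Optional if the name is fine.
     */
    static Optional<String> check(String ident) {
        if (!ident.isEmpty() && Character.isUpperCase(ident.charAt(0))) {
            return Optional.of("identifier " + ident
                    + " starts with an uppercase character");
        }
        return Optional.empty();
    }
}
```

With the strict rule, PascalCase never becomes an ident at all and you only
get a generic lexer error. With the lenient rule plus a check like this, the
whole identifier is available when the message is built.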

Johannes

> 
> H.