[antlr-interest] How to specify ‘any non-control symbol’?

Tue Oct 28 07:23:18 PDT 2008

Johannes Luber wrote:
> Hendrik Maryns wrote:
>> Johannes Luber wrote:
>>> Hendrik Maryns wrote:
>>>> Hi,
>>>>
>>>> I want to define a LABEL lexer rule which should match almost anything.
>>>> Let’s say any non-control Unicode symbol.  ANTLR wouldn’t accept .* or
>>>> .+.  I probably don’t want a closing parenthesis in there since it is a
>>>> lisp-like grammar, but even a space would be fine (although it probably
>>>> won’t occur), so I used ~(')')+ but that feels like a hack.  Can I use
>>>> POSIX regex classes such as \p{Alnum} or something like that?
>>> Currently ANTLR doesn't support Unicode classes. The only workaround
>>> would be to define manually all code points (manually means
>>> semi-automatic via use of some existing table as starting point). You
>>> should be aware that ANTLR doesn't accept code points above \uffff, so
>>> you'd have to translate UTF-32 into UTF-16 surrogates.
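The semi-automatic approach could start from a table that is already at
hand, for example Java's own character data. Below is a minimal sketch in
Java. RangeTable and its methods are hypothetical names, not anything ANTLR
provides, and "control" is assumed to mean exactly what
Character.isISOControl reports (U+0000..U+001F and U+007F..U+009F). It
builds range alternatives that could be pasted into a lexer rule.

```java
// Sketch only: RangeTable is an illustrative helper, not part of ANTLR.
final class RangeTable {
    /** Formats one 16-bit code unit as a quoted character literal with a 4-digit hex escape. */
    static String literal(int unit) {
        return String.format("'\\u%04X'", unit);
    }

    /** Builds 'a'..'b' | 'c'..'d' alternatives covering every non-control value in [first, last]. */
    static String nonControlRanges(int first, int last) {
        StringBuilder sb = new StringBuilder();
        int start = -1; // start of the range being collected, or -1 if none is open
        // Run one step past 'last' so that a range still open at the end gets closed.
        for (int c = first; c <= last + 1; c++) {
            boolean wanted = c <= last && !Character.isISOControl(c);
            if (wanted && start < 0) {
                start = c;
            } else if (!wanted && start >= 0) {
                if (sb.length() > 0) {
                    sb.append(" | ");
                }
                sb.append(literal(start)).append("..").append(literal(c - 1));
                start = -1;
            }
        }
        return sb.toString();
    }
}
```

For example, RangeTable.nonControlRanges(0x0000, 0x00FF) yields
'\u0020'..'\u007E' | '\u00A0'..'\u00FF'. Surrogates, unassigned code points
and the closing parenthesis are not filtered out here, so the result is only
a starting point for the real rule.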
>> This is what it already seems to do internally; see the attached image
>> that ANTLRWorks produced.
> 
> It looks to me as if the code handles only UCS-2 and not UTF-16.
> Without seeing at least the rule you used as input, I can't be entirely
> sure, though.
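To make the UCS-2 versus UTF-16 difference concrete: a code point up to
U+FFFF is a single 16-bit unit in both, while anything above it becomes a
high/low surrogate pair in UTF-16 and cannot be represented in UCS-2 at all.
A small Java sketch follows (SurrogatePairs is an illustrative name, not
ANTLR API). It uses the standard Character.toChars to show which units a
lexer rule would have to match.

```java
// Sketch only: SurrogatePairs is an illustrative helper, not part of ANTLR.
final class SurrogatePairs {
    /** Returns the UTF-16 code units of a code point, each written as a 4-digit hex escape. */
    static String toUtf16Escapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars returns one unit for the BMP and two (a surrogate pair) above it.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUtf16Escapes(0x00E9));  // one unit
        System.out.println(toUtf16Escapes(0x1D11E)); // two units: high and low surrogate
    }
}
```

toUtf16Escapes(0x1D11E) gives \uD834\uDD1E, so a rule for that symbol would
have to match those two units in sequence.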
>>> BTW, while it at first seems to be a good idea to do this kind of
>>> discrimination in the lexer, you get far better error messages if you
>>> push the error checking into the parser. Doing so merely requires the
>>> lexer to discriminate between the potential classes in a minimal way. If
>>> you like, I can send you a lexer of mine that uses this strategy for
>>> comparison purposes.
>> I don’t understand this.  What do you mean by ‘this kind of
>> discrimination’
> 
> I mean checking the input in such a way that no illegal character can end
> up in a token, so that the lexer bails out immediately instead. Then your
> error messages won't be able to say much about the context.
> 
>> and in what way am I putting it in the lexer, and how could I
>> push it into the parser?  I am afraid I am too new to this area to
>> follow you here.
> 
> Let's assume that identifiers may not start with an uppercase letter. Then
> the above-mentioned method would be to define the rule as:
> 
> ident: LOWERCASE (LOWERCASE | UPPERCASE)*;
> 
> My proposal is to use:
> 
> ident: (LOWERCASE | UPPERCASE)+;
> 
> Then the parser can tell you that "identifier PascalCase starts with an
> uppercase character".
> 
> Johannes

I should amend the examples: one wouldn't create a token for each
character but would build a single token, like:

IDENT: (LOWERCASE | UPPERCASE)+;

fragment LOWERCASE: 'a'..'z';

fragment UPPERCASE: 'A'..'Z';

Doesn't change the actual point.
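
As a rough illustration of the same split outside ANTLR, here is a
hand-written Java sketch: the lexing step accepts both cases, and a later
check produces the message. IdentCheck, lexIdent and checkIdent are made-up
names for this example. In a real grammar the check could live in a parser
rule action or in a later pass.

```java
// Sketch only: IdentCheck and its methods are illustrative names, not ANTLR API.
final class IdentCheck {
    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    /** Lexer step, mirroring IDENT: returns the leading run of letters, or null if there is none. */
    static String lexIdent(String input) {
        int end = 0;
        while (end < input.length() && isAsciiLetter(input.charAt(end))) {
            end++;
        }
        return end == 0 ? null : input.substring(0, end);
    }

    /** Parser step: returns an error message for the token, or null when it is acceptable. */
    static String checkIdent(String ident) {
        if (Character.isUpperCase(ident.charAt(0))) {
            return "identifier " + ident + " starts with an uppercase character";
        }
        return null;
    }
}
```

Because the lexer hands over the whole token, the check can name the
offending identifier instead of failing on its first character.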

Johannes
> 
>> H.
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>>