[antlr-interest] How to specify ‘any non-control symbol’?
Johannes Luber
jaluber at gmx.de
Tue Oct 28 07:23:18 PDT 2008
Johannes Luber wrote:
> Hendrik Maryns wrote:
>> Johannes Luber wrote:
>>> Hendrik Maryns wrote:
>>>> Hi,
>>>>
>>>> I want to define a LABEL lexer rule which should match almost anything.
>>>> Let’s say any non-control Unicode symbol. Antlr wouldn’t accept .* or
>>>> .+. I probably don’t want a closing brace in there since it is a
>>>> lisp-like grammar, but even space would be fine (although it probably
>>>> won’t occur), so I did ~(')')+ but that feels like a hack. Can I use
>>>> POSIX regex classes such as p{alphnum} or something similar?
>>> Currently ANTLR doesn't support Unicode classes. The only workaround
>>> would be to define all code points manually (that is,
>>> semi-automatically, using an existing table as a starting point). You
>>> should be aware that ANTLR doesn't accept code points above \uffff, so
>>> you'd have to translate UTF-32 into UTF-16 surrogates.
>> This is what it already seems to do internally, see the attached image
>> ANTLRWorks produced.
>
> It looks to me as if the code handles merely UCS-2 and not UTF-16.
> Without seeing at least the rule you used as input I can't be entirely
> sure, though.
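For reference, here is a minimal Java sketch of the UTF-32 to UTF-16
translation mentioned above. The class and method names (SurrogateEscapes,
toEscapes) are made up for illustration; only Character.toChars is a
standard JDK call. It prints each code point as the 16-bit escapes one
would have to write by hand, on the assumption that the lexer only ever
sees 16-bit units.

```java
// Sketch only: SurrogateEscapes and toEscapes are hypothetical names.
final class SurrogateEscapes {

    // Renders a code point as one (BMP) or two (supplementary)
    // backslash-u style escapes, one per UTF-16 code unit.
    static String toEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars yields a single char for BMP code points
        // and a high/low surrogate pair for everything above U+FFFF.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A BMP code point stays a single 16-bit unit.
        System.out.println(toEscapes(0x00E9));
        // U+1D11E is split into the surrogate pair D834, DD1E.
        System.out.println(toEscapes(0x1D11E));
    }
}
```

A supplementary character would then presumably have to be matched in a
lexer rule as two consecutive 16-bit units, high surrogate first. Whether
a given ANTLR version treats such a pair correctly is worth testing.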
>>> BTW, while it at first seems to be a good idea to do this kind of
>>> discrimination in the lexer, you get far better error messages if you
>>> push the error checking into the parser. Doing so merely requires
>>> making the lexer discriminate between the potential classes in a
>>> minimal way. If you like, I can send you a lexer of mine using this
>>> strategy for comparison purposes.
>> I don’t understand this. What do you mean by ‘this kind of
>> discrimination’
>
> I mean checking the input in such a way that no illegal character can
> ever become part of a token, so that the lexer bails out immediately
> when it meets one. Then your error messages won't be able to say much
> about the context.
>
>> and in which way am I putting it in the lexer, and how could I
>> push it into the parser? I am afraid I am too new to this area to
>> follow you here.
>
> Let's assume that identifiers may not start with uppercase letters. Then
> the above-mentioned method would be to define the rule as:
>
> ident: LOWERCASE (LOWERCASE | UPPERCASE)*;
>
> My proposal is to use:
>
> ident: (LOWERCASE | UPPERCASE)+;
>
> Then the parser can tell you that "identifier PascalCase starts with an
> uppercase character".
>
> Johannes
I should amend the examples: one wouldn't create a token for each
character but build a single token, like:
IDENT: (LOWERCASE | UPPERCASE)+;
fragment LOWERCASE: 'a'..'z';
fragment UPPERCASE: 'A'..'Z';
That doesn't change the actual point.
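
The same strategy, sketched in plain Java rather than in grammar actions
so that it can be run on its own. Everything here is hypothetical (the
IdentCheck class, its method names, the regex standing in for the IDENT
rule); in a real grammar the check would sit in the parser or in a later
pass over the tree.

```java
import java.util.Optional;

// Sketch only: IdentCheck and its methods are hypothetical names.
final class IdentCheck {

    // Stand-in for the lenient lexer rule
    // IDENT: (LOWERCASE | UPPERCASE)+;
    static boolean isIdent(String text) {
        return text.matches("[a-zA-Z]+");
    }

    // Parser-side check: returns a descriptive message, or an empty
    // Optional if the identifier satisfies the naming rule.
    static Optional<String> check(String text) {
        if (!isIdent(text)) {
            return Optional.of("'" + text + "' is not an identifier");
        }
        // isIdent guarantees at least one character, so charAt(0) is safe.
        if (Character.isUpperCase(text.charAt(0))) {
            return Optional.of(
                "identifier " + text + " starts with an uppercase character");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"camelCase", "PascalCase"}) {
            System.out.println(candidate + ": " + check(candidate).orElse("ok"));
        }
    }
}
```

Because the lenient rule accepts PascalCase as one token, the message can
name the whole identifier instead of complaining about a single
unexpected character.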
Johannes
>
>> H.
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>>