[antlr-interest] How to specify ‘any non-control symbol’?

Thu Oct 30 07:46:25 PDT 2008

Johannes Luber wrote:
> Johannes Luber wrote:
>> Hendrik Maryns wrote:
>>> Johannes Luber wrote:
>>>> Hendrik Maryns wrote:
>>>>> Hi,
>>>>>
>>>>> I want to define a LABEL lexer rule which should match almost anything.
>>>>>  Let’s say any non-control Unicode symbol.  Antlr wouldn’t accept .* or
>>>>> .+.  I probably don’t want a closing brace in there since it is a
>>>>> lisp-like grammar, but even space would be fine (although it probably
>>>>> won’t occur), so I did ~(')')+ but that feels like a hack.  Can I use
>>>>> POSIX regex classes such as \p{Alnum} or something of the kind?
>>>> Currently ANTLR doesn't support Unicode classes. The only workaround
>>>> would be to define manually all code points (manually means
>>>> semi-automatic via use of some existing table as starting point). You
>>>> should be aware that ANTLR doesn't accept code points above \uffff, so
>>>> you'd have to translate UTF-32 into UTF-16 surrogates.
>>> This is what it already seems to do internally, see the attached image
>>> ANTLRWorks produced.
>> It looks to me as if the code handles merely UCS-2 and not UTF-16.
>> Without seeing at least the rule you used as input I can't be entirely
>> sure, though.
>>>> BTW, while at first it seems to be a good idea to do this kind of
>>>> discrimination in the lexer, you get far better error messages if you
>>>> push the error checking into the parser. Doing so merely requires
>>>> making the lexer discriminate between the potential classes in a
>>>> minimal way. If you like, I can send you a lexer of mine that uses
>>>> this strategy for comparison purposes.
>>> I don’t understand this.  What do you mean by ‘this kind of
>>> discrimination’
>> I mean checking the input in such a way that no illegal character is
>> used to create a token, with the lexer bailing out immediately instead.
>> Then your error messages won't be able to say much about the context.
>>
>>> and in which way am I putting it in the lexer and could
>>> push it into the parser?  I am afraid I am too new in this area to
>>> follow you here.
>> Let's assume that identifiers may not start with uppercase letters. Then
>> the above-mentioned method would be to define the rule as:
>>
>> ident: LOWERCASE (LOWERCASE | UPPERCASE)*;
>>
>> My proposal is to use:
>>
>> ident: (LOWERCASE | UPPERCASE)+;
>>
>> Then the parser can tell you that "identifier PascalCase starts with an
>> uppercase character".
>>
>> Johannes
> 
> I should amend the examples: one wouldn't create a token for each
> character but build a single token like:
> 
> IDENT: (LOWERCASE | UPPERCASE)+;
> 
> fragment LOWERCASE: 'a'..'z';
> 
> fragment UPPERCASE: 'A'..'Z';
> 
> Doesn't change the actual point.

I am afraid this is no viable solution for me.  Let me explain what I want:

label : labelHead VARIABLE LABEL ;

labelHead is a fixed group of possibilities, VARIABLE can be any
identifier of letters and numbers.  This part is easy, but LABEL can be
anything: from -- to 3.Sg.Acc to Einführung to боекомплект to
𐎠𐎿𐎶𐎠𐎴𐎶 to みだれる.

I don’t feel like trying to accommodate all of these character classes
by specifying \u ranges.
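
That said, the ranges would not have to be typed by hand. Below is a
minimal Java sketch of the semi-automatic route suggested above. It
assumes that "non-control" means "not in Unicode category Cc", which is
what Character.isISOControl tests. The class and method names
(NonControlRanges, allowed, ranges, asUtf16) are made up for this
illustration, and I have not tried feeding the output to ANTLR.

```java
import java.util.ArrayList;
import java.util.List;

public class NonControlRanges {

    // Hypothetical predicate for what LABEL may contain.
    // Assumption: a control symbol is a code point in category Cc.
    static boolean allowed(int codePoint) {
        return !Character.isISOControl(codePoint);
    }

    // Collects maximal runs of allowed code points in [from, to] and
    // formats each run in the 'x'..'y' range notation used in this thread.
    static List<String> ranges(int from, int to) {
        List<String> result = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a run"
        for (int cp = from; cp <= to; cp++) {
            if (allowed(cp)) {
                if (start < 0) start = cp;
            } else if (start >= 0) {
                result.add(format(start, cp - 1));
                start = -1;
            }
        }
        if (start >= 0) result.add(format(start, to));
        return result;
    }

    static String format(int lo, int hi) {
        return String.format("'\\u%04X'..'\\u%04X'", lo, hi);
    }

    // Shows how a code point is represented in UTF-16: one code unit
    // inside the BMP, a surrogate pair above it.
    static String asUtf16(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Only the BMP, since code points above it are said not to be
        // accepted directly.
        for (String r : ranges(0x0000, 0xFFFF)) System.out.println(r);
        // First sign of the Old Persian example, U+103A0.
        System.out.println(asUtf16(0x103A0));
    }
}
```

With that predicate the whole BMP collapses into just two ranges,
'\u0020'..'\u007E' and '\u00A0'..'\uFFFF', and U+103A0 comes out as the
surrogate pair \uD800\uDFA0, which falls inside the second range. A
stricter predicate (say, one that also rejects ')' or whitespace) would
only need a different allowed().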

I see that probably I should make VARIABLE a parser rule which delegates
to a lexer rule IDENTIFIER or something, and indeed do some checking on
that like you suggest, but really my problem is LABEL.

H.
-- 
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
