[antlr-interest] How to specify ‘any non-control symbol’?

Thu Oct 30 08:10:32 PDT 2008

Hendrik Maryns wrote:
> Johannes Luber wrote:
>> Johannes Luber wrote:
>>> Hendrik Maryns wrote:
>>>> Johannes Luber wrote:
>>>>> Hendrik Maryns wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I want to define a LABEL lexer rule which should match almost anything.
>>>>>>  Let’s say any non-control Unicode symbol.  Antlr wouldn’t accept .* or
>>>>>> .+.  I probably don’t want a closing brace in there since it is a
>>>>>> lisp-like grammar, but even space would be fine (although it probably
>>>>>> won’t occur), so I did ~(')')+ but that feels like a hack.  Can I use
>>>>>> POSIX regex classes such as p{alphnum} or something of the like?
>>>>> Currently ANTLR doesn't support Unicode classes. The only workaround
>>>>> would be to define manually all code points (manually means
>>>>> semi-automatic via use of some existing table as starting point). You
>>>>> should be aware that ANTLR doesn't accept code points above \uffff, so
>>>>> you'd have to translate UTF-32 into UTF-16 surrogates.
>>>> This is what it already seems to do internally, see the attached image
>>>> Antlrworks produced.
>>> It looks to me as if the code handles merely UCS-2 and not UTF-16.
>>> Without seeing at least the rule you used as input I can't be entirely sure,
>>> though.
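
For reference, the translation into UTF-16 surrogates mentioned above can be sketched in a few lines of Python. This is only an illustration of the standard UTF-16 arithmetic, not ANTLR code; the helper name to_surrogates is made up for this example.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = cp - 0x10000
    # The top 10 bits go into the high surrogate, the low 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# U+103A0 is the first letter of the Old Persian example label.
high, low = to_surrogates(0x103A0)
print("'\\u%04x' '\\u%04x'" % (high, low))
```

Both halves lie in the range \ud800..\udfff, so a rule whose ranges include that block matches such characters one code unit at a time, assuming the lexer reads UTF-16 code units.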
>>>>> BTW, while it at first seems to be a good idea to do this kind of
>>>>> discrimination in the lexer, you get far better error messages if you
>>>>> push the error checking into the parser. Doing so requires merely to
>>>>> make the lexer discriminate the potential classes in the minimal way. If
>>>>> you like I can send you a lexer of mine using this strategy for
>>>>> comparison purposes.
>>>> I don’t understand this.   What do you mean by ‘this kind of
>>>> discriminations’
>>> I mean checking the input in such a way that no illegal character is used
>>> to create a token, but making the lexer bail out immediately. Then your
>>> error messages won't be able to say much about the context.
>>>
>>>> and in which way am I putting it in the lexer and could
>>>> push it into the parser?  I am afraid I am too new in this area to
>>>> follow you here.
>>> Let's assume that identifiers may not start with uppercase letters. Then
>>> the above mentioned method would be to define the rule as:
>>>
>>> ident: LOWERCASE (LOWERCASE | UPPERCASE)*;
>>>
>>> My proposal is to use:
>>>
>>> ident: (LOWERCASE | UPPERCASE)+;
>>>
>>> Then the parser can tell you that "identifier PascalCase starts with an
>>> uppercase character".
>>>
>>> Johannes
>> I should amend the examples that one wouldn't create tokens for each
>> character but build a single token like:
>>
>> IDENT: (LOWERCASE | UPPERCASE)+;
>>
>> fragment LOWERCASE: 'a'..'z';
>>
>> fragment UPPERCASE: 'A'..'Z';
>>
>> Doesn't change the actual point.
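
The idea of lexing permissively and reporting the problem later can be illustrated outside ANTLR. The following is a minimal Python sketch, not generated parser code; the regular expression stands in for the IDENT rule, and check_identifier is a hypothetical name for the later check.

```python
import re

# Permissive "lexer" pattern, mirroring IDENT: (LOWERCASE | UPPERCASE)+
IDENT = re.compile(r"[A-Za-z]+")


def check_identifier(text):
    """Return an error message for an invalid identifier, or None if it is fine."""
    if IDENT.fullmatch(text) is None:
        return f"'{text}' is not an identifier"
    # The stricter rule is checked after tokenizing, so the message can be specific.
    if text[0].isupper():
        return f"identifier {text} starts with an uppercase character"
    return None


print(check_identifier("PascalCase"))
```

Had the pattern itself required a leading lowercase letter, the only thing left to report for PascalCase would be a generic failure at the first character.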
> 
> I am afraid this is no viable solution for me.  Let me explain what I want:
> 
> label : labelHead VARIABLE LABEL ;
> 
> labelHead is a fixed group of possibilities, VARIABLE can be any
> identifier of letters and numbers.  This part is easy, but LABEL can be
> anything: from -- to 3.Sg.Acc to Einführung to боекомплект to
> 𐎠𐎿𐎶𐎠𐎴𐎶 to みだれる.

Have fun searching Wiktionary, they’re all there :-p

> I don’t feel like trying to accommodate all of these character classes
> by specifying \u ranges.

As a workaround, I have
// I hate to do this, but it seems I have to specify explicit unicode
// ranges here
// () come between ' and *, leave out control characters, whitespace and ()
LABEL :
  ( '!'..'\''
  | '*'..'\uffff'
  )+ ;

which works with the inputs above, but I am afraid of modifier letters,
CJK spacing and stuff.
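
One way to get the ranges semi-automatically from an existing table, as suggested earlier in the thread, is to let a script walk the Unicode database. The sketch below uses Python's unicodedata module and rests on several assumptions: "control" means category Cc, "whitespace" means the separator categories Zs, Zl and Zp, and only code points up to \uffff are emitted. Surrogates are deliberately kept so that characters above \uffff still match as pairs of code units. All function names are made up for this example.

```python
import unicodedata

# Assumption: control characters are category Cc, whitespace is any separator.
EXCLUDED_CATEGORIES = {"Cc", "Zs", "Zl", "Zp"}
EXCLUDED_CHARS = {"(", ")"}


def is_label_char(cp):
    """True if the code point may appear in a LABEL under the assumptions above."""
    ch = chr(cp)
    return ch not in EXCLUDED_CHARS and unicodedata.category(ch) not in EXCLUDED_CATEGORIES


def label_ranges(limit=0xFFFF):
    """Collapse the accepted code points in 0..limit into inclusive (start, end) ranges."""
    ranges = []
    start = None
    for cp in range(limit + 1):
        if is_label_char(cp):
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, limit))
    return ranges


def as_antlr_alternatives(ranges):
    """Format ranges as alternatives using '\\uXXXX' character literals."""
    def lit(cp):
        return "'\\u%04x'" % cp
    return "\n  | ".join(lit(a) if a == b else f"{lit(a)}..{lit(b)}" for a, b in ranges)


print("LABEL :\n  ( " + as_antlr_alternatives(label_ranges()) + "\n  )+ ;")
```

Under these assumptions the ideographic space U+3000 is excluded (category Zs), while modifier letters such as U+02B0 are accepted (category Lm). The exact output depends on the Unicode version shipped with the Python interpreter.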

Why doesn’t

LABEL : ~(WHITESPACE | '(' | ')')+ ;

work?

(139): set complement is empty - (208): The following token definitions
can never be matched because prior tokens match the same input: LABEL

H.
-- 
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
