[antlr-interest] How to specify ‘any non-control symbol’?

Hendrik Maryns qwizv9b02 at sneakemail.com
Tue Oct 28 08:23:04 PDT 2008


Johannes Luber jaluber-at-gmx.de |news.gmane.org| wrote:
> Hendrik Maryns wrote:
>> Johannes Luber wrote:
>>> Hendrik Maryns wrote:
>>>> Hi,
>>>>
>>>> I want to define a LABEL lexer rule which should match almost anything.
>>>>  Let’s say any non-control Unicode symbol.  Antlr wouldn’t accept .* or
>>>> .+.  I probably don’t want a closing brace in there since it is a
>>>> lisp-like grammar, but even space would be fine (although it probably
>>>> won’t occur), so I did ~(')')+ but that feels like a hack.  Can I use
>>>> POSIX regex classes such as p{alphnum} or something of the like?
>>> Currently ANTLR doesn't support Unicode classes. The only workaround
>>> would be to define manually all code points (manually means
>>> semi-automatic via use of some existing table as starting point). You
>>> should be aware that ANTLR doesn't accept code points above \uffff, so
>>> you'd have to translate UTF-32 into UTF-16 surrogates.
>> This is what it already seems to do internally, see the attached image
>> ANTLRWorks produced.
> 
> It looks to me as if the code handles merely UCS-2 and not UTF-16.
> Without seeing at least the rule you used as input I can't be entirely
> sure, though.
>>> BTW, while it at first seems to be a good idea to do this kind of
>>> discrimination in the lexer, you get far better error messages if you
>>> push the error checking into the parser. Doing so merely requires
>>> making the lexer discriminate between the potential classes in a
>>> minimal way. If you like I can send you a lexer of mine using this
>>> strategy for comparison purposes.
>> I don’t understand this. What do you mean by ‘this kind of
>> discrimination’
> 
> I mean checking the input in such a way that no illegal character can be
> used to create a token, so that the lexer bails out immediately. Then
> your error messages won't be able to say much about the context.

It took me a while to parse this, but I think I understand you now.
There remains the question of how the parser is then supposed to find out
that the token is illegal.
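
One conceivable answer, sketched in plain Java (ANTLR's default target
language): let the lexer accept a broad LABEL token, then inspect the token
text afterwards, for instance from a rule action. The names LabelCheck and
firstControlIndex are made up for illustration and are not part of ANTLR;
the sketch relies only on java.lang.Character and treats "control symbol"
as Unicode general category Cc, which is an assumption.

```java
public class LabelCheck {
    /**
     * Returns the char index of the first control character (Unicode
     * category Cc) in text, or -1 if the text contains none.
     * Iterates by code point so surrogate pairs are handled as one symbol.
     */
    public static int firstControlIndex(String text) {
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (Character.getType(cp) == Character.CONTROL) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstControlIndex("label"));   // no control characters
        System.out.println(firstControlIndex("la\tbel")); // tab is a control character
    }
}
```

Because the whole token is still available at that point, the message can
name the label and the offending position instead of the lexer giving up on
a single character.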

>> and in which way am I putting it in the lexer and could
>> push it into the parser?  I am afraid I am too new in this area to
>> follow you here.
> 
> Let's assume that identifiers may not start with uppercase letters. Then
> the above-mentioned method would be to define the rule as:
> 
> ident: LOWERCASE (LOWERCASE | UPPERCASE)*;
> 
> My proposal is to use:
> 
> ident: (LOWERCASE | UPPERCASE)+;
> 
> Then the parser can tell you that "identifier PascalCase starts with an
> uppercase character".

And how would it do that?  What would I have to specify in order for the
parser to check for that?
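
A sketch of what such a check might look like, written as plain Java so
that it could be called from an action attached to the lenient
(LOWERCASE | UPPERCASE)+ rule. IdentCheck and describeProblem are
hypothetical names, not ANTLR API, and how the message is then reported
(for example through the parser's own error reporting) is left open here.

```java
public class IdentCheck {
    /**
     * Returns a human-readable error for an illegal identifier,
     * or null if the identifier is acceptable.
     * The rule checked here is the one from the example above:
     * identifiers may not start with an uppercase letter.
     */
    public static String describeProblem(String ident, int line) {
        if (ident.isEmpty()) {
            return "line " + line + ": empty identifier";
        }
        if (Character.isUpperCase(ident.charAt(0))) {
            return "line " + line + ": identifier " + ident
                    + " starts with an uppercase character";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(describeProblem("camelCase", 3));
        System.out.println(describeProblem("PascalCase", 4));
    }
}
```

The point of the lenient rule is that the complete identifier has already
been matched when the check runs, so the message can quote it in full.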

H.
-- 
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html