[antlr-interest] Problem parsing unit symbols

Mark van Assem mark at cs.vu.nl
Fri Nov 6 04:18:58 PST 2009


Hi Jim,

> So either the lexer specs are incorrect, or the characters you pasted here are not in an encoding that matches what Java is looking for. Send them in UTF8 format. The UTF8 version of Ohm is 0xE2 0x84 0xA6 for instance. What encoding are you sending in? When you come to read input files, then you will need to tell the file stream what the file encoding is.
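
For reference, the byte sequence quoted above can be checked with a few lines of plain Java. This is only a sketch: the class and method names (OhmBytes, utf8Hex, decodeAsLatin1) are made up for illustration and are not part of ANTLR. It also shows what happens when the same bytes are decoded with a different charset, which is presumably what goes wrong when the platform default encoding is used.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class OhmBytes {
    // Hex dump of the UTF-8 encoding of a string.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decode the UTF-8 bytes with a mismatched charset (ISO-8859-1).
    static String decodeAsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String ohm = "\u2126"; // OHM SIGN, same code point as in the lexer rule
        System.out.println(utf8Hex(ohm));                 // E2 84 A6
        System.out.println(decodeAsLatin1(ohm).length()); // 3
    }
}
```

Read with the wrong charset, the single Ohm character turns into three unrelated characters, none of which the lexer has a rule for.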

How can I accomplish this? Notepad, for example, can save a file as
UTF8, but how do I get correctly encoded characters into the file in the
first place? Copying them from a website won't work, of course.
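
One way to sidestep the copy-and-paste problem might be to generate the test file from code, using \u escapes and an explicit charset, and to read it back with the same charset. The sketch below uses only the standard Java library; the class name Utf8TestFile, the method roundTrip and the temp file are placeholders, not anything from ANTLR. On the ANTLR 3 side, as far as I know, ANTLRFileStream has a constructor that takes an encoding name as its second argument, which would play the role of the explicit charset here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper class, for illustration only.
class Utf8TestFile {
    // Write the lines to a temp file as UTF-8, then read them back as UTF-8.
    static List<String> roundTrip(String... lines) {
        try {
            Path p = Files.createTempFile("units", ".txt");
            try {
                String text = String.join("\n", lines) + "\n";
                // Explicit charset: do not rely on the platform default.
                Files.write(p, text.getBytes(StandardCharsets.UTF_8));
                return Files.readAllLines(p, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // \u2126 = Ohm sign, \u03B1 = Greek small alpha (both in the grammar)
        for (String line : roundTrip("\u2126", "\u03B1")) {
            System.out.printf("U+%04X%n", line.codePointAt(0));
        }
    }
}
```

Because the characters are written as escapes in the source, the result does not depend on what the editor or the clipboard does to them.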

In your second mail you say that you "hacked ANTLRWorks to set UTF8
encoding on file input rather than default and your example stuff
works". This sounds useful for many people, and for me in particular.
Can I somehow get this new version?

Many thanks,
Mark.

>> -----Original Message-----
>> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
>> bounces at antlr.org] On Behalf Of Mark van Assem
>> Sent: Thursday, November 05, 2009 9:30 AM
>> To: antlr-interest at antlr.org
>> Subject: [antlr-interest] Problem parsing unit symbols
>>
>> Hello Antlers,
>>
>> I'm designing a lexer/parser for units of measure (e.g. meters,
>> seconds). In that process I'm trying to match symbols like Ω (Ohm) and
>> å (angstrom).
>>
>> Below is the relevant part of the grammar, the part that treats
>> symbols. The grammar checks out OK in ANTLRWorks, but I get an
>> EarlyExitException when I run it on a two-line file that has the Ohm
>> sign on the first line and the angstrom sign on the second. The
>> behaviour is different in the interpreter: there the first line is
>> parsed OK, but for the second line a NoViableAltException is given.
>>
>> If I understand correctly, an EarlyExitException means that a (..)+
>> loop failed because there wasn't anything to match. The rules "file" and
>> "expr" thus seem the only suspects. However, they both seem right to me
>> and fiddling with them produces other errors.
>>
>> Any ideas anyone?
>>
>> Thanks,
>> Mark van Assem.
>>
>> -----------------------------------------------------------------------
>> grammar unitsymbols;
>>
>> file	:	(expr NEWLINE)+ ;
>>
>> expr 	:	symbol+;
>>
>> symbol	:	US;
>>
>> /* LEXER */
>>
>> WS	:	' ' {$channel=HIDDEN;} ;
>> NEWLINE:'\r'? '\n'  ;
>>
>> // unit symbols like Ohm
>> US
>> 	: OHM | ALP ;
>>
>> fragment OHM	:	'\u2126' | '\u03A9';	// Ohm symbol
>> fragment ALP	:	'\u0251' | '\u03B1';	// alpha
>> -----------------------------------------------------------------------
>>
>> input:
>>
>> Ω
>> å
>>
>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
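
As a cross-check on the grammar and sample input quoted above, the sketch below tests each input character against the four code points listed in the OHM and ALP fragments. It assumes that the second input line really contains å (U+00E5) as it is displayed here; the archive or a mail client may have altered the original character. The class name UnitSymbolCheck and the method matchesUS are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class UnitSymbolCheck {
    // Code points accepted by the US rule via the OHM and ALP fragments.
    static final Set<Integer> US =
        new HashSet<>(Arrays.asList(0x2126, 0x03A9, 0x0251, 0x03B1));

    static boolean matchesUS(int codePoint) {
        return US.contains(codePoint);
    }

    public static void main(String[] args) {
        // Ohm sign, a-with-ring (assumed to be the second input line), Greek alpha
        for (String s : new String[] {"\u2126", "\u00E5", "\u03B1"}) {
            int cp = s.codePointAt(0);
            System.out.printf("U+%04X %s%n", cp,
                matchesUS(cp) ? "matched by US" : "no lexer rule");
        }
    }
}
```

Under that assumption, U+00E5 is not among the characters the lexer accepts, so the second line could fail even when the file is read with the correct encoding.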
