[antlr-interest] Problem parsing unit symbols

Thu Nov 5 11:18:40 PST 2009

Mark van Assem wrote:
> Hello Antlers,
> 
> I'm designing a lexer/parser for units of measure (e.g. meters, 
> seconds). In that process I'm trying to match symbols like Ω (Ohm) and å 
> (angstrom).

The Ångstrom symbol is capital-A-ring (\u00C5 or \u212B), by the way.
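
As a minimal sketch of why both code points matter: U+212B (ANGSTROM SIGN) and
U+00C5 are canonically equivalent but compare as different characters, so a
grammar that lists only one will not match the other. One option, which is an
assumption on my part and not something your grammar has to do, is to
normalize the input to NFC before lexing. The class name AngstromForms and the
helper toNfc are hypothetical.

```java
import java.text.Normalizer;

class AngstromForms {
    /** Returns the NFC form of s; NFC maps U+212B (ANGSTROM SIGN) to U+00C5. */
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String sign = "\u212B";    // ANGSTROM SIGN
        String letter = "\u00C5";  // LATIN CAPITAL LETTER A WITH RING ABOVE

        System.out.println(sign.equals(letter));         // false
        System.out.println(toNfc(sign).equals(letter));  // true
    }
}
```

The same normalization maps the Ohm sign (U+2126) to Greek capital omega
(U+03A9), so after NFC the grammar only needs the letter forms.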

> Below is the relevant part of the grammar -  the part that treats 
> symbols. The grammar checks out OK in ANTLRWorks, but I get an 
> EarlyExitException when I run it on a file that contains two lines with 
> on the first the Ohm sign and on the second the angstrom sign. The 
> behaviour is different in the interpreter: there the first line is 
> parsed OK, but for the second line a NoViableAltException is given.

The grammar includes alpha, not the Ångstrom symbol, so that explains
the interpreter behaviour. The behaviour when run on a file is most likely
caused by a character encoding issue; make sure that the charset parameter
to ANTLRInputStream matches the encoding of your file (probably UTF-8).
Also, either make sure that the file does not contain an initial BOM
(Byte Order Mark, \uFEFF), or match that character in your grammar.
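
A minimal sketch of the two points above, using only the Java standard
library: decode the input explicitly as UTF-8 instead of relying on the
platform default, then drop a leading BOM if there is one. The class name
Utf8Input and its helpers are hypothetical, and the byte array stands in for
your file. Java's UTF-8 decoder does not remove the BOM by itself, which is
why the explicit check is there.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class Utf8Input {
    /** Removes a single leading byte order mark (U+FEFF), if present. */
    static String stripBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }

    /** Decodes the whole stream as UTF-8 rather than the platform default charset. */
    static String readUtf8(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the input file: BOM, Ohm sign (U+2126), newline,
        // Angstrom sign (U+212B), newline, all encoded as UTF-8.
        byte[] fileBytes = "\uFEFF\u2126\n\u212B\n".getBytes(StandardCharsets.UTF_8);

        String text = stripBom(readUtf8(new ByteArrayInputStream(fileBytes)));

        // Print the code units the lexer would see; the BOM should be gone.
        for (int i = 0; i < text.length(); i++) {
            System.out.printf("U+%04X%n", (int) text.charAt(i));
        }
    }
}
```

Assuming the ANTLR 3 runtime, the cleaned string could then be handed to the
lexer through ANTLRStringStream instead of ANTLRInputStream.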

-- 
David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com
