[antlr-interest] C# parser grammar problem

Wed Mar 7 05:39:00 PST 2007

Terence Parr wrote:
> Given
> 
> java.lang.StringIndexOutOfBoundsException: String index out of range: 7
> 
> Oh, when I debug, it says literal='\u'
> 
> So, here is your problem:
> 
> fragment unicode_escape_sequence[string unicodeClasses]
>         :       '\u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
>         |       '\U' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
> HEX_DIGIT HEX_DIGIT HEX_DIGIT
>         ;
> 
> :)  You want 'u' and 'U'.
> 
> Ter
> 

Actually, I want '\\u' and '\\U', because that rule is supposed to
recognize strings like "\u0029" and "\U0000FF54". I suppose you will
want to improve ANTLR so that the error message is more precise and one
knows what to look for. Another reason why I wrote only one backslash is
that the ANTLR PDF doesn't state that all the usual escape sequences are
recognized, including '\\' and '\''.
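
To make the escaping issue concrete, here is a minimal Java sketch of the
same check outside of ANTLR. It is not what ANTLR generates; the class
name UnicodeEscapeSketch and the method isUnicodeEscape are made up for
illustration. It only shows that a literal backslash has to be escaped
at every level before the 'u' or 'U' that follows it is matched as an
ordinary character.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: accepts a backslash followed by
// 'u' and 4 hex digits, or a backslash followed by 'U' and 8 hex digits.
final class UnicodeEscapeSketch {
    // Four backslashes in Java source become two in the regex,
    // which in turn match one literal backslash in the input.
    private static final Pattern ESCAPE =
        Pattern.compile("\\\\u[0-9A-Fa-f]{4}|\\\\U[0-9A-Fa-f]{8}");

    static boolean isUnicodeEscape(String text) {
        return ESCAPE.matcher(text).matches();
    }

    public static void main(String[] args) {
        // Each input below starts with one real backslash character.
        System.out.println(isUnicodeEscape("\\u0029"));     // true
        System.out.println(isUnicodeEscape("\\U0000FF54")); // true
        // Without the backslash the text is not an escape sequence.
        System.out.println(isUnicodeEscape("u0029"));       // false
    }
}
```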

Furthermore, I've read through the relevant pages 85 and 86, but there
is no explicit sentence that distinguishes lexer rules from parser
rules. After I repaired my grammar I realized that the difference is
simple: "If a rule doesn't include references to other rules, it is a
lexer rule, otherwise a parser rule." Stating this explicitly would have
made my learning curve less steep.

Regarding my grammar, I've noticed that ANTLRWorks complains that the
rules are recursive, thus allowing several ways to reach the
UNICODE_CLASS_Lt rule (and other similar rules), which constitutes a
part of an identifier. The suggested solution is to enable backtracking.
As I don't suppose that I can get rid of the recursion without changing
the recognized language, how much impact does that option have?

On another note, testing the full grammar rules for my Unicode character
class recognizer* revealed that ANTLR doesn't handle valid characters
like '\u1D173', which denotes MUSICAL SYMBOL BEGIN BEAM, because they
have 5 hex digits. I suppose that the Unicode handling of C# is the way
to go, but then ANTLR itself has to be modified. The question is how
Java handles Unicode characters above U+FFFF.
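
As a partial answer to that last question: since Java 5 a char is a
UTF-16 code unit, and a character above U+FFFF is stored as a surrogate
pair of two chars. The Java '\u' escape itself takes exactly four hex
digits. The sketch below uses only standard java.lang calls; the class
name SupplementaryCodePointSketch and the method describe are made up
for illustration, and it says nothing about how ANTLR treats such
characters internally.

```java
// Hypothetical demo class: shows how Java represents U+1D173.
final class SupplementaryCodePointSketch {
    // Returns the UTF-16 code units of a code point as hex, space separated.
    static String describe(int codePoint) {
        char[] units = Character.toChars(codePoint);
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int beginBeam = 0x1D173; // MUSICAL SYMBOL BEGIN BEAM
        String s = new String(Character.toChars(beginBeam));

        System.out.println(s.length());                            // 2 chars
        System.out.println(s.codePointCount(0, s.length()));       // 1 code point
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1d173
        System.out.println(describe(beginBeam));                   // D834 DD73
        System.out.println(Character.isSupplementaryCodePoint(beginBeam)); // true
    }
}
```

So code that walks a string char by char sees two surrogates here, and
only the code point methods (codePointAt, codePointCount) see a single
character.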

Best regards,
Johannes Luber

* I plan to publish the files for the recognizer once the issues have
been resolved.