[antlr-interest] Unicode classes within ANTLR.

Wed Oct 20 22:04:41 PDT 2010

Hello ANTLR folks,

I'm just curious about ANTLR since I'm writing my own variation of a
parser compiler.  My primary language of choice is C#, so to browse
the LL(*) based parser compiler competition, I read through the
grammar associated to C# 4.0 (
http://antlrcsharp.codeplex.com/SourceControl/changeset/view/58355#897956
) to get an idea of how they've written the grammar for the language.
One note is that the unicode scope observed by the grammar is far too
shallow to be a valid implementation of the language's parser.  So the
question I have today is: does ANTLR support Character Unicode
classes?  I viewed a secondary ECMA script grammar (
http://www.antlr.org/grammar/1153976512034/ecmascriptA3.g ) that
appeared to specify the ranges associated to a unicode letter, and it
specified the range in a very lengthy manner.

This concerns me because the resulted expanded form of the unicode was
quite difficult to follow.  Using a very simple '@' | ('A'..'Z')
variation (from the C# 4.0 grammar) ANTLR produces code similar to:
   ... input.LA(1) >= '@' && input.LA(1) <= 'Z' ...

I'm not familiar with ANTLR's generation logic; however, in the case
of the first-character of an identifier, there are roughly 47,727
different valid characters that can be used, which covers unicode
classes: Letter Uppercase, Letter lowercase, Letter title-case, Letter
modifier, Letter Other, Letter Number and a unicode escape sequence.
At which point does the ANTLR generator utilize a character
placeholder variable?  I've noticed it switching between the two, the
only correlation I can see is it uses a placeholder within loops
(which is then re-verified over a switch for some reason), and calls
LA(*) as many times as needed, to cover the character range supplied
from the grammar, outside of loops.  The character classes can easily
be translated into a sequence of range checks on the character, but
from my initial tests with my own project, I remember the size of the
code and sheer number of small sets, it slowed the code down due to
bloat.

To make a simple case, my project's far from done, so I'll just use a
static example generated from C#'s identifier:
http://lhq.rpgsource.net/text/CSharpReader/Lexer/IdentifierStateMachine.html

The state machine operates similar to an iterator as defined by the C#
2.0+ specification, with one difference: it requires the current
character as input.  The intent is to demonstrate the use of unicode
character classes as well as normal transitions.

Since my project is primarily focused on research, my aim here today
isn't to irritate, but rather to get insight as to how other systems
work, since there's a lot to learn.  I've been working on the program
I'm on (off and on) for roughly three years, and I've went from
knowing how to write a hand-written parser (poorly), when I started,
to having a rough idea on how to automate the entire process (from
lexer, to tree parser).  The way the lexer in ANTLR is implemented is
similar to how I initially implemented the lexical analysis phase
(although Terence probably got it right the first time, having a
formal background).  From what I can make of the setup, Terence uses a
minimalistic deterministic machine to differentiate the proper path of
a given parse, then it sends it down the proper rule/token method,
after which it attempts each subsequent variation of the rule, trying
things and backtracking until it finds the right path.

In my research to this, I can kind of understand this method, if I'm
right, he uses a deterministic advance look into the text/look ahead
to ascertain the path (since determinism is quick) once the paths
diverge enough to determine which to use, it follows that path.
Determinism is great for finding the quickest, longest, route (if
you're just verifying), but it's not very good for sub-expression
boundaries (which you need to extract the data from the parse you're
interested in, ignoring the punctuation like '(', 'class' and so on,
the important bits are that it's a class, what the class' name is, and
the variable aspects of the definition of that class.)

If unicode classes aren't supported, are they in the works for the
next version of ANTLR (v4)?

Thanks,

-Alexander Morou