[antlr-interest] Fun with ANTLR3: mystery of the huge lexer

David Piepgrass qwertie256 at gmail.com
Sat Jun 30 20:16:21 PDT 2007


> 1) I'm intrigued by user-defined or -extensible grammars. Over on the
> Groovy list we touched on dynamic / user-specified grammars recently in
> the context of adding rule- or logic-based programming capabilities to
> the language (and / or its runtime / libraries), and I'm somewhat
> familiar with Prolog's operator definition and user-specified grammar
> facilities.

Groovy, a language for the JVM, eh? Haven't heard of it before now.
There are sure a lot of new languages these days.

I'm trying to develop a new compiler architecture called Loyc which
will be an extensible multi-language compiler. No web site yet - I'm
only starting a parser now, after all. My idea is to take existing
.NET languages and allow new statements, operators, etc. to be added
to them by dynamically linking extensions with the compiler. So, I am
attempting to construct a lexer and parser in a manner that allows
these extensions.

My design will not attempt to allow any arbitrary grammar fragment to
be specified in the manner done by the Rats! parser generator, but it
will allow a variety of extension techniques. One of those techniques
will be to define "operators" for "expressions". The set of operators
will be defined entirely at run-time.

The design isn't complete but I guess I'll explain it a bit anyway.
The expression-parsing engine is called ONEP: One Nonterminal
Expression Parser. It takes a list of fixed-length operators (no
optional elements or lists), which you could visualize as the
alternatives of a single nonterminal like so:

expr:
    expr '*' expr
  | expr '/' expr
  | expr '+' expr
  | expr '-' expr
  | expr '?' expr ':' expr
  | ID '.' ID
  | '-' expr
  | 'from' expr 'to' expr
  | expr expr
  | ID
  | INT
  | (...)

But each 'expr' has a precedence number to resolve the order of
operations, and instead of a token type you can use a literal string
(which takes priority over another rule that specifies only a token
type). Ambiguities are detected when an ambiguous expression is
actually used, not when the operators are defined.
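
To make that concrete, here's a rough sketch (plain Java, not actual
Loyc/ONEP code; the class and names are made up for illustration) of
the core idea: a precedence-climbing parser that looks up binary
operators in a table supplied at run time, so an extension can add
operators just by adding table entries:

import java.util.*;

// Hypothetical sketch: a precedence-climbing parser whose binary
// operators are supplied at run time rather than baked into a grammar.
class MiniExprParser {
    // operator literal -> precedence (higher binds tighter)
    private final Map<String, Integer> binaryOps;
    private final List<String> tokens;
    private int pos = 0;

    MiniExprParser(Map<String, Integer> binaryOps, List<String> tokens) {
        this.binaryOps = binaryOps;
        this.tokens = tokens;
    }

    String parseExpr(int minPrec) {
        String left = next();                  // primary: ID or INT
        while (pos < tokens.size()) {
            String op = tokens.get(pos);
            Integer prec = binaryOps.get(op);
            if (prec == null || prec < minPrec)
                break;                         // not an operator we bind at this level
            pos++;
            // left-associative: the right side must bind strictly tighter
            String right = parseExpr(prec + 1);
            left = "(" + left + " " + op + " " + right + ")";
        }
        return left;
    }

    private String next() { return tokens.get(pos++); }

    public static void main(String[] args) {
        // The operator set is ordinary data, so an "extension" could add to it.
        Map<String, Integer> ops = new HashMap<>();
        ops.put("+", 10);
        ops.put("-", 10);
        ops.put("*", 20);
        ops.put("/", 20);
        MiniExprParser p = new MiniExprParser(
            ops, Arrays.asList("a", "+", "b", "*", "c"));
        System.out.println(p.parseExpr(0));    // prints (a + (b * c))
    }
}

ONEP will do more than this (prefix and mixfix operators like '?:' and
'from ... to', literal-vs-token priority, ambiguity detection), but the
point is that the operator set is data looked up at run time, not part
of a generated parser.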

There's certainly more to say but I think it best to wait until I have
some working code.

> 2) I don't know why you want to avoid lexer fragments. The way I think
> about them, and someone will, I hope, correct me if I'm wrong about
> this, they're like macros. They don't stand alone, but are "expanded"
> in-line in other lexer rules that reference them.

I'm not trying to avoid them; I just don't want to write

FOO: FOO_FRAGMENT;
fragment FOO_FRAGMENT: '[' (FOO_FRAGMENT | ~('['|']'))* ']';

if

fragment FOO: '[' (FOO | ~('['|']'))* ']';

would work just as well.

ANTLR fragment rules are not expanded inline, and they are not regular
expressions. They are generated as functions just like regular lexer
rules; the main difference between fragment and non-fragment rules is
that a fragment rule is never matched as a token on its own, it only
runs when called by another lexer rule.
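
To illustrate what "generated as functions" means, here is a tiny
hand-written lexer method in Java (my own illustration, not actual
ANTLR-generated code) for the recursive bracket rule above; a fragment
rule ends up as a method in much the same way, called by other rule
methods rather than matched as a token by itself:

// Hand-written illustration (not ANTLR output): a recursive rule is
// just a method that can call itself, and a "fragment" is just a
// method that only other lexer methods call.
class BracketLexer {
    private final String input;
    private int pos = 0;

    BracketLexer(String input) { this.input = input; }

    // Corresponds to:  FOO: '[' (FOO | ~('['|']'))* ']';
    void mFOO() {
        expect('[');
        while (pos < input.length() && input.charAt(pos) != ']') {
            if (input.charAt(pos) == '[')
                mFOO();                       // direct recursion, no wrapper rule needed
            else
                pos++;                        // any char except '[' or ']'
        }
        expect(']');
    }

    private void expect(char c) {
        if (pos >= input.length() || input.charAt(pos) != c)
            throw new RuntimeException("expected '" + c + "' at " + pos);
        pos++;
    }

    public static void main(String[] args) {
        new BracketLexer("[a[b]c]").mFOO();   // matches the whole string
        System.out.println("matched");
    }
}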
-- 
- David
http://qism.blogspot.com

