[antlr-interest] NQOT: Grammar meta-programming

Sat Dec 8 03:35:04 PST 2007

Harald Mueller wrote:
> Steve wrote:
>> I have a vague
>> dream of one day being able to whip up an interpreter for an esoteric
>> language in an hour or less...
> 
> All the so-called "simple expressions" in languages I had to parse in the last 15 years were different - not in intent, but because many "language designers" have no idea about what they do and hence introduce subtle or not so subtle "impossibilities" into their languages. Not even C and C# have the same "simple expressions" - e.g.,
> 
>    (T) -a
> 
> might be a cast expression in C, but never in C#, because they changed the language only a very tiny bit to remove the old "lexical hack." (I hope I did not mess up that example - I did not look into the reference when I wrote it ...).

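Actually, this is not quite right: in C#, (T) -a is still a cast
expression when T is a predefined type keyword such as int. The C#
language specification says: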

"Cast expressions:

A cast_expression is used to explicitly convert an expression to a given
type.

cast_expression
   :    '(' type ')' unary_expression
   ;

A cast_expression of the form (T)E, where T is a type and E is a
unary_expression, performs an explicit conversion (§6.2) of the value of
E to type T. If no explicit conversion exists from E to T, a
compile-time error occurs. Otherwise, the result is the value produced
by the explicit conversion. The result is always classified as a value,
even if E denotes a variable.

The grammar for a cast_expression leads to certain syntactic
ambiguities. For example, the expression (x)-y could either be
interpreted as a cast_expression (a cast of -y to type x) or as an
additive_expression combined with a parenthesized_expression (which
computes the value x - y).

To resolve cast_expression ambiguities, the following rule exists: A
sequence of one or more tokens (§2.3.3) enclosed in parentheses is
considered the start of a cast_expression only if at least one of the
following are true:

• The sequence of tokens is correct grammar for a type, but not for an
expression.

• The sequence of tokens is correct grammar for a type, and the token
immediately following the closing parentheses is the token “~”, the
token “!”, the token “(”, an identifier (§2.4.1), a literal (§2.4.4), or
any keyword (§2.4.3) except as and is.

The term “correct grammar” above means only that the sequence of tokens
must conform to the particular grammatical production. It specifically
does not consider the actual meaning of any constituent identifiers. For
example, if x and y are identifiers, then x.y is correct grammar for a
type, even if x.y doesn’t actually denote a type.

From the disambiguation rule it follows that, if x and y are
identifiers, (x)y, (x)(y), and (x)(-y) are cast_expressions, but (x)-y
is not, even if x identifies a type. However, if x is a keyword that
identifies a predefined type (such as int), then all four forms are
cast_expressions (because such a keyword could not possibly be an
expression by itself)."

At least this doesn't require a lexical hack.
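
To make the rule quoted above concrete, here is a minimal Python sketch
of the two bullet points. It is only an illustration under simplifying
assumptions of mine, not how any real C# parser or ANTLR grammar does
it: tokens are plain strings, the parenthesized part is either a
predefined-type keyword or a dotted name, and the keyword list is
deliberately partial. All function names (starts_cast and its helpers)
are hypothetical.

```python
import re

# Assumption: a partial keyword list, enough for the illustration only.
PREDEFINED_TYPES = {
    "bool", "byte", "char", "decimal", "double", "float", "int", "long",
    "object", "sbyte", "short", "string", "uint", "ulong", "ushort",
}
KEYWORDS = PREDEFINED_TYPES | {
    "as", "is", "new", "this", "base", "null", "true", "false",
    "typeof", "sizeof", "checked", "unchecked",
}

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
LITERAL = re.compile(r'\d+(\.\d+)?|"[^"]*"')


def is_identifier(tok):
    return bool(IDENT.fullmatch(tok)) and tok not in KEYWORDS


def is_literal(tok):
    return bool(LITERAL.fullmatch(tok)) or tok in {"true", "false", "null"}


def is_dotted_name(tokens):
    # identifier ('.' identifier)* ; an empty sequence is rejected.
    if len(tokens) % 2 == 0:
        return False
    return all(
        is_identifier(t) if i % 2 == 0 else t == "."
        for i, t in enumerate(tokens)
    )


def looks_like_type(tokens):
    # "Correct grammar for a type": purely syntactic, no symbol lookup.
    is_predefined = len(tokens) == 1 and tokens[0] in PREDEFINED_TYPES
    return is_predefined or is_dotted_name(tokens)


def looks_like_expression(tokens):
    # Assumption: in this sketch only dotted names count as expressions.
    return is_dotted_name(tokens)


def starts_cast(inner, following):
    """inner: tokens between the parentheses; following: the next token."""
    if not looks_like_type(inner):
        return False
    # First bullet: a type, but not an expression (e.g. the keyword int).
    if not looks_like_expression(inner):
        return True
    # Second bullet: decide on the token after the closing parenthesis.
    return (
        following in {"~", "!", "("}
        or is_identifier(following)
        or is_literal(following)
        or (following in KEYWORDS and following not in {"as", "is"})
    )
```

With this sketch, starts_cast(["x"], "y") and starts_cast(["x"], "(")
are true, starts_cast(["x"], "-") is false, and
starts_cast(["int"], "-") is true, which matches the four forms listed
at the end of the quoted text.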

> Last (but not least??), writing the expression part of a language is often a good "entry point" into the first unit tests that have(!!!!!) to be written for any grammar; and I think what you save up front will haunt you later when you get to the intricacies of a language ...

How would you do unit tests with ANTLR? I'm not sure how to set them up
here.

Johannes