[antlr-interest] Reg LR-style expression parsing in wiki

Mon Feb 22 14:00:11 PST 2010

> That is interesting...can the interpreter fetch/exec loop fit in L1 cache?  maybe...

Things always get more complicated by the time you have a working product,
but I had these wild-ass guesses in my notes:

    Back of envelope calculation: 2KB for basic parser framework,
    suppose average of 3 rules per non-terminal, average of 5
    non-terminal lookups per non-terminal (each with an associated
    2-byte "address" to "call"), average of 5 symbols
    per rule, summing to 3 * (10*2 + 5*2) = 90 bytes per non-terminal.
    Let's round that up to 100 bytes per non-terminal to allow for
    action/reduce opcodes, tail recursion optimization opcodes, etc.

    Say about 60 non-terminals for ANSI C parser. That would come
    to only 6KB+2KB=8KB! Leaves a fair amount of room for user actions,
    parser stack, value stack, etc. especially if you're just constructing
    a parse tree. Even double that to be more pessimistic and 16KB to
    parse ANSI C sounds pretty small, leaving 3/4 of L1 cache on many
    modern Intel processors (and of course, if the entire parser fits in
    16KB, something less than 100% of it will be "hot" enough to be
    regularly demanded in cache; e.g., "struct" tends to be needed
    during the first portion of a C source file, then later maybe not
    at all).

I do intend to make each opcode do as much as possible, and the fact
that I can boil a large expression grammar down to just a few rules helps
as well. For C code generation, for example, one opcode would scan
the list of eligible next terminals (simple list scans instead of sparse table
lookups or compressed table schemes). For Java code generation,
it appears likely just as well to replace the list-scanning opcode with
one switch statement per non-terminal, given the lack of static array
initialization and the fact that two of the highest-level Java VM opcodes
are devoted to switch statement support.

> i was thinking SLR(0) also instead of the prec climbing
> thing for expressions. I'll think about this hard.

So far, I have not found it objectionable to require no null rules
in the expression part of the grammar; you can always wrap
that sub-grammar with another nonterminal that is nullable.
I also have not found any popular language expression construct
whose ambiguous SLR(0) automaton can't correctly be
disambiguated by the classic operator precedence table algorithm,
at least on my whiteboard. I guess the comma operator in C
requires a little massaging because it's ambiguous with function
parameter separation, but that's true in an LALR grammar for
ANSI C as well.