[antlr-interest] Best way to handle a large number of language constants?

Mon Mar 14 14:34:02 PDT 2011

Justin,

Use gperf to generate a perfect hash of your tokens. In the lexer, use one
simple rule that matches any candidate word, and keep a custom dictionary
that maps each word to the token number you already know for it. In the
rule's action, look the matched text up in the perfect hash and change the
token type to the value it returns. I have done this a number of times when
there is a large number of fixed keywords, and it is easy to maintain. To be
honest, you may be able to do everything you need with gperf alone. For
example, a gperf input file starts like this:

%{
// Pick up the token definitions as assigned by ANTLR
//
#include "MySqlLexer.h"

// Certain keywords, as well as being part of the keyword set, are also
// reserved words that cannot normally be used as identifiers. The lexer
// defines the value IS_RESERVED for us to use to indicate this.
//

%}
%struct-type
%ignore-case
%language=ANSI-C
%define hash-function-name getKeyword
%define lookup-function-name getInWordSet
%7bit
%compare-lengths
%readonly-tables
%switch=1
%omit-struct-type
mySqlKeywordTok;
%%
# --------------
# Reserved words
#
# Reserved words are used exclusively to specify syntactical
# constructs in SQL and may not be used as identifiers.
#
ADD, KADD                   				| IS_RESERVED
ALL, ALL                    				| IS_RESERVED
ALTER, ALTER                  				| IS_RESERVED
ANALYZE, ANALYZE                			| IS_RESERVED
AND, AND                    				| IS_RESERVED
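
The lookup-and-retype step described above can be sketched in C as follows. This is only an illustration under stated assumptions: the token numbers, the IS_RESERVED bit value, the struct layout, and the helper keywordOrIdentifier are hypothetical, and getInWordSet here is a hand-written linear scan standing in for the function gperf would generate from the input file.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token numbers; in a real build they would come from the
 * ANTLR-generated header (MySqlLexer.h in the listing above). */
enum { ID = 4, KADD = 5, ALL = 6, ALTER = 7 };

/* Assumed flag bit, chosen to stay clear of the token-number range. */
#define IS_RESERVED 0x10000

/* Assumed layout of one keyword entry. */
struct mySqlKeywordTok {
    const char *name;   /* keyword text */
    int         token;  /* token number, optionally OR'ed with IS_RESERVED */
};

/* Stand-in for the table gperf would emit. */
static const struct mySqlKeywordTok wordlist[] = {
    { "ADD",   KADD  | IS_RESERVED },
    { "ALL",   ALL   | IS_RESERVED },
    { "ALTER", ALTER | IS_RESERVED },
};

/* Case-insensitive comparison of the first len characters. */
static int sameTextIgnoreCase(const char *a, const char *b, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (toupper((unsigned char)a[i]) != toupper((unsigned char)b[i]))
            return 0;
    }
    return 1;
}

/* Stand-in for the gperf-generated lookup: a linear scan instead of a
 * perfect hash. Returns the matching entry, or NULL if str is not a keyword. */
static const struct mySqlKeywordTok *getInWordSet(const char *str, size_t len)
{
    assert(str != NULL);
    for (size_t i = 0; i < sizeof wordlist / sizeof wordlist[0]; i++) {
        const char *name = wordlist[i].name;
        if (strlen(name) == len && sameTextIgnoreCase(name, str, len))
            return &wordlist[i];
    }
    return NULL;
}

/* What the action of the generic "match anything" rule would compute:
 * the token type to assign, plus whether the word is reserved. */
static int keywordOrIdentifier(const char *text, size_t len, int *reserved)
{
    const struct mySqlKeywordTok *kw = getInWordSet(text, len);
    if (kw == NULL) {
        *reserved = 0;
        return ID;                    /* not a keyword: leave it as ID */
    }
    *reserved = (kw->token & IS_RESERVED) != 0;
    return kw->token & ~IS_RESERVED;  /* strip the flag to get the type */
}
```

With these definitions, keywordOrIdentifier("alter", 5, &reserved) yields ALTER with reserved set, while a word that is not in the table, such as "AxisType", comes back as ID. In the lexer, the returned value would be assigned to the token type in the generic rule's action.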

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org
> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Justin Murray
> Sent: Monday, March 14, 2011 9:00 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] Best way to handle a large number of language
> constants?
>
> Hi All,
>
> I am working on a proprietary language of ours that is reminiscent of
> BASIC in some ways, but has morphed over the years into its own
> monstrosity. This language is primarily used to command our hardware
> devices. Our system has a large number of "parameters" (762 to be
> precise) that define the complex configuration of the hardware. This
> configuration mainly lives in a file that is read and sent down to the
> hardware (where it is stored as a simple array of values), but there is
> also the desire to edit these parameters programmatically at runtime.
> Each parameter has a name, and a numeric value. The desire is for each
> parameter to be read/written through simple assignment statements. For
> example, "AxisType.X = 0" assigns the value 0 to the AxisType parameter
> on the X axis. This is currently implemented in a seemingly terrible
> way, and I am looking for the best way to improve it.
>
> The current implementation involves providing a #include file that
> #defines each parameter as an array with a hard-coded index. This
> include file is handled by the pre-processor so that the syntax in
> question only has to handle the hard-coded array. The pre-processor is
> not too terribly inefficient, but the problem is that we have to
> distribute this enormous include file, and the users must remember to
> include it.
>
> I can imagine a couple of other ways to implement this, but I am not
> sure what way would be the most efficient. One way would be to add
> every parameter name as a keyword in the lexer. This has the benefit of
> relying on ANTLR to do all of the lexing for me, so that I don't have
> to parse any strings later in my own code. The problem is that this
> requires a lot of custom code in the grammar file (each token must have
> a well defined numeric index associated with it, to match the index
> used internally in the arrays). Additionally, I don't know how well
> ANTLR will handle having so many hundreds of additional tokens in the
> language. The good thing is that I could auto-generate the grammar from
> our definition of the parameters (in XML format).
>
> Alternatively, I could add a very generic rule to the lexer that would
> match any potentially valid parameter name, and wait until the semantic
> actions to validate this as an actual parameter or a syntax error.
> While this allows for a much simpler grammar on the ANTLR end, what I
> don't like about this is that I then have to write a bunch of C code that
> essentially parses the string again.
>
> So I am looking for some advice on the best way to approach this
> problem. If anyone has done something similar before, I would
> appreciate any suggestions that you have for me.
>
> Much thanks,
>
> Justin Murray
> jmurray at aerotech.com