[antlr-interest] Token parsing speed

Thu Feb 24 09:31:56 PST 2011

On 02/24/2011 04:25 AM, Richard Druce wrote:
> I have a question on general best practice and speed when choosing
> between tokens and rules to construct parts of the grammar. Our language
> has many phrases that share words, a simplified sample being 'first'
> and 'first second'.  Would I be better off putting them in as tokens
> or as rules from a speed-of-parsing perspective? Some of my tokens also
> contain whitespace.
>
> e.g.
> rule1: FIRST;
> 
> rule2: FIRST WS SECOND;
> 
> FIRST: 'first';
> SECOND: 'second';
> 
> WS: ' ';
> 
> or
> 
> rule1: RULE1;
> 
> rule2: RULE2;
> 
> RULE1: 'first second';
> 
> RULE2: 'first';
> 
> Thanks,
> 
> Richard

Richard,
	The answer to your question depends on the answers to these questions:

	How much whitespace is allowed between 'first' and 'second'?

	If the answer is "exactly 1 space", then your second method could
suffice.  If the answer is more than 1, then your first example could
work, depending on what else might be allowed.  Do you have any tokens
in your language that you either skip() or send to the HIDDEN channel?
If so, WS should be one of them, and then you don't care what comes
between FIRST and SECOND.

(You *will* run into more ambiguity problems if you don't code your
lexer rules correctly in the second case.)
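
To make the difference between the two approaches concrete, here is a
minimal sketch in Python rather than ANTLR.  The regular expressions only
stand in for lexer rules, and the names PHRASE, TOKEN, matches_phrase_token
and lex_words are made up for this illustration; none of them come from
ANTLR or from the grammar in the question.

```python
import re

# Second approach from the question: the whole phrase is one literal
# token, so exactly one space between the two words is accepted.
PHRASE = re.compile(r"first second")

# First approach, with whitespace skipped: each word is its own token and
# any run of spaces, tabs, or newlines is matched and then thrown away.
TOKEN = re.compile(r"(?P<FIRST>first)|(?P<SECOND>second)|(?P<WS>[ \t\r\n]+)")


def matches_phrase_token(text):
    """Return True if text is exactly the single-token phrase."""
    return PHRASE.fullmatch(text) is not None


def lex_words(text):
    """Return the token names found in text, dropping whitespace tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup != "WS":  # whitespace never reaches the parser
            tokens.append(m.lastgroup)
        pos = m.end()
    return tokens
```

With these definitions, matches_phrase_token("first  second") (two spaces)
is False, while lex_words returns the same two tokens, FIRST then SECOND,
no matter how many spaces, tabs, or newlines separate the words.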

	If your language allows inline comments, could you put a comment
between the 'first' and 'second' tokens?  How about a newline?  How about
a TAB?

	In most languages, whitespace is considered a delimiter, or a separator.
It is required only where it is needed to keep the lexer from running
adjacent keywords, identifier names, or numbers together.  Otherwise, it
can be ignored.  When I write compiler front-ends, I usually ignore (skip)
whitespace in my grammar, let the lexer lex whatever tokens it finds, and
just parse the tokens it returns.  The only notable exceptions I've run
into are things like pre-processor directives, which need to be recognized
and acted upon when the lexer finds them rather than put into the parser's
token stream.  To that end, I never put WS tokens in my parser rules (only
in lexer rules).
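
As a rough illustration of whitespace acting purely as a separator, the
following Python sketch keeps whitespace and comments out of the token
list the parser would see.  It is not ANTLR code: the /* ... */ comment
syntax, the identifier pattern, and the names KEYWORDS, TOKEN and lex are
all assumptions made up for the example.

```python
import re

# Hypothetical keyword table; any other word is treated as an identifier.
KEYWORDS = {"first": "FIRST", "second": "SECOND"}

TOKEN = re.compile(
    r"(?P<WORD>[A-Za-z_][A-Za-z0-9_]*)"   # longest run of word characters
    r"|(?P<COMMENT>/\*.*?\*/)"            # assumed /* ... */ inline comment
    r"|(?P<WS>[ \t\r\n]+)",               # spaces, tabs, newlines
    re.DOTALL,
)


def lex(text):
    """Split text into (visible, hidden) token lists.

    Only the visible list would be handed to a parser; whitespace and
    comments are kept separately, much like a hidden channel.
    """
    visible, hidden = [], []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        kind, value = m.lastgroup, m.group()
        if kind == "WORD":
            visible.append((KEYWORDS.get(value, "ID"), value))
        else:
            hidden.append((kind, value))
        pos = m.end()
    return visible, hidden
```

Here lex("first /* note */\n\tsecond") yields the visible tokens FIRST and
SECOND, with the comment and whitespace set aside, whereas
lex("firstsecond") yields a single ID token.  That second case is the one
place the whitespace matters: without it the lexer reads the two words as
one longer name.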

	The simple answer is:  Put into tokens only what your parser needs to
see in order to parse the language correctly.

-- 
Kevin J. Cummings
kjchome at verizon.net
cummings at kjchome.homeip.net
cummings at kjc386.framingham.ma.us
Registered Linux User #1232 (http://counter.li.org)