[antlr-interest] Antlr noobie, nondeterminism abounds

Sun May 9 12:26:34 PDT 2004

Wes -

Loring's and Bryan's comments give pointers on how to solve your direct 
dilemma.  However, I think your problem is elsewhere.

> I'm trying to translate an ABNF formal syntax (the IMAP protocol, RFC 
> 3501 to be specific) into an antlr grammar, and so far straight 
> translation is not working very well.

Just so happens that I was once part of a group that wrote an IMAP 
client, and know that protocol all too well.  If one is going to parse 
it with Antlr (or any lexer/parser tool for that matter), then you need 
a different approach.

The syntax of IMAP is not very "token regular" (term I just invented).  
That is, what constitutes a token varies greatly depending where in the 
grammar you are.  Contrast this with Java: The sequence "123" has the 
same meaning almost everywhere in the grammar.  Furthermore, the few 
exceptions (within strings and comments) exist only within otherwise 
identifiable lexer tokens.

In general, the lexer needs to always be able to tell what the next 
token is without knowing anything about the parse state of the grammar. 
  Remember:  All the non-protected lexer rules are put together as 
alternatives in one big synthesized lexer rule called nextToken.

IMAP is specified with a grammar that goes all the way down to 
characters.  The specification doesn't distinguish lexical and 
grammatical stages.  When trying to find the place to split such a 
grammar into lexing and parsing, one tries to find a natural place in 
the grammar where there is a set of productions that will become lexer 
tokens:  The token productions must be disambiguated solely on their 
first n characters (without parser state), and make use of only smaller 
productions.  All larger productions must make use of only the token 
productions and nothing smaller.

Sometimes you must look larger or smaller than traditional lexer tokens 
for a given grammar.  Consider something to parse this example file:
	spin part 1234
	count 14
	retest part 1627
	count 1000

Assume the spec says that part number must be four digits, and that 
counts are non-negative integers.  The normal approach wouldn't work:

--in parser--
statement: action | count ;
action: verb part ;
verb: "spin" | "retest" | "freeze" | "disintegrate" ;
part: "part" PART_ID ;
count: "count" NUMBER ;

--in lexer--
PART_ID: DIGIT DIGIT DIGIT DIGIT ;
NUMBER: (DIGIT)+ ;
protected DIGIT: '0'..'9' ;
WS: (' ' | '\t')+ { $setType(SKIP); } ;
NL: '\n' { newline(); $setType(SKIP); } ;
WORD: ('a'..'z')+ ;

This doesn't work because the lexer never knows if it should treat the 
next digit as the start of a PART_ID or a NUMBER.  Even with k=5, this 
doesn't work because "1000" is a valid NUMBER.

The solution in this case is to make the tokens much bigger.  This 
particular grammar always identifies its numeric entities with a 
prefix, so the lexer knows what to do without knowing what the parser 
is doing.  The following works:

--in parser--
statement: action | COUNT ;
action: verb PART ;
verb: "spin" | "retest" | "freeze" | "disintegrate" ;

--in lexer--
PART: "part" WS DIGIT DIGIT DIGIT DIGIT ;
COUNT: "count" WS (DIGIT)+ ;
protected DIGIT: '0'..'9' ;
WS: (' ' | '\t')+ { $setType(SKIP); } ;
NL: '\n' { newline(); $setType(SKIP); } ;
WORD: ('a'..'z')+ ;

If the example were modified a bit:
	spin 1234
	14
	retest 1627
	1000
then the lexer with normal tokens won't know what to do with digit: Is 
it a part id or a count?  One could  have the parser tell the lexer 
what state it is in and then make the lexer rules dependent on parser 
state (see http://www.antlr.org/doc/streams.html#lexerstates for more 
info).  This approach is only workable if there are a small number of 
lexer states.

Another approach is to make the tokens much smaller and do the work in 
the parser:

--in parser--
statement: action | count ;
action: verb part ;
verb: "spin" | "retest" | "freeze" | "disintegrate" ;
part: DIGIT DIGIT DIGIT DIGIT ;
count: number ;
number: (DIGIT)+ ;

--in lexer--
DIGIT: '0'..'9' ;
WS: ( ' ' | '\t' )+ { $setType(SKIP); } ;
NL: '\n' { newline(); $setType(SKIP); } ;
WORD: ('a'..'z')+ ;

You will probably also want to add actions to the parser rules to 
combine the tokens into more manageable units.  If you were building 
ASTs, you'd create abstract tokens (not returned by the lexer):

tokens { NUMBER; }
...
number: { string n; } (d:DIGIT { n += d->getText(); })+ { ## = 
#[NUMBER, n]; } ;

If you were computing as you parsed:

number returns [ int n ]:
     { n = 0; } ( { int d; } d=digit { n = 10*n + d; } )+ ;
digit returns [ int d ]:
     t:DIGIT { d = t->getText()[0] - '0'; } ;

Alas, the IMAP protocol is rife with this sort of problem: There are 
numerous places where the same sequence of leading characters would 
mean totally different things.  If one tried to tease it apart into 
lexer states my guess is that you'd have dozens.  Trying to multiplex 
lexers or introduce lexer states would be a nightmare.

I think the only possible solution is to make the lexer basically pass 
individual characters as tokens to the parser, and render almost all of 
RFC 3501's ABNF rules as Antlr parser rules.  Where you want a single 
token in your parsed result (or a single computation), use actions in 
the parser rules to combine the individual character tokens into 
meaningful strings and numbers.

	- Mark

Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/

Yahoo! Groups Links

<*> To visit your group on the web, go to:
     http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
     antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
     http://docs.yahoo.com/info/terms/