[antlr-interest] Antlr noobie, nondeterminism abounds
Mark Lentczner
markl at glyphic.com
Sun May 9 12:26:34 PDT 2004
Wes -
Loring's and Bryan's comments give pointers on how to solve your direct
dilemma. However, I think your problem is elsewhere.
> I'm trying to translate an ABNF formal syntax (the IMAP protocol, RFC
> 3501 to be specific) into an antlr grammar, and so far straight
> translation is not working very well.
Just so happens that I was once part of a group that wrote an IMAP
client, and know that protocol all too well. If one is going to parse
it with Antlr (or any lexer/parser tool for that matter), then you need
a different approach.
The syntax of IMAP is not very "token regular" (term I just invented).
That is, what constitutes a token varies greatly depending where in the
grammar you are. Contrast this with Java: The sequence "123" has the
same meaning almost everywhere in the grammar. Furthermore, the few
exceptions (within strings and comments) exist only within otherwise
identifiable lexer tokens.
In general, the lexer needs to always be able to tell what the next
token is without knowing anything about the parse state of the grammar.
Remember: All the non-protected lexer rules are put together as
alternatives in one big synthesized lexer rule called nextToken.
IMAP is specified with a grammar that goes all the way down to
characters. The specification doesn't distinguish lexical and
grammatical stages. When trying to find the place to split such a
grammar into lexing and parsing, one tries to find a natural place in
the grammar where there is a set of productions that will become lexer
tokens: The token productions must be disambiguated solely on their
first n characters (without parser state), and make use of only smaller
productions. All larger productions must make use of only the token
productions and nothing smaller.
Sometimes you must look larger or smaller than traditional lexer tokens
for a given grammar. Consider something to parse this example file:
spin part 1234
count 14
retest part 1627
count 1000
Assume the spec says that part number must be four digits, and that
counts are non-negative integers. The normal approach wouldn't work:
--in parser--
statement: action | count ;
action: verb part ;
verb: "spin" | "retest" | "freeze" | "disintegrate" ;
part: "part" PART_ID ;
count: "count" NUMBER ;
--in lexer--
PART_ID: DIGIT DIGIT DIGIT DIGIT ;
NUMBER: (DIGIT)+ ;
protected DIGIT: '0'..'9' ;
WS: (' ' | '\t')+ { $setType(SKIP); } ;
NL: '\n' { newline(); $setType(SKIP); } ;
WORD: ('a'..'z')+ ;
This doesn't work because the lexer never knows if it should treat the
next digit as the start of a PART_ID or a NUMBER. Even with k=5, this
doesn't work because "1000" is a valid NUMBER.
The solution in this case is to make the tokens much bigger. This
particular grammar always identifies its numeric entities with a
prefix, so the lexer knows what to do without knowing what the parser
is doing. The following works:
--in parser--
statement: action | COUNT ;
action: verb PART ;
verb: "spin" | "retest" | "freeze" | "disintegrate" ;
--in lexer--
PART: "part" WS DIGIT DIGIT DIGIT DIGIT ;
COUNT: "count" WS (DIGIT)+ ;
protected DIGIT: '0'..'9' ;
WS: (' ' | '\t')+ { $setType(SKIP); } ;
NL: '\n' { newline(); $setType(SKIP); } ;
WORD: ('a'..'z')+ ;
If the example were modified a bit:
spin 1234
14
retest 1627
1000
then the lexer with normal tokens won't know what to do with digit: Is
it a part id or a count? One could have the parser tell the lexer
what state it is in and then make the lexer rules dependent on parser
state (see http://www.antlr.org/doc/streams.html#lexerstates for more
info). This approach is only workable if there are a small number of
lexer states.
Another approach is to make the tokens much smaller and do the work in
the parser:
--in parser--
statement: action | count ;
action: verb part ;
verb: "spin" | "retest" | "freeze" | "disintegrate" ;
part: DIGIT DIGIT DIGIT DIGIT ;
count: number ;
number: (DIGIT)+ ;
--in lexer--
DIGIT: '0'..'9' ;
WS: ( ' ' | '\t' )+ { $setType(SKIP); } ;
NL: '\n' { newline(); $setType(SKIP); } ;
WORD: ('a'..'z')+ ;
You will probably also want to add actions to the parser rules to
combine the tokens into more manageable units. If you were building
ASTs, you'd create abstract tokens (not returned by the lexer):
tokens { NUMBER; }
...
number: { string n; } (d:DIGIT { n += d->getText(); })+ { ## =
#[NUMBER, n]; } ;
If you were computing as you parsed:
number returns [ int n ]:
{ n = 0; } ( { int d; } d=digit { n = 10*n + d; } )+ ;
digit returns [ int d ]:
t:DIGIT { d = t->getText()[0] - '0'; } ;
Alas, the IMAP protocol is rife with this sort of problem: There are
numerous places where the same sequence of leading characters would
mean totally different things. If one tried to tease it apart into
lexer states my guess is that you'd have dozens. Trying to multiplex
lexers or introduce lexer states would be a nightmare.
I think the only possible solution is to make the lexer basically pass
individual characters as tokens to the parser, and render almost all of
RFC 3501's ABNF rules as Antlr parser rules. Where you want a single
token in your parsed result (or a single computation), use actions in
the parser rules to combine the individual character tokens into
meaningful strings and numbers.
- Mark
Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/
Yahoo! Groups Links
<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/antlr-interest/
<*> To unsubscribe from this group, send an email to:
antlr-interest-unsubscribe at yahoogroups.com
<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/
More information about the antlr-interest
mailing list