[antlr-interest] Overlapping tokens

David Maxwell david at crlf.net
Wed Oct 5 13:46:32 PDT 2005


Hi all,

Thanks to everyone who replied (on topic ;-) to my C++ beginner
questions.  That did help me get further.

Now I have a more specific query.

In a lex/yacc example, I could do something like this:

"FooBar"                { printf ("Found a FOOBAR lex token\n");
                          strcpy(yylval.stval,yytext);
                          return FOOBAR; }

[a-zA-Z_]*              { printf("Found a ID lex token\n");
                          strcpy(yylval.stval,yytext);
                          return ID; }

If the input text is:
=====
FooBar
=====

The lexer will pass a FOOBAR token to the parser, which then either
accepts it, or not, based on the current position in the grammar.

Any text of the form [a-zA-Z_]* that doesn't match "FooBar" will result in
an ID token being returned to the parser.

In lex/yacc, that is valid for strings such as "Foo".
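
As a rough illustration of the lex behaviour described above, here is a
hand-written C sketch of the "longest match wins, earlier rule breaks ties"
selection. It is not generated lex output, and the names TOK_FOOBAR, TOK_ID,
match_foobar, match_id and next_token are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes, for illustration only. */
enum token { TOK_NONE, TOK_FOOBAR, TOK_ID };

/* Length of the prefix of s matched by the literal rule "FooBar" (0 if none). */
static size_t match_foobar(const char *s)
{
    static const char lit[] = "FooBar";
    size_t n = sizeof lit - 1;
    return strncmp(s, lit, n) == 0 ? n : 0;
}

/* Length of the prefix of s matched by [a-zA-Z_]*. */
static size_t match_id(const char *s)
{
    return strspn(s, "abcdefghijklmnopqrstuvwxyz"
                     "ABCDEFGHIJKLMNOPQRSTUVWXYZ_");
}

/* Choose the rule with the longest match; on a tie the rule listed
 * first ("FooBar") wins. The matched length is stored in *len. */
enum token next_token(const char *s, size_t *len)
{
    size_t kw = match_foobar(s);
    size_t id = match_id(s);

    if (kw > 0 && kw >= id) { *len = kw; return TOK_FOOBAR; }
    if (id > 0)             { *len = id; return TOK_ID; }
    *len = 0;
    return TOK_NONE;
}
```

Under this scheme "FooBar" comes back as TOK_FOOBAR, while "Foo" and
"FooBarBaz" both come back as TOK_ID, because the ID rule matches at least as
much text and the keyword rule only wins an exact tie.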

In Antlr, the same kind of input produces a run-time error, even with
k > length(FooBar) > length(Foo):

Parse exception: <cin>:1:4: expecting ''B'', found '' ''

So, what I'm confused about is this: If I were writing a language without
reserved keywords, I would expect to have to match every piece of
textual input and check it against a list of keywords, and make sure the
parser could use it as a keyword token if appropriate, or an ID if
appropriate. In that case, the 'ID' token matcher would be the only
entry in the lexer...
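
The "match everything as a word, then check it against a list of keywords"
approach described above could look roughly like this in plain C. This is
only a sketch under assumptions: the table contents and the names T_ID,
T_FOOBAR, keywords and classify_word are hypothetical, and nothing here is
Antlr or lex API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token types, for illustration only. */
enum token_type { T_ID, T_FOOBAR };

struct keyword {
    const char *text;
    enum token_type type;
};

/* Keyword table; a real language would list all of its keywords here. */
static const struct keyword keywords[] = {
    { "FooBar", T_FOOBAR },
};

/* Called after the single ID pattern has matched a whole word:
 * return the keyword's token type on an exact match, else T_ID. */
enum token_type classify_word(const char *word)
{
    size_t i;
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strcmp(word, keywords[i].text) == 0)
            return keywords[i].type;
    }
    return T_ID;
}
```

Because the comparison is on the complete word, prefixes such as "Foo" and
longer words such as "FooBarBaz" stay ordinary IDs. Whether the lexer or the
parser makes this call depends on whether the keywords are reserved.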

However, in a language with reserved keywords, the above seems like a
reasonable way to write the lexer patterns, but every prefix of a
reserved keyword ends up being reserved (in effect) too.

Why does Antlr demand that the rest of the token must be 'ooBar' once it
sees the 'F' - when it has another valid token to use - even when given
enough 'k' to tell the difference?

Thanks again,

							David


