[antlr-interest] Is parser control over the lexer possible?

Chris verBurg cheetomonster at gmail.com
Thu Apr 29 16:33:18 PDT 2010


Hey guys,

A question was posted a few days ago about dealing with an infinite input
stream, and the suggestion was to subclass TokenStream so that it didn't
read in all of the input upfront.

I'm running into a similar problem, but before I go run off and subclass
things I thought I'd see if there's a "best practice" for my situation.  It
also overlaps with the "how do I use keywords as identifiers" FAQ
(http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741).

I have a data-file grammar that recognizes strings, numbers, and a ton of
keywords.  Pretending "VERSION" and "LIMIT" are two keywords, here's (part
of) the .g file:

data_file:
  'VERSION' STRING ';'
  | 'LIMIT' NUMBER ';'
  ;

NUMBER:
  ('-'|'+')? ('0'..'9')+
  | ('-'|'+')? ('0'..'9')* '.' ('0'..'9')*
  ;

STRING:
  ('a'..'z' | 'A'..'Z' | '_' | '.' | '0'..'9')+ ;


Problem input #1:

VERSION 1.2 ;

The "1.2" is lexed as a number instead of a string, so I get a parse error.

Problem input #2:

VERSION LIMIT ;

The "LIMIT" is lexed as a keyword instead of a string, so I get a parse
error.
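
To make the two failures concrete, here is a rough, language-neutral sketch
in plain Python. It is not ANTLR-generated code. It assumes the lexer behaves
like "longest match, earlier rule wins ties", which is only an approximation
of what ANTLR does but gives the same result on these two inputs. The names
TOKEN_SPEC and tokenize are made up for illustration.

```python
import re

# Hypothetical stand-ins for the lexer rules above (not ANTLR output).
TOKEN_SPEC = [
    ("KEYWORD", re.compile(r"VERSION|LIMIT")),
    ("NUMBER", re.compile(r"[-+]?[0-9]*\.[0-9]*|[-+]?[0-9]+")),
    ("STRING", re.compile(r"[A-Za-z_.0-9]+")),
    ("SEMI", re.compile(r";")),
]


def tokenize(text):
    """Tokenize with no knowledge of what the parser expects next."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None
        for name, pattern in TOKEN_SPEC:
            m = pattern.match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError("cannot lex at offset %d" % pos)
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens


print(tokenize("VERSION 1.2 ;"))    # the value comes out as NUMBER
print(tokenize("VERSION LIMIT ;"))  # the value comes out as KEYWORD
```

In both cases the token type is decided before anything knows that a STRING
is expected after VERSION, which is the root of the parse errors.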


I saw the FAQ about keywords-as-identifiers, but I don't think it's helpful
for me.  For the NUMBER-that-should-be-a-STRING problem, there's no exact
string I could pass to input.LT(1).getText().equals(), because recognizing a
NUMBER takes a pattern, not a fixed string.  The other solution was to make an "identifier"
rule to match all possibilities -- is the best solution here really to
change the rule to 'VERSION' (STRING | NUMBER) ';'?

For the keyword-that-should-be-a-STRING problem, I'm hesitant to use either
of those solutions because of the sheer number of keywords in this grammar.
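
For reference, the "accept several token types" workaround boils down to
something like the sketch below, written in plain Python rather than as a
grammar rule. It assumes tokens arrive as (type, text) pairs; parse_version
and VALUE_TYPES are hypothetical names, not part of ANTLR.

```python
# Token types allowed where the grammar says STRING (hypothetical names).
VALUE_TYPES = {"STRING", "NUMBER"}


def parse_version(tokens, value_types=VALUE_TYPES):
    """Parse 'VERSION' (STRING | NUMBER | ...) ';' from (type, text) pairs."""
    if len(tokens) != 3:
        raise SyntaxError("expected exactly three tokens")
    (_, keyword), (value_type, value), (end_type, _) = tokens
    if keyword != "VERSION" or end_type != "SEMI":
        raise SyntaxError("expected VERSION <value> ;")
    if value_type not in value_types:
        raise SyntaxError("unexpected %s %r" % (value_type, value))
    return value  # only the text is used; the token type is ignored


# Problem input #1 is accepted once NUMBER is in the allowed set.
print(parse_version([("KEYWORD", "VERSION"), ("NUMBER", "1.2"), ("SEMI", ";")]))
```

Problem input #2 still fails with this set, because LIMIT arrives as a
keyword token. Every keyword would have to be added to the allowed types,
which is exactly the part that scales badly with a large keyword list.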


Ideally what I'd like to do is what I did in Flex and Bison (which I'm
porting this grammar from).  What I did there was have the parser control
how the lexer interpreted subsequent tokens.  I embedded a rule in the
parser, immediately after the 'VERSION' token, to tell Flex to enter a
"force-the-next-token-to-be-a-STRING-no-matter-what" start state.  It worked
beautifully.  I got most of the way through implementing that in my ANTLR
grammar when I found out that all the tokens get read in before the parser
even starts up (ANTLRFileStream loads the whole file, and as far as I can
tell CommonTokenStream then buffers every token) -- which means the parser
can't give the lexer any direction over token interpretation.
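
The Flex/Bison arrangement described above can be sketched roughly as
follows. This is a hand-written illustration in plain Python, not ANTLR or
Flex API: OnDemandLexer, force_string, expect and parse_statement are all
hypothetical names. The key assumption is that the lexer produces exactly one
token per call, so a flag set by the parser takes effect on the very next
token.

```python
import re

# Hypothetical token patterns mirroring the grammar in this post.
RULES = [
    ("KEYWORD", re.compile(r"VERSION|LIMIT")),
    ("NUMBER", re.compile(r"[-+]?[0-9]*\.[0-9]*|[-+]?[0-9]+")),
    ("STRING", re.compile(r"[A-Za-z_.0-9]+")),
    ("SEMI", re.compile(r";")),
]
STRING_RE = RULES[2][1]


class OnDemandLexer:
    """Lexes one token per call, so a parser-set flag affects the next token."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.force_string = False  # set by the parser, cleared after one token

    def next_token(self):
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.text):
            return ("EOF", "")
        if self.force_string:
            self.force_string = False
            candidates = [("STRING", STRING_RE)]  # only STRING may match
        else:
            candidates = RULES
        best = None
        for name, pattern in candidates:
            m = pattern.match(self.text, self.pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and m.end() > self.pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise SyntaxError("cannot lex at offset %d" % self.pos)
        token = (best[0], self.text[self.pos:best[1]])
        self.pos = best[1]
        return token


def expect(lexer, kind):
    actual, text = lexer.next_token()
    if actual != kind:
        raise SyntaxError("expected %s, got %s %r" % (kind, actual, text))
    return text


def parse_statement(lexer):
    """data_file: 'VERSION' STRING ';' | 'LIMIT' NUMBER ';'"""
    kind, keyword = lexer.next_token()
    if (kind, keyword) == ("KEYWORD", "VERSION"):
        lexer.force_string = True  # steer the lexer before the value is read
        value = expect(lexer, "STRING")
    elif (kind, keyword) == ("KEYWORD", "LIMIT"):
        value = expect(lexer, "NUMBER")
    else:
        raise SyntaxError("expected VERSION or LIMIT, got %r" % keyword)
    expect(lexer, "SEMI")
    return (keyword, value)


print(parse_statement(OnDemandLexer("VERSION 1.2 ;")))
print(parse_statement(OnDemandLexer("VERSION LIMIT ;")))
```

This only works because nothing looks ahead past the keyword before the flag
is set. Any token stream that prefetches or buffers tokens, including for
parser lookahead, would lex the value before the parser gets a chance to
switch modes.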


Thoughts, suggestions, outrageous flames?  Is there a "good" way to do this,
or maybe is there a completely different approach I should take?

Thanks!
-Chris

