[antlr-interest] Is parser control over the lexer possible?
cheetomonster at gmail.com
Thu May 6 18:06:48 PDT 2010
OK, let me try a related but far less involved question:
ANTLR tokenizes all input into an internal list before parsing anything in
that list. (Right?) Hence, it runs out of memory trying to read my
6.2-million-line input file, because that list is huge. What's the ANTLR
way to handle such large input streams?
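
For illustration, here is a minimal sketch of the pull-on-demand idea behind the "subclass TokenStream" suggestion. It is plain Java and does not use or imitate ANTLR's API: the class name OnDemandTokens, its nextToken() method, and the whitespace-only tokenization are all assumptions made up for the example. The point is only that the consumer asks for one token at a time, so no list of all tokens is ever built.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch, not an ANTLR class: tokens are produced one at a time,
// so memory is bounded by the longest line rather than by the whole input.
final class OnDemandTokens {
    private final BufferedReader in;
    private String[] pending = new String[0]; // tokens of the current line only
    private int next = 0;
    int produced = 0;                         // tokens handed out so far

    OnDemandTokens(Reader source) {
        this.in = new BufferedReader(source);
    }

    /** Returns the next whitespace-separated token, or null at end of input. */
    String nextToken() {
        while (next >= pending.length) {
            String line;
            try {
                line = in.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (line == null) {
                return null;
            }
            String trimmed = line.trim();
            pending = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
            next = 0;
        }
        produced++;
        return pending[next++];
    }

    /** Stand-in for a parser: counts ';' terminators while pulling tokens lazily. */
    static int countStatements(Reader source) {
        OnDemandTokens tokens = new OnDemandTokens(source);
        int statements = 0;
        for (String t = tokens.nextToken(); t != null; t = tokens.nextToken()) {
            if (t.equals(";")) {
                statements++;
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        System.out.println(countStatements(new StringReader("VERSION 1.2 ;\nLIMIT 10 ;\n")));
    }
}
```

With a FileReader in place of the StringReader, the same loop would walk a multi-million-line file while holding only one line's worth of tokens at a time. A real grammar needs lookahead, so an actual token stream would keep a small window of tokens rather than exactly one.
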
On Thu, Apr 29, 2010 at 4:33 PM, Chris verBurg <cheetomonster at gmail.com> wrote:
> Hey guys,
> A question was posted a few days ago about dealing with an infinite input
> stream, and the suggestion was to subclass TokenStream so that it didn't
> read in all of the input upfront.
> I'm running into a similar problem, but before I go run off and subclass
> things I thought I'd see if there's a "best practice" for my situation. It
> also overlaps with the "how do I use keywords as identifiers" FAQ
> <http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741>.
> I have a data-file grammar that recognizes strings, numbers, and a ton of
> keywords. Pretending "VERSION" and "LIMIT" are two keywords, here's (part
> of) the .g file:
>     'VERSION' STRING ';'
>   | 'LIMIT' NUMBER ';'
>   ;
>
> NUMBER
>   : ('-'|'+')? ('0'..'9')+
>   | ('-'|'+')? ('0'..'9')* '.' ('0'..'9')*
>   ;
>
> STRING
>   : ('a'..'z' | 'A'..'Z' | '_' | '.' | '0'..'9')+ ;
> Problem input #1:
> VERSION 1.2 ;
> The "1.2" is lexed as a number instead of a string, so I get a parse error.
> Problem input #2:
> VERSION LIMIT ;
> The "LIMIT" is lexed as a keyword instead of a string, so I get a parse
> error.
> I saw the FAQ about keywords-as-identifiers, but I don't think it's helpful
> for me. For the NUMBER-that-should-be-a-STRING problem, there's no exact
> string I could pass to input.LT(1).getText().equals(), because it requires
> a regex to match a NUMBER. The other solution was to make an "identifier"
> rule to match all possibilities -- is the best solution here really to
> change the rule to 'VERSION' (STRING | NUMBER) ';'?
> For the keyword-that-should-be-a-STRING problem, I'm hesitant to use either
> of those solutions because of the sheer number of keywords in this grammar.
> Ideally what I'd like to do is what I did in Flex and Bison (which I'm
> porting this grammar from). What I did there was have the parser control
> how the lexer interpreted subsequent tokens. I embedded a rule in the
> parser, immediately after the 'VERSION' token, to tell Flex to enter a
> "force-the-next-token-to-be-a-STRING-no-matter-what" start state. It worked
> beautifully. I got most of the way through implementing that in my ANTLR
> grammar when I found out that the token stream (CommonTokenStream) buffers
> all the tokens before the parser consumes the first one, which means the
> parser can't give the lexer any direction over token interpretation.
> Thoughts, suggestions, outrageous flames? Is there a "good" way to do
> this, or maybe is there a completely different approach I should take?
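
To make the Flex-style idea above concrete, here is a rough sketch of a lexer that the parser can steer. It is plain Java, not ANTLR-generated code and not ANTLR's API: the names ModalLexerSketch, forceNextString(), and parseStatement() are hypothetical, input is split on whitespace for simplicity, and only the two keywords from the example are known. It assumes the lexer is asked for one token at a time; if every token were classified up front, the flag would be set too late to matter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch, not ANTLR-generated code: a pull-based lexer whose
// classification of the next token can be overridden by the parser.
final class ModalLexerSketch {
    enum Kind { KEYWORD, NUMBER, STRING, SEMI, EOF }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    static final class Lexer {
        private static final Set<String> KEYWORDS =
                new HashSet<>(Arrays.asList("VERSION", "LIMIT"));
        // Same shape as the NUMBER rule: optional sign, then digits or digits '.' digits.
        private static final Pattern NUMBER = Pattern.compile("[-+]?(\\d+|\\d*\\.\\d*)");
        private final String[] words;
        private int pos = 0;
        private boolean forceString = false;

        Lexer(String input) {
            String trimmed = input.trim();
            words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        /** Called by the parser: classify the next non-';' word as STRING. */
        void forceNextString() { forceString = true; }

        /** Classifies one word per call, so a flag set by the parser is seen in time. */
        Token next() {
            if (pos >= words.length) return new Token(Kind.EOF, "");
            String w = words[pos++];
            if (w.equals(";")) return new Token(Kind.SEMI, w);
            if (forceString) {
                forceString = false;
                return new Token(Kind.STRING, w);
            }
            if (KEYWORDS.contains(w)) return new Token(Kind.KEYWORD, w);
            if (NUMBER.matcher(w).matches()) return new Token(Kind.NUMBER, w);
            return new Token(Kind.STRING, w);
        }
    }

    /** Parses one statement; after 'VERSION' it tells the lexer what to expect. */
    static String parseStatement(Lexer lexer) {
        Token head = lexer.next();
        Token value;
        if (head.kind == Kind.KEYWORD && head.text.equals("VERSION")) {
            lexer.forceNextString(); // must run before the lexer classifies the next word
            value = expect(lexer.next(), Kind.STRING);
        } else if (head.kind == Kind.KEYWORD && head.text.equals("LIMIT")) {
            value = expect(lexer.next(), Kind.NUMBER);
        } else {
            throw new IllegalStateException("unexpected token: " + head.text);
        }
        expect(lexer.next(), Kind.SEMI);
        return head.text + "=" + value.text;
    }

    private static Token expect(Token t, Kind kind) {
        if (t.kind != kind) {
            throw new IllegalStateException("expected " + kind + " but got " + t.kind);
        }
        return t;
    }

    public static void main(String[] args) {
        System.out.println(parseStatement(new Lexer("VERSION 1.2 ;")));   // problem input #1
        System.out.println(parseStatement(new Lexer("VERSION LIMIT ;"))); // problem input #2
    }
}
```

Both problem inputs parse here because the word after VERSION is never offered to the keyword or NUMBER checks. Without the forceNextString() call, "1.2" would come back as NUMBER and "LIMIT" as KEYWORD, which is the failure described above.
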