[antlr-interest] Is parser control over the lexer possible?

Thu May 6 18:06:48 PDT 2010

Hey all,

OK, let me try a related but far less involved question:

ANTLR tokenizes all input into an internal list before parsing anything in
that list.  (Right?)  Hence, it runs out of memory trying to read my
6.2-million-line input file, because that list is huge.  What's the ANTLR
way to handle such large input streams?
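
To make the question concrete, here is a rough sketch of the behaviour I'm after, written in plain Python rather than against the ANTLR runtime. The names `tokenize` and `count_statements` and the simplified token pattern are made up for illustration. The point is only that tokens are produced one at a time and discarded once a statement is finished, so memory use does not grow with the size of the input:

```python
import io
import re
from typing import Iterable, Iterator, Tuple

Token = Tuple[str, str]  # (type, text)

# Assumed, simplified token pattern: a ';' terminator, or any run of
# characters that are neither whitespace nor ';'.
_TOKEN_RE = re.compile(r";|[^\s;]+")


def tokenize(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens one at a time instead of building a full list."""
    for line in lines:  # a file object hands over one line at a time
        for match in _TOKEN_RE.finditer(line):
            text = match.group()
            yield ("SEMI" if text == ";" else "WORD", text)


def count_statements(tokens: Iterator[Token]) -> int:
    """Consume the stream statement by statement.

    Only the tokens of the statement currently being read are held.
    """
    count = 0
    current = []
    for tok in tokens:
        current.append(tok)
        if tok[0] == "SEMI":
            count += 1
            current.clear()  # drop the finished statement's tokens
    return count


sample = io.StringIO("VERSION 1.2 ;\nLIMIT 42 ;\n")
print(count_statements(tokenize(sample)))
```

With a real file, `tokenize(open(path))` would be used the same way. Whether ANTLR offers an equivalent pull-style token stream out of the box is exactly what I'm asking.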

Thanks,
-Chris

On Thu, Apr 29, 2010 at 4:33 PM, Chris verBurg <cheetomonster at gmail.com> wrote:

> Hey guys,
>
> A question was posted a few days ago about dealing with an infinite input
> stream, and the suggestion was to subclass TokenStream so that it didn't
> read in all of the input upfront.
>
> I'm running into a similar problem, but before I go run off and subclass
> things I thought I'd see if there's a "best practice" for my situation.  It
> also overlaps with the "how do I use keywords as identifiers" FAQ
> <http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741>.
>
> I have a data-file grammar that recognizes strings, numbers, and a ton of
> keywords.  Pretending "VERSION" and "LIMIT" are two keywords, here's (part
> of) the .g file:
>
> data_file:
>   'VERSION' STRING ';'
>   | 'LIMIT' NUMBER ';'
>   ;
>
> NUMBER:
>   ('-'|'+')? ('0'..'9')+
>   | ('-'|'+')? ('0'..'9')* '.' ('0'..'9')*
>   ;
>
> STRING:
>   ('a'..'z' | 'A'..'Z' | '_' | '.' | '0'..'9')+ ;
>
>
> Problem input #1:
>
> VERSION 1.2 ;
>
> The "1.2" is lexed as a number instead of a string, so I get a parse error.
>
> Problem input #2:
>
> VERSION LIMIT ;
>
> The "LIMIT" is lexed as a keyword instead of a string, so I get a parse
> error.
>
>
> I saw the FAQ about keywords-as-identifiers, but I don't think it's helpful
> for me.  For the NUMBER-that-should-be-a-STRING problem, there's no exact
> string I could pass to input.LT(1).getText().equals(), because it requires
> a regex to match a NUMBER.  The other solution was to make an "identifier"
> rule to match all possibilities -- is the best solution here really to
> change the rule to 'VERSION' (STRING | NUMBER) ';'?
>
> For the keyword-that-should-be-a-STRING problem, I'm hesitant to use either
> of those solutions because of the sheer number of keywords in this grammar.
>
>
> Ideally what I'd like to do is what I did in Flex and Bison (which I'm
> porting this grammar from).  What I did there was have the parser control
> how the lexer interpreted subsequent tokens.  I embedded an action in the
> parser rule, immediately after the 'VERSION' token, to tell Flex to enter a
> "force-the-next-token-to-be-a-STRING-no-matter-what" start state.  It worked
> beautifully.  I got most of the way through implementing that in my ANTLR
> grammar when I found out that the token stream (CommonTokenStream, fed by
> the lexer reading from ANTLRFileStream) buffers all the tokens before the
> parser even starts up, which means the parser can't give the lexer any
> direction over token interpretation.
>
>
> Thoughts, suggestions, outrageous flames?  Is there a "good" way to do
> this, or maybe is there a completely different approach I should take?
>
> Thanks!
> -Chris
>
>
>
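
For reference, the Flex/Bison arrangement described in the quoted message can be sketched as follows. This is a plain-Python illustration, not ANTLR or Flex code. `OnDemandLexer`, `force_string` and `parse_statement` are hypothetical names, and tokens are assumed to be whitespace-separated to keep it short. It only works because each token is classified when the parser asks for it. If all tokens were classified up front, the flag would be set too late:

```python
import re

KEYWORDS = {"VERSION", "LIMIT"}
# Simplified stand-in for the NUMBER rule in the grammar above.
NUMBER_RE = re.compile(r"[-+]?(\d+(\.\d*)?|\.\d+)")


class OnDemandLexer:
    """Hypothetical lexer that classifies a token only when asked for it."""

    def __init__(self, text):
        self._words = iter(text.split())  # whitespace-separated, for brevity
        self.force_string = False  # set by the parser, like a Flex start state

    def next_token(self):
        word = next(self._words, None)
        if word is None:
            return ("EOF", "")
        if self.force_string:
            self.force_string = False  # the mode covers one token only
            return ("STRING", word)
        if word in KEYWORDS:
            return (word, word)
        if word == ";":
            return (";", word)
        if NUMBER_RE.fullmatch(word):
            return ("NUMBER", word)
        return ("STRING", word)


def parse_statement(lexer):
    """data_file: 'VERSION' STRING ';' | 'LIMIT' NUMBER ';'"""
    kind, _ = lexer.next_token()
    if kind == "VERSION":
        lexer.force_string = True  # steer the lexer before the next token is read
        expected = "STRING"
    elif kind == "LIMIT":
        expected = "NUMBER"
    else:
        raise SyntaxError(f"unexpected token {kind}")
    value_kind, value = lexer.next_token()
    if value_kind != expected:
        raise SyntaxError(f"expected {expected}, got {value_kind}")
    if lexer.next_token()[0] != ";":
        raise SyntaxError("expected ';'")
    return (kind, value)


print(parse_statement(OnDemandLexer("VERSION 1.2 ;")))
print(parse_statement(OnDemandLexer("VERSION LIMIT ;")))
```

Both problem inputs parse here because the token after VERSION is never offered to the NUMBER or keyword checks. The open question is how to get the same pull-style interaction between an ANTLR parser and lexer.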