[antlr-interest] Is parser control over the lexer possible?

Thu May 6 19:13:29 PDT 2010

Hi Chris,

Yes, ANTLR reads the whole file into memory.  I don't know how to stop 
it from doing that. 
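
For what it's worth, here is a minimal sketch of the general idea behind
both questions in the thread: a lexer that hands out one token per call, so
no token list is ever built, and that the parser can steer between calls.
This is plain Java and deliberately does not use ANTLR's runtime; PullLexer,
DataFileParser and forceNextString() are hypothetical names made up for
illustration, and the token rules are only an approximation of the NUMBER
and STRING rules quoted below.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pull lexer (not an ANTLR class): one token per call. */
class PullLexer {
    enum Kind { KEYWORD, NUMBER, STRING, SEMI, EOF }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    private final Reader in;
    private int next;             // one character of lookahead, -1 at end of input
    private boolean forceString;  // set by the parser, cleared after one word

    PullLexer(Reader in) { this.in = in; this.next = read(); }

    private int read() {
        try { return in.read(); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    /** Parser hook: lex the next word as STRING whatever it looks like. */
    void forceNextString() { forceString = true; }

    Token nextToken() {
        while (next != -1 && Character.isWhitespace(next)) next = read();
        if (next == -1) return new Token(Kind.EOF, "");
        if (next == ';') { next = read(); return new Token(Kind.SEMI, ";"); }
        StringBuilder sb = new StringBuilder();
        while (next != -1 && next != ';' && !Character.isWhitespace(next)) {
            sb.append((char) next);
            next = read();
        }
        String word = sb.toString();
        if (forceString) { forceString = false; return new Token(Kind.STRING, word); }
        if (word.equals("VERSION") || word.equals("LIMIT")) return new Token(Kind.KEYWORD, word);
        if (word.matches("[-+]?([0-9]+|[0-9]*\\.[0-9]*)")) return new Token(Kind.NUMBER, word);
        return new Token(Kind.STRING, word);
    }
}

/** Hypothetical hand-written parser for: 'VERSION' STRING ';' | 'LIMIT' NUMBER ';' */
class DataFileParser {
    static List<String> parse(Reader in) {
        PullLexer lexer = new PullLexer(in);
        // Results are collected only to make the sketch easy to check; a real
        // consumer of a huge file would handle each statement and drop it.
        List<String> out = new ArrayList<>();
        for (PullLexer.Token t = lexer.nextToken();
             t.kind != PullLexer.Kind.EOF;
             t = lexer.nextToken()) {
            if (t.kind != PullLexer.Kind.KEYWORD)
                throw new IllegalStateException("expected keyword, got " + t.text);
            boolean isVersion = t.text.equals("VERSION");
            // The word after VERSION has not been lexed yet, so the parser
            // can still decide how it will be interpreted.
            if (isVersion) lexer.forceNextString();
            PullLexer.Token value = lexer.nextToken();
            PullLexer.Kind want = isVersion ? PullLexer.Kind.STRING : PullLexer.Kind.NUMBER;
            if (value.kind != want)
                throw new IllegalStateException("expected " + want + ", got " + value.text);
            if (lexer.nextToken().kind != PullLexer.Kind.SEMI)
                throw new IllegalStateException("expected ';'");
            out.add(t.text + "=" + value.text);
        }
        return out;
    }
}
```

The point of the design is that the lexer holds only one character of
lookahead and never runs ahead of the parser, which is what makes the
Flex-style "start state" trick possible.  Whether something equivalent can
be wired into ANTLR's own token stream classes is a separate question that
this sketch does not answer.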

Cheers
./m

Chris verBurg wrote:
> Hey all,
>
> OK, let me try a related but far less involved question:
>
> ANTLR tokenizes all input into an internal list before parsing anything in
> that list.  (Right?)  Hence, it runs out of memory trying to read my
> 6.2-million-line input file, because that list is huge.  What's the ANTLR
> way to handle such large input streams?
>
> Thanks,
> -Chris
>
> On Thu, Apr 29, 2010 at 4:33 PM, Chris verBurg <cheetomonster at gmail.com> wrote:
>
>   
>> Hey guys,
>>
>> A question was posted a few days ago about dealing with an infinite input
>> stream, and the suggestion was to subclass TokenStream so that it didn't
>> read in all of the input upfront.
>>
>> I'm running into a similar problem, but before I go run off and subclass
>> things I thought I'd see if there's a "best practice" for my situation.  It
>> also overlaps with the "how do I use keywords as identifiers" <http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741>
>> FAQ.
>>
>> I have a data-file grammar that recognizes strings, numbers, and a ton of
>> keywords.  Pretending "VERSION" and "LIMIT" are two keywords, here's (part
>> of) the .g file:
>>
>> data_file:
>>   'VERSION' STRING ';'
>>   | 'LIMIT' NUMBER ';'
>>   ;
>>
>> NUMBER:
>>   ('-'|'+')? ('0'..'9')+
>>   | ('-'|'+')? ('0'..'9')* '.' ('0'..'9')*
>>   ;
>>
>> STRING:
>>   ('a'..'z' | 'A'..'Z' | '_' | '.' | '0'..'9')+ ;
>>
>>
>> Problem input #1:
>>
>> VERSION 1.2 ;
>>
>> The "1.2" is lexed as a number instead of a string, so I get a parse error.
>>
>> Problem input #2:
>>
>> VERSION LIMIT ;
>>
>> The "LIMIT" is lexed as a keyword instead of a string, so I get a parse
>> error.
>>
>>
>> I saw the FAQ about keywords-as-identifiers, but I don't think it's helpful
>> for me.  For the NUMBER-that-should-be-a-STRING problem, there's no exact
>> string I could pass to input.LT(1).getText().equals(), because it requires
>> a regex to match a NUMBER.  The other solution was to make an "identifier"
>> rule to match all possibilities -- is the best solution here really to
>> change the rule to 'VERSION' (STRING | NUMBER) ';'?
>>
>> For the keyword-that-should-be-a-STRING problem, I'm hesitant to use either
>> of those solutions because of the sheer number of keywords in this grammar.
>>
>>
>> Ideally what I'd like to do is what I did in Flex and Bison (which I'm
>> porting this grammar from).  What I did there was have the parser control
>> how the lexer interpreted subsequent tokens.  I embedded a rule in the
>> parser, immediately after the 'VERSION' token, to tell Flex to enter a
>> "force-the-next-token-to-be-a-STRING-no-matter-what" start state.  It worked
>> beautifully.  I got most of the way through implementing that in my ANTLR
>> grammar when I found out that ANTLRFileStream reads all the tokens in before
>> the parser even starts up -- which means the parser can't give the lexer any
>> direction over token interpretation.
>>
>>
>> Thoughts, suggestions, outrageous flames?  Is there a "good" way to do
>> this, or maybe is there a completely different approach I should take?
>>
>> Thanks!
>> -Chris