[antlr-interest] Is parser control over the lexer possible?

Thu May 6 19:37:13 PDT 2010

I too would be interested in changing that behavior.  I use ANTLR as a
command parser, so instantiating everything for every new command is a lot
of overhead.  I think ANTLR needs a mode in which it only fetches one line
at a time, by calling a GetInput routine that we supply.

 -Brian
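
For illustration, here is a minimal sketch of the "fetch one line at a time
through a routine we supply" idea. It is plain Java and deliberately uses no
ANTLR classes, so it does not show how ANTLR itself could be configured. The
class name LineFedTokens, the linesFetched counter, and the convention that the
supplied routine returns null at end of input are all assumptions made for this
example.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.function.Supplier;

/** Hypothetical on-demand token source: pulls one line at a time from a caller-supplied routine. */
class LineFedTokens {
    private final Supplier<String> getInput;              // caller-supplied; returns null at end of input
    private final Deque<String> pending = new ArrayDeque<>();
    int linesFetched = 0;                                 // exposed only to show how lazily input is read

    LineFedTokens(Supplier<String> getInput) {
        this.getInput = getInput;
    }

    /** Returns the next whitespace-separated token, or null when the input is exhausted. */
    String next() {
        // Ask for another line only when every token of the previous line has been consumed.
        while (pending.isEmpty()) {
            String line = getInput.get();
            if (line == null) {
                return null;
            }
            linesFetched++;
            for (String t : line.trim().split("\\s+")) {
                if (!t.isEmpty()) {
                    pending.add(t);
                }
            }
        }
        return pending.poll();
    }

    public static void main(String[] args) {
        Iterator<String> lines = Arrays.asList("VERSION 1.2 ;", "LIMIT 10 ;").iterator();
        LineFedTokens tokens = new LineFedTokens(() -> lines.hasNext() ? lines.next() : null);
        for (String t = tokens.next(); t != null; t = tokens.next()) {
            System.out.println(t + "  (lines fetched so far: " + tokens.linesFetched + ")");
        }
    }
}
```

The same object can be kept alive across commands, which is the overhead
concern above: nothing is re-instantiated per command, and memory use is
bounded by the longest line rather than by the whole input.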

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Mike Matera
Sent: Friday, May 07, 2010 10:13
To: Chris verBurg
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Is parser control over the lexer possible?

Hi Chris,

Yes, ANTLR reads the whole file into memory.  I don't know how to stop it
from doing that.

Cheers
./m

Chris verBurg wrote:
> Hey all,
>
> OK, let me try a related but far less involved question:
>
> ANTLR tokenizes all input into an internal list before parsing 
> anything in that list.  (Right?)  Hence, it runs out of memory trying 
> to read my 6.2-million-line input file, because that list is huge.  
> What's the ANTLR way to handle such large input streams?
>
> Thanks,
> -Chris
>
>
>
>
> On Thu, Apr 29, 2010 at 4:33 PM, Chris verBurg <cheetomonster at gmail.com> wrote:
>
>   
>> Hey guys,
>>
>> A question was posted a few days ago about dealing with an infinite 
>> input stream, and the suggestion was to subclass TokenStream so that 
>> it didn't read in all of the input upfront.
>>
>> I'm running into a similar problem, but before I go run off and 
>> subclass things I thought I'd see if there's a "best practice" for my 
>> situation.  It also overlaps with the "how do I use keywords as
>> identifiers" FAQ
>> <http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741>.
>>
>> I have a data-file grammar that recognizes strings, numbers, and a 
>> ton of keywords.  Pretending "VERSION" and "LIMIT" are two keywords, 
>> here's (part of) the .g file:
>>
>> data_file:
>>   'VERSION' STRING ';'
>>   | 'LIMIT' NUMBER ';'
>>   ;
>>
>> NUMBER:
>>   ('-'|'+')? ('0'..'9')+
>>   | ('-'|'+')? ('0'..'9')* '.' ('0'..'9')*
>>   ;
>>
>> STRING:
>>   ('a'..'z' | 'A'..'Z' | '_' | '.' | '0'..'9')+ ;
>>
>>
>> Problem input #1:
>>
>> VERSION 1.2 ;
>>
>> The "1.2" is lexed as a number instead of a string, so I get a parse
>> error.
>>
>> Problem input #2:
>>
>> VERSION LIMIT ;
>>
>> The "LIMIT" is lexed as a keyword instead of a string, so I get a 
>> parse error.
>>
>>
>> I saw the FAQ about keywords-as-identifiers, but I don't think it's 
>> helpful for me.  For the NUMBER-that-should-be-a-STRING problem, 
>> there's no exact string I could pass to 
>> input.LT(1).getText().equals(), because it requires a regex to match a
>> NUMBER.  The other solution was to make an "identifier"
>> rule to match all possibilities -- is the best solution here really 
>> to change the rule to 'VERSION' (STRING | NUMBER) ';'?
>>
>> For the keyword-that-should-be-a-STRING problem, I'm hesitant to use 
>> either of those solutions because of the sheer number of keywords in this
>> grammar.
>>
>>
>> Ideally what I'd like to do is what I did in Flex and Bison (which 
>> I'm porting this grammar from).  What I did there was have the parser 
>> control how the lexer interpreted subsequent tokens.  I embedded a 
>> rule in the parser, immediately after the 'VERSION' token, to tell 
>> Flex to enter a "force-the-next-token-to-be-a-STRING-no-matter-what" 
>> start state.  It worked beautifully.  I got most of the way through 
>> implementing that in my ANTLR grammar when I found out that 
>> ANTLRFileStream reads all the tokens in before the parser even starts 
>> up -- which means the parser can't give the lexer any direction over
>> token interpretation.
>>
>>
>> Thoughts, suggestions, outrageous flames?  Is there a "good" way to 
>> do this, or maybe is there a completely different approach I should take?
>>
>> Thanks!
>> -Chris
>>
>>
>>
>>     
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: 
> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>
>   
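
For reference, the Flex/Bison-style technique described in the quoted message
(the parser telling the lexer to treat the next token as a STRING no matter
what) depends on the lexer producing tokens one at a time, on demand. The
sketch below shows that mechanism in plain Java with a hand-written lexer and
parser; it is not ANTLR-generated code and uses no ANTLR API. The names
ModalLexerDemo and forceString are made up for this example, and the lexer is
simplified to treat any run of characters other than whitespace and ';' as one
candidate token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ModalLexerDemo {
    enum Type { KEYWORD, NUMBER, STRING, SEMI, EOF }

    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
    }

    /** Produces one token per call, so a flag set by the parser affects the very next token. */
    static final class Lexer {
        private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList("VERSION", "LIMIT"));
        private final String in;
        private int pos = 0;
        boolean forceString = false;   // set by the parser; cleared after one word token

        Lexer(String in) { this.in = in; }

        Token next() {
            while (pos < in.length() && Character.isWhitespace(in.charAt(pos))) pos++;
            if (pos >= in.length()) return new Token(Type.EOF, "");
            if (in.charAt(pos) == ';') { pos++; return new Token(Type.SEMI, ";"); }
            int start = pos;
            while (pos < in.length() && !Character.isWhitespace(in.charAt(pos)) && in.charAt(pos) != ';') pos++;
            String text = in.substring(start, pos);
            // The parser-requested mode wins over the keyword and number rules.
            if (forceString) { forceString = false; return new Token(Type.STRING, text); }
            if (KEYWORDS.contains(text)) return new Token(Type.KEYWORD, text);
            if (text.matches("[-+]?(\\d+|\\d*\\.\\d+|\\d+\\.\\d*)")) return new Token(Type.NUMBER, text);
            return new Token(Type.STRING, text);
        }
    }

    /** Parses statements of the form: 'VERSION' STRING ';' | 'LIMIT' NUMBER ';' */
    static List<String> parse(String input) {
        Lexer lexer = new Lexer(input);
        List<String> out = new ArrayList<>();
        for (Token t = lexer.next(); t.type != Type.EOF; t = lexer.next()) {
            if (t.type != Type.KEYWORD) {
                throw new IllegalStateException("expected keyword, got '" + t.text + "'");
            }
            boolean isVersion = t.text.equals("VERSION");
            lexer.forceString = isVersion;   // the parser steers the lexer before the next token is read
            Token arg = lexer.next();
            Type wanted = isVersion ? Type.STRING : Type.NUMBER;
            if (arg.type != wanted) {
                throw new IllegalStateException("expected " + wanted + ", got " + arg.type + " '" + arg.text + "'");
            }
            if (lexer.next().type != Type.SEMI) {
                throw new IllegalStateException("expected ';' after '" + arg.text + "'");
            }
            out.add(t.text + "=" + arg.type + ":" + arg.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // Both problem inputs from the quoted message, plus an ordinary LIMIT statement.
        System.out.println(parse("VERSION 1.2 ; VERSION LIMIT ; LIMIT 10 ;"));
    }
}
```

With the flag in place, both "VERSION 1.2 ;" and "VERSION LIMIT ;" parse,
because the argument is classified only after the parser has seen 'VERSION'.
The approach stops working as soon as all tokens are produced before parsing
starts, which is the obstacle described in the quoted message.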
