[antlr-interest] Is parser control over the lexer possible?

Cliff Hudson cliff.s.hudson at gmail.com
Fri May 7 00:58:11 PDT 2010

It's not at all difficult to let ANTLR handle indeterminate amounts of
input.  As has been suggested before, you will need to provide your own
token stream to provide tokens at the rate you wish to give them.  The
problem is one of interfacing the pull model of the parser with the push
model of your input. 

Lets assume you have a parser which runs indefinitely so long as you keep
feeding it tokens, until you feed it the EOF token.  For example:

command : expr+;

So long as the parser can receive tokens which match the grammar for expr, a
call to command() will not return.  In your case, you may very well have an
infinite number of tokens, but you aren't wanting to provide them up-front.
The solution is a token stream which provides tokens in batches you choose.
You can do this by letting ANTLR control the input loop, as follows:

Create your custom token stream.
Pass it to the parser.
Execute command();
In your token stream's implementation of NextToken() you will either:
1) If you have no stored tokens, ask the "user" for additional input.  If
you are reading from a file, this might consist of reading in a line and
sending it to the lexer, which will spit out a bunch of tokens. In an
interactive interpreter sort of scenario, you might prompt the user for a
line of text. Store these tokens in some ordered collection, and return the
first one; or
2) If you still have stored tokens, return the next one.

In this way, ANTLR will parse only as much as you have to give at any time,
and will wait for you to provide it more tokens if you don't have any at the
moment.  When you are finished with your session, you can pass EOF back from
NextToken() which will tell the parser that there is no more input to be
had.  The call to command() will return.

Note that as long as command() hasn't returned, ANTLR will be able to
process rules and execute grammar actions when you return tokens to it from

Does that make sense?

On the subject of the parser controlling how the lexer works, this is not
something that ANTLR is designed to do "out of the box".  However, you
control the token stream, and you control what happens when certain actions
are invoked.  You could construct a system whereby when an action is
executed, it changes the lexer used by your NextToken() call to fetch
additional tokens.  Remember, you are utterly in control of which tokens are
provided to the parser and - as illustrated by the technique above - when
they are provided.  You control which lexer you use and when.  But I would
caution you that such a system can become fairly complex - for instance you
have to ensure that token IDs don't conflict between your lexers, which may
require mapping them by hand in your NextToken call and altering the token
IDs used by the generated parser itself.  I've not tried this before, though
I'd be a bit surprised if it couldn't be done.

I personally have a system where there are two grammars and two parsers.
The 'outer' grammar knows about areas where the inner grammar would be used.
Those areas are lexed as a single token (quoted string in this case).  When
the rule in the outer grammar matches the quoted string, it uses a lexer and
parser for the 'inner' grammar to process it.

Hope this has helped.

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Brian Catlin
Sent: Thursday, May 06, 2010 7:37 PM
To: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Is parser control over the lexer possible?

I too would be interested in changing that behavior.  I use ANTLR as a
command parser, so instantiating everything for every new command is a lot
of overhead.  I think ANTLR needs a mode in where it only fetches one line
at a time, by calling a GetInput routine that we supply


-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Mike Matera
Sent: Friday, May 07, 2010 10:13
To: Chris verBurg
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Is parser control over the lexer possible?

Hi Chris,

Yes, antlr reads the whole file into memory.  I don't know how to stop it
from doing that. 


Chris verBurg wrote:
> Hey all,
> OK, let me try a related but far less involved question:
> ANTLR tokenizes all input into an internal list before parsing 
> anything in that list.  (Right?)  Hence, it runs out of memory trying 
> to read my 6.2-million-line input file, because that list is huge.  
> What's the ANTLR way to handle such large input streams?
> Thanks,
> -Chris
> On Thu, Apr 29, 2010 at 4:33 PM, Chris verBurg
<cheetomonster at gmail.com>wrote:
>> Hey guys,
>> A question was posted a few days ago about dealing with an infinite 
>> input stream, and the suggestion was to subclass TokenStream so that 
>> it didn't read in all of the input upfront.
>> I'm running into a similar problem, but before I go run off and 
>> subclass things I thought I'd see if there's a "best practice" for my 
>> situation.  It also overlaps with the "how do I use keywords as
>> FAQ.
>> I have a data-file grammar that recognizes strings, numbers, and a 
>> ton of keywords.  Pretending "VERSION" and "LIMIT" are two keywords, 
>> here's (part
>> of) the .g file:
>> data_file:
>>   | 'LIMIT' NUMBER ';'
>>   ;
>>   ('-'|'+')? ('0'..'9')+
>>   | ('-'|'+')? ('0'..'9')* '.' ('0'..'9')*
>>   ;
>>   ('a'..'z' | 'A'..'Z' | '_' | '.' | '0'..'9')+ ;
>> Problem input #1:
>> VERSION 1.2 ;
>> The "1.2" is lexed as a number instead of a string, so I get a parse
>> Problem input #2:
>> The "LIMIT" is lexed as a keyword instead of a string, so I get a 
>> parse error.
>> I saw the FAQ about keywords-as-identifiers, but I don't think it's 
>> helpful for me.  For the NUMBER-that-should-be-a-STRING problem, 
>> there's no exact string I could pass to 
>> input.LT(1).getText().equals(), because it requires a regex to match a
NUMBER.  The other solution was to make an "identifier"
>> rule to match all possibilities -- is the best solution here really 
>> to change the rule to 'VERSION' (STRING | NUMBER) ';'?
>> For the keyword-that-should-be-a-STRING problem, I'm hesitant to use 
>> either of those solutions because of the sheer number of keywords in this
>> Ideally what I'd like to do is what I did in Flex and Bison (which 
>> I'm porting this grammar from).  What I did there was have the parser 
>> control how the lexer interpreted subsequent tokens.  I embedded a 
>> rule in the parser, immediately after the 'VERSION' token, to tell 
>> Flex to enter a "force-the-next-token-to-be-a-STRING-no-matter-what" 
>> start state.  It worked beautifully.  I got most of the way through 
>> implementing that in my ANTLR grammar when I found out that 
>> ANTLRFileStream reads all the tokens in before the parser even starts 
>> up -- which means the parser can't give the lexer any direction over
token interpretation.
>> Thoughts, suggestions, outrageous flames?  Is there a "good" way to 
>> do this, or maybe is there a completely different approach I should take?
>> Thanks!
>> -Chris
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: 
> http://www.antlr.org/mailman/options/antlr-interest/your-email-address

This email and any attachments are intended for the sole use of the named
recipient(s) and contain(s) confidential information that may be
proprietary, privileged or copyrighted under applicable law. If you are not
the intended recipient, do not read, copy, or forward this email message or
any attachments. Delete this email message and any attachments immediately.

List: http://www.antlr.org/mailman/listinfo/antlr-interest

List: http://www.antlr.org/mailman/listinfo/antlr-interest

More information about the antlr-interest mailing list