[antlr-interest] How do I accept input ending with a newline *or* EOF?

chris king kingces95 at gmail.com
Fri Feb 4 14:46:48 PST 2011


Douglas, thanks for the reply. Yes, it occurred to me to try implementing
#if and #ifdef in the lexer. The problem I encountered is that
pre-processor statements can contain arithmetic expressions. I don't think
those can be expressed in a lexer alone.
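
To illustrate what evaluating such a condition involves (operator
precedence and nested parentheses, work usually given to a parser), here is
a minimal sketch in Python rather than ANTLR. The function name
eval_condition, the operator set, and the rule that undefined symbols
evaluate to 0 are all assumptions made for the example, not anything ANTLR
provides.

```python
import re

# Hypothetical evaluator for the condition of an #if line.
# Supports integers, identifiers (looked up in `symbols`, 0 if undefined),
# parentheses, unary ! and -, and the binary operators listed in LEVELS.
TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|==|!=|&&|\|\||[-+*/()<>!])")

# Binary operators grouped by precedence, lowest first.
LEVELS = [["||"], ["&&"], ["==", "!="], ["<", ">"], ["+", "-"], ["*", "/"]]


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("bad character at offset %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def apply_op(op, a, b):
    if op == "||":
        return int(bool(a) or bool(b))
    if op == "&&":
        return int(bool(a) and bool(b))
    if op == "==":
        return int(a == b)
    if op == "!=":
        return int(a != b)
    if op == "<":
        return int(a < b)
    if op == ">":
        return int(a > b)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    return a // b  # "/": floor division, which differs from C for negatives


def eval_condition(text, symbols):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def primary():
        if peek() is None:
            raise ValueError("unexpected end of expression")
        tok = advance()
        if tok == "(":
            value = binary(0)  # recursion handles arbitrary nesting
            if peek() != ")":
                raise ValueError("expected ')'")
            advance()
            return value
        if tok == "!":
            return int(not primary())
        if tok == "-":
            return -primary()
        if tok.isdigit():
            return int(tok)
        if tok[0].isalpha() or tok[0] == "_":
            return symbols.get(tok, 0)
        raise ValueError("unexpected token %r" % tok)

    def binary(level):
        # One precedence level: operands come from the next tighter level.
        if level == len(LEVELS):
            return primary()
        left = binary(level + 1)
        while peek() in LEVELS[level]:
            op = advance()
            left = apply_op(op, left, binary(level + 1))
        return left

    result = binary(0)
    if pos != len(tokens):
        raise ValueError("unexpected token %r" % tokens[pos])
    return result
```

A lexer action could call something like eval_condition("(VERSION + 1) * 2
> 5 && !DEBUG", {"VERSION": 2}) and treat any nonzero result as "block
enabled".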

It also occurred to me that channels might be of use. The problem I
encountered was that channels only take effect after a lexer rule has been
chosen, so they don't seem to address the root problem, which is that
depending on the context of the parser I need some lexer rules to be enabled
and others disabled. For example, the text inside a disabled block need not
be parsable by the compiler, so if I'm inside a disabled block I simply
want to lex everything up to the #endif into one big token. That rule to
select everything up to the next #endif is very aggressive, and as such it
will usually end up with the longest match and hence be selected when it
should not be -- in the case where the #if block is enabled.

I think anyone building a C#, C, or Java compiler must have encountered this
issue. The only way I can see around this problem is to lex portions of the
input string twice. But ANTLR being so nice to work with in other respects
makes me wonder if I'm not missing some convention that would allow me to
lex the string only once. To do that I think I'd need lexer rules to be
enabled and disabled at runtime.
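
The idea of switching lexer rules on and off at runtime can be sketched
outside ANTLR as a hand-written, single-pass lexer with two modes. This is
only an illustration in Python under simplifying assumptions: the only
directives are #define, #ifdef and #endif, each on its own line, and the
names lex, skip_disabled_block and the SKIPPED token type are hypothetical.

```python
import re

# A directive occupies a whole line; the pattern consumes through the newline.
DIRECTIVE = re.compile(
    r"[ \t]*#[ \t]*(ifdef|endif|define)\b[ \t]*([A-Za-z_]\w*)?[^\n]*\n?")
# Rules that are only active in the normal (enabled) mode.
NORMAL = re.compile(r"(?P<ID>[a-z]+)|(?P<NEW_LINE>\r?\n)|(?P<WS>[ \t]+)")


def skip_disabled_block(text, pos, tokens):
    """Disabled mode: swallow everything up to the matching #endif."""
    start, depth = pos, 1
    while pos < len(text):
        newline = text.find("\n", pos)
        line_end = len(text) if newline == -1 else newline + 1
        m = DIRECTIVE.match(text, pos)
        if m and m.group(1) == "ifdef":
            depth += 1  # nested block inside the disabled region
        elif m and m.group(1) == "endif":
            depth -= 1
            if depth == 0:
                if pos > start:
                    tokens.append(("SKIPPED", text[start:pos]))
                return line_end
        pos = line_end
    raise ValueError("missing #endif")


def lex(text, defined=()):
    defined = set(defined)
    tokens, pos = [], 0
    while pos < len(text):
        at_line_start = pos == 0 or text[pos - 1] == "\n"
        m = DIRECTIVE.match(text, pos) if at_line_start else None
        if m:
            kind, name = m.group(1), m.group(2)
            if kind != "endif" and name is None:
                raise ValueError("#%s needs a symbol name" % kind)
            pos = m.end()
            if kind == "define":
                defined.add(name)
            elif kind == "ifdef" and name not in defined:
                # Switch modes: the normal rules are never tried in here.
                pos = skip_disabled_block(text, pos, tokens)
            # An enabled #ifdef and its #endif produce no tokens.
            continue
        m = NORMAL.match(text, pos)
        if m is None:
            raise ValueError("cannot lex input at offset %d" % pos)
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Because the disabled-block rule is only tried after an #ifdef has been found
to be false, it never competes with ID for the longest match, and the input
is scanned once.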

Thanks,
Chris

On Fri, Feb 4, 2011 at 10:44 AM, Douglas Godfrey
<douglasgodfrey at gmail.com> wrote:

> Implement #if and #ifdef by parsing the preprocessor statements in the
> lexer and setting the channel for all non-preprocessor tokens based on
> whether the preprocessor statements select the "true" case.
>
> The parser would only see the lexer tokens that were enabled by the #if and
> #ifdef statements.
>
>
> On Thu, Feb 3, 2011 at 7:00 PM, chris king <kingces95 at gmail.com> wrote:
>
>> Kirby, thanks! That helped a ton and thanks for that + vs * tip. A real
>> life saver.
>>
>> I have another problem and I'm hoping you can point me in the right
>> direction. I'm trying to choose between two approaches for building a
>> pre-processor. The first (1) approach is to have the pre-processor pass
>> tokens to the compiler. The second (2) approach is to have the
>> pre-processor pass strings (those that have not been #ifdef'ed out) to the
>> compiler. The former seems more natural but complicates the lexer because
>> the lexing is context sensitive (see below). The latter simplifies both
>> pre-processor and compiler but feels ugly because it requires the input to
>> be lexed twice.
>>
>> As I said, the problem I encountered with the first approach is that the
>> lexer is context sensitive. For example, consider the following toy
>> grammar where pre-processor identifiers can be upper or lower case but
>> language identifiers can only be lower case. The input "#define HELLO"
>> parses fine but "#define hello" fails because (I assume) "hello" could be
>> matched by two lexer productions -- ID and PP_ID. I tried inserting a
>> predicate in ID (e.g. ID : {false}?=> 'a'..'z';) to provide context but if
>> I do then ANTLRWorks spins when I try to interpret any input. I've also
>> tried fiddling with the order of ID and PP_ID but each ordering has its own
>> problems (e.g. I can only make one of the following work for a given order:
>> { "hello", "#define hello" }).
>>
>> start
>>        : input*
>>        ;
>> input
>>        : ID+ (NEW_LINE | EOF)
>>        | pp_input
>>        ;
>>
>> pp_input
>>        : '#' 'define' PP_ID+ (NEW_LINE | EOF)
>>        ;
>>
>> NEW_LINE
>>        : '\r' '\n'
>>        ;
>> ID
>>        : 'a'..'z';
>>
>> PP_ID
>>        : 'a'..'z'
>>        | 'A'..'Z';
>>
>> This seems like a standard 101 type problem space so hopefully you've
>> explored it and can direct me! :)
>>
>> Thanks,
>> Chris
>>
>> On Mon, Jan 31, 2011 at 4:03 PM, Kirby Bohling <kirby.bohling at gmail.com> wrote:
>>
>> > No idea if it is related to the problem, but you likely really want to
>> > have ID use a '+' not a '*' after ('a'..'z'), otherwise ID can match
>> > nothing and cause an infinite loop while lexing at points. Generally
>> > speaking, any time you have rules like
>> >
>> > bar: (foo)*;
>> >
>> > foo: (baz)*;
>> >
>> > you are just asking for problems, whether foo and baz are lexer or
>> > parser rules.  Every time I do that it is a mistake (or a failure of
>> > imagination).  Generally speaking, for low-level items you want to force
>> > the consumption of something, and make them optional at a higher level
>> > (at least that has been true in my limited experience).
>> >
>> > I believe the EOF problem is precisely because of the lack of a + vs. a *
>> > there.  Rather than consuming the EOF, the lexer can spin consuming nothing
>> > forever.  But I didn't actually crack out ANTLR and check.
>> >
>> > Also, unless you really know what you are doing, you might want to
>> > skip using constants in your parser rules.  While many of the examples do
>> > so, from what I've read, it can have complex interactions (it generates
>> > a token for it internally that can't be seen).  I'd try making a
>> > NEWLINE token and seeing if that helps make the error message any
>> > clearer.
>> >
>> > Kirby
>> >
>> >
>> > On Mon, Jan 31, 2011 at 5:49 PM, chris king <kingces95 at gmail.com> wrote:
>> > > Hello! I'm trying to write a grammar that will accept lines of zero or
>> > > more IDs and I'd like to allow the last line to end in a new line *or*
>> > > EOF. I came up with this grammar:
>> > >
>> > > grammar test;
>> > >
>> > > start
>> > >  : input*
>> > >  ;
>> > >
>> > > input
>> > >  : ID* ('\n' | EOF)
>> > >  ;
>> > >
>> > > ID
>> > >  : ('a'..'z')*
>> > >  ;
>> > >
>> > > WHITESPACE
>> > >  : ' '+ {skip();}
>> > >  ;
>> > >
>> > > But got this error from ANTLRWorks saying start has un-reachable
>> > > alternatives:
>> > >
>> > > [15:38:33] error(201): test2.g:9:5: The following alternatives can
>> > > never be matched: 2
>> > >
>> > > If I remove the reference to EOF then everything works but I have to
>> > > end the last line in a new line and I don't want to have to do that. Any
>> > > suggestions?
>> > >
>> > > Thanks,
>> > > Chris
>> > >
>> > > List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> > > Unsubscribe:
>> > > http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>> > >
>> >
>>
>>
>
>

