[antlr-interest] How do I accept input ending with a newline or EOF?

Thu Feb 3 19:21:23 PST 2011

Kirby, thanks again for the great tips and corrections. After digesting all
that I think I'll be able to work through many of my issues (some I didn't
even know I had!). I promise I did try to do my best to create a good
example :) but I'm afraid my many newbe mistakes distracted from my main
question. So instead of trying to author a grammar let me instead describe
what I'm trying to do and ask if you could please cook me up a small
grammar.

Suppose I want to create a grammar to recognize integers with one integer
per line. Additionally suppose I want a preprocessor such that I can wrap
groups of integers in #if [expr] ... #endif where if expr has basic
arithmetic operations. If expr evaluates to greater than 0 the enclosed
lines are lexed else they are not. What would that grammar (or even just
lexer) look like?

So for example, it should recognize this input and return an integer token
10 but not a token for 20. It should also ignore the comment.

10
#if (2-1)*3-10
20 and just ignore this comment
#endif

I cannot come up with a lexer that can lex the whole file without knowing if
it's in an active #if\#endif block so I'm stuck before I can even begin my
grammar.

Thanks (in advance!),
Chris

On Thu, Feb 3, 2011 at 5:05 PM, Kirby Bohling <kirby.bohling at gmail.com>wrote:

> On Thu, Feb 3, 2011 at 6:00 PM, chris king <kingces95 at gmail.com> wrote:
> > Kirby thanks! That helped a ton and thanks for that + vs * tip. A real
> life
> > saver.
> > I have another problem and I'm hoping you can point me in the right
> > direction. I'm trying to chose between two approaches for building for a
> > pre-processor. The first (1) approach is to have the pre-processor pass
> > tokens to the compiler. The second (2) approach is to have the
> pre-processor
> > pass strings (those that have not been #if defed out) to the compiler.
> The
> > former seems more natural but complicates the lexer because the the
> lexing
> > is context sensitive (see below). The latter simplifies both
> pre-processor
> > and compiler but feels ugly because it requires the input to be lexered
> > twice.
> > As I said, the problem I encountered with the first approach is that the
> > lexer is context sensitive. For example, consider the following toy
> grammar
> > where pre-processor identifiers can be upper or lower case but language
> > identifiers can only be lower case. The input "'#define HELLO" parses
> fine
> > but "#define hello" fails because (I assume) "hello" could be match by
> two
> > lexer productions -- ID and PP_ID. I tried inserting a predicate in ID
> > (e.g. ID : {false}?=> 'a'..'z';) to provide context but if I do then
> > ANTLRWorks spins when I try to interpret any input. I've also tried
> fiddling
> > with the order of ID and PP_ID but each ordering has it's own problems
> (e.g.
> > can only make one of the following for a given order: { "hello", "#define
> > hello" }).
>
> Chris,
>
>   I'm not much of an ANTLR expert, but here are my thoughts.  My
> first thought is go read what Jim Idle has to say on the list, he
> dispenses a lot of practical advice about parsing.  In this case, I
> believe Jim's stock advice is that you are doing too much in the lexer
> and parser.  In order to generate decent error messages, you should
> just skip figuring out if the item is in capital or lower case letters
> until after the input is parsed.  Just have one rule 'ID'.  Once you
> have the whole thing parsed and in memory, use a tree walker to
> validate that the input consumed by the ID that is after a '#define'
> is in all caps.  If it isn't, print out a really nice error message,
> when you have a lot of semantic information and context to print out a
> good error.
>
> First, you're mistaken about why "hello" vs. "HELLO" is causing you
> problems.  It can be lexed exactly one way.  If it matches two rules,
> the first rule listed wins (if you inverted ID and PP_ID, then ID
> tokens could never be generated).  One of the things about ANTLR is
> that the lexer runs to completion before the parser ever runs (even if
> it doesn't, it logically does).  I always split up my lexer and
> parser, specifically so I never get confused into thinking that the
> parser will cause the lexer to backtrack and generate a new token type
> (it would with other parser rules, but tokens, absolutely not).  The
> lexer has some other behavior that isn't like the parser.  That's the
> advanced lesson for a different e-mail.
>
> So if for some reason you insist on doing the validation early do
> something like this:
>
> pp_input:
> '#' 'define' (ID|PP_ID)+ (NEW_LINE | EOF)
> ;
>
> Down lower, I'd point out that you really want to to be: (ID|PP_ID)
> with no plus after it (or better yet just have it be 'ID' if you
> follow the advice above), I think your lexer rule is not doing what
> you expect.  Which leads to another piece of advice:
>
> Lex your stuff, and then dump it out in a format that makes sense to
> you, print the token and then the token text.  Make absolutely sure
> the lexer is generating what you expect (the lexer has a lot more
> surprises then the parser does IMHO).  Then go mentally execute the
> parser, and figure out what is going on.  In this case 'hello' is
> generating an 'ID' token, and your rule only allowed a 'PP_ID' token.
> That is the likely culprit with some of the problems you are
> describing.
>
> > start
> >         : input*
> >         ;
> > input
> >         : ID+ (NEW_LINE | EOF)
> >         | pp_input
> >         ;
> > pp_input
> >         : '#' 'define' PP_ID+ (NEW_LINE | EOF)
> >         ;
> > NEW_LINE
> >         : '\r' '\n'
> >         ;
> > ID
> >         : 'a'..'z';
> >
> > PP_ID
> >         : 'a'..'z'
> >         | 'A'..'Z';
>
> I'd not put '#' 'define' in a parser rule.  I'd make it a proper token
> for clarity.
>
> Next, I'm pretty sure you mean that ID has a value of 'hello', the way
> you have the rules constructed, the input 'hello' is 5 tokens (one of
> 'h', 'e', 'l', 'l', and 'o').  So first things, first, I'd fix that
> (and PP_ID analogously, if you insist on keeping it).
>
> You aren't skipping whitespace, but I'm assuming you just elided that
> from this grammar.
>
> Next, are you sure you don't mean ('\n' | '\r') for new line?  I think
> I'd just follow Jim's advice and ensure that you have a new line at
> the end.  It would make the EOF handling much easier (or barring that,
> just add EOF to as an alternative to NEWLINE rule and the grammar
> becomes easier to read).
>
> Finally, I'd stop using ANTLR works temporarily.  I'd generate the
> code, and step through the Java in a debugger.  One of the beauties of
> ANTLR is that you can tell a _lot_ about what is going on just from
> the stack trace and stepping through the code (recursive descent
> parsers are fantastic in this respect).
>
> If it is spinning, that is generally a sign that you have done
> something silly and have a rule that allow consumes no input and is
> repeated.  There is nothing obvious to me just from looking at the
> grammar.  You've got enough small problems that I'll let you sort
> those out first.  From what you've sent me, nothing would actually
> work, so it'll be hard to figure out what is really going on with
> precision.
>
> I'd split your input rule just for symmetry to be something like:
>
> input :
>     ID+ (NEW_LINE | EOF)
> ;
>
> pp_input:
>     DEFINE PP_ID+ (NEW_LINE | EOF)
> ;
>
> Then have a third rule like:
>
> stmt:
>    (input | pp_input)
> ;
>
> Modify the start to use it:
> start:
>    (stmt)* EOF;
>
> I like the symmetry, and I would guess it makes it easier to read the
> stack traces to figure out where you are (you can tell if it is
> attempting to parse the input as a input, or as a pre-processor
> input).
>
> Again, my suggestions very concisely are:
>
> 1. Split the Lexer and the Parser up.
> 2. Do not use the ability to have generated tokens in the parser (read
> remove '#define' from the parser rule.
> 3. Don't try and differentiate between the uppercase/lowercase stuff
> inside the lexer.
> 4. Run the lexer and make sure it generates exactly the tokens you think it
> is.
> 5. Run the parser in a debugger, recursive descent style parsers are
> how you would write a parser by hand, so they make a lot of sense when
> you step through them and the stack traces tell you exactly how it is
> attempting to match the input so far.
>
> Thanks,
>   Kirby
>
> PS:  Assembling all my instructions, except splitting the lexer and
> parser apart (and not running any of this, so if it breaks, keep the
> pieces):
>
> // Add EOF, to force all the input to be consumed, or generate an error.
> start:
>    stmt* EOF
>        ;
>
> stmt:
>    input|pp_input
> ;
>
> input
>        : ID NEW_LINE;
>        ;
>
> pp_input
>        :  DEFINE ID NEW_LINE
>        ;
>
> // NOTE: This is going to cause you problems if you have a '#', but
> not a '#define'
> // again I'm skipping that to get you past this point.
> DEFINE:
>    '#define'
> ;
>
> NEW_LINE
>        : '\r' '\n' | EOF
>        ;
> ID
>        : 'a'..'z'
>        | 'A'..'Z';
>
> // Add a whitespace rule somewhere in here...
>
> <snip...>
>

[antlr-interest] How do I accept input ending with a newline *or* EOF?

[antlr-interest] How do I accept input ending with a newline or EOF?