[antlr-interest] Have I found an Antlr CSharp3 lexer bug if...

Thu Jul 28 19:26:28 PDT 2011

Fortunately the C# preprocessor is extremely basic, so the task shouldn't be
hard at all. To start with, it's important to understand that the
preprocessor *must* be implemented with the lexer, because the following is
valid:

#if false

@"

#endif

If the lexer treats the @" as the start of a verbatim string, it will consume
the #endif as part of that string and mess everything up. Here's what I would
do:

  * Implement a basic lexer rule to consume the characters following the
    #directive, up to but not including a single line comment marker //

  * Use a separate expression grammar to parse preprocessor expressions.

  * Set a flag in the lexer if the next code is excluded code.

  * Override NextToken for the lexer, and if the flag is set to true, call out
    to a rule other than mTokens (a basic implementation of lexer modes).

When I release version 3.4 of the runtime, the Lexer class will have a new
method, ParseNextToken, which can be overridden to help perform this task. I
haven't tested the following, but it's what I would start with if I wanted to
make a C# preprocessor.

fragment PP_DEFINE:;
fragment PP_UNDEF:;
fragment PP_IF:;
fragment PP_ELSE:;
fragment PP_ENDIF:;

PP_TOKEN
        :       {input.CharPositionInLine == 0}? =>
                WS? '#' WS?
                (       'define' {$type=PP_DEFINE;}
                |       'undef' {$type=PP_UNDEF;}
                |       'if'    {$type=PP_IF;}
                |       'else'  {$type=PP_ELSE;}
                |       'endif' {$type=PP_ENDIF;}
                )
                ~('\r' | '\n' | '/')*
        ;

fragment
EXCLUDED_CODE
        :       PP_TOKEN
        |       (       WS
                |       ~(' ' | '\t' | '#')+
                )
                {state.type = EXCLUDED_CODE; state.channel = Hidden;}
        ;

WS
        :       (' ' | '\t')+
        ;

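partial class CSharpLexer {

// "true" is pre-defined so that it evaluates like any other defined symbol.
private readonly HashSet<string> _definitions = new HashSet<string>(new string[] { "true" });

// The bottom entry represents the file itself: always included, and marked as
// having seen #else so that a stray top-level #else is reported as an error.
private readonly Stack<IncludedCodeState> _includedCode = new Stack<IncludedCodeState>(new IncludedCodeState[] { new IncludedCodeState(true, true) });

private bool _foundToken = false;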

public override IToken NextToken() {
    while (true) {
        IToken token = base.NextToken();
        switch (token.Type) {
        case PP_DEFINE:
            if (_includedCode.Peek().IsIncluded)
            {
                if (_foundToken)
                    throw new RecognitionException("Cannot define/undefine preprocessor symbols after first token in file");
                string name = token.Text;
                name = name.Substring(name.IndexOf("define") + 6).Trim();
                if (name == "true" || !Regex.IsMatch(name, @"^[A-Za-z_][\w]*$"))
                    throw new RecognitionException("Expected identifier in preprocessor.");
                _definitions.Add(name);
            }
            continue;

        case PP_UNDEF:
            if (_includedCode.Peek().IsIncluded)
            {
                if (_foundToken)
                    throw new RecognitionException("Cannot define/undefine preprocessor symbols after first token in file");
                string name = token.Text;
                name = name.Substring(name.IndexOf("undef") + 5).Trim();
                if (name == "true" || !Regex.IsMatch(name, @"^[A-Za-z_][\w]*$"))
                    throw new RecognitionException("Expected identifier in preprocessor.");
                _definitions.Remove(name);
            }
            continue;

        case PP_IF:
            {
                string expression = token.Text;
                expression = expression.Substring(expression.IndexOf("if") + 2).Trim();
                _includedCode.Push(new IncludedCodeState(EvaluatePreprocessorExpression(expression), false));
            }
            continue;

        case PP_ENDIF:
            // Only the file-level entry is left, so there is no open #if to close.
            if (_includedCode.Count == 1)
                throw new RecognitionException("Mismatched #endif in preprocessor.");
            _includedCode.Pop();
            continue;

        case PP_ELSE:
            if (_includedCode.Peek().FoundElseDirective)
                throw new RecognitionException("Mismatched #else in preprocessor.");
            _includedCode.Push(_includedCode.Pop().ElseState);
            continue;

        default:
            if (token.Channel == TokenChannels.Default)
                _foundToken = true;
            return token;
        }
    }
}

private bool? EvaluatePreprocessorExpression(string expression) {
    // Inside excluded code the result is irrelevant; null keeps the whole
    // nested #if/#else block excluded.
    if (!_includedCode.Peek().IsIncluded)
        return null;
    throw new NotImplementedException("Use a very simple expression parser here to parse and evaluate the Boolean expression.");
}

protected override void ParseNextToken() {
    if (!_includedCode.Peek().IsIncluded)
        mEXCLUDED_CODE();
    else
        base.ParseNextToken();
}

public struct IncludedCodeState {
    public readonly bool FoundElseDirective;
    private readonly bool? _isIncluded;

    public IncludedCodeState(bool? isIncluded, bool foundElseDirective) {
        _isIncluded = isIncluded;
        FoundElseDirective = foundElseDirective;
    }

    public bool IsIncluded { get { return _isIncluded ?? false; } }

    public IncludedCodeState ElseState {
        get {
            // null means the enclosing block is excluded, so #else stays excluded too.
            if (_isIncluded == null)
                return new IncludedCodeState(_isIncluded, true);
            return new IncludedCodeState(!_isIncluded, true);
        }
    }
}

}
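
The NotImplementedException above leaves the expression parser open. Here is a
minimal sketch of one way it might be filled in, assuming the usual C#
preprocessor operators (!, ==, !=, &&, ||, parentheses) and a symbol set like
_definitions. The class PreprocessorExpressionEvaluator and its Evaluate method
are hypothetical names for illustration; they are not part of the ANTLR
runtime.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of ANTLR): recursive-descent evaluator for
// C# preprocessor expressions over a set of defined symbols.
public sealed class PreprocessorExpressionEvaluator
{
    private readonly string _text;
    private readonly ICollection<string> _definitions;
    private int _pos;

    private PreprocessorExpressionEvaluator(string text, ICollection<string> definitions)
    {
        _text = text;
        _definitions = definitions;
    }

    public static bool Evaluate(string text, ICollection<string> definitions)
    {
        var evaluator = new PreprocessorExpressionEvaluator(text, definitions);
        bool result = evaluator.ParseOr();
        evaluator.SkipWhitespace();
        if (evaluator._pos != text.Length)
            throw new FormatException("Unexpected input at offset " + evaluator._pos);
        return result;
    }

    // or : and ('||' and)*   (non-short-circuit |= so the right side is always parsed)
    private bool ParseOr()
    {
        bool value = ParseAnd();
        while (TryConsume("||"))
            value |= ParseAnd();
        return value;
    }

    // and : equality ('&&' equality)*
    private bool ParseAnd()
    {
        bool value = ParseEquality();
        while (TryConsume("&&"))
            value &= ParseEquality();
        return value;
    }

    // equality : unary (('==' | '!=') unary)*
    private bool ParseEquality()
    {
        bool value = ParseUnary();
        while (true)
        {
            if (TryConsume("=="))
                value = value == ParseUnary();
            else if (TryConsume("!="))
                value = value != ParseUnary();
            else
                return value;
        }
    }

    // unary : '!' unary | primary
    private bool ParseUnary()
    {
        if (TryConsume("!"))
            return !ParseUnary();
        return ParsePrimary();
    }

    // primary : '(' or ')' | 'true' | 'false' | identifier
    private bool ParsePrimary()
    {
        if (TryConsume("("))
        {
            bool value = ParseOr();
            if (!TryConsume(")"))
                throw new FormatException("Expected ')' at offset " + _pos);
            return value;
        }
        SkipWhitespace();
        int start = _pos;
        while (_pos < _text.Length && (char.IsLetterOrDigit(_text[_pos]) || _text[_pos] == '_'))
            _pos++;
        if (_pos == start || char.IsDigit(_text[start]))
            throw new FormatException("Expected identifier at offset " + start);
        string name = _text.Substring(start, _pos - start);
        if (name == "true") return true;
        if (name == "false") return false;
        return _definitions.Contains(name);
    }

    private bool TryConsume(string symbol)
    {
        SkipWhitespace();
        if (_pos + symbol.Length > _text.Length
            || string.CompareOrdinal(_text, _pos, symbol, 0, symbol.Length) != 0)
            return false;
        _pos += symbol.Length;
        return true;
    }

    private void SkipWhitespace()
    {
        while (_pos < _text.Length && char.IsWhiteSpace(_text[_pos]))
            _pos++;
    }
}
```

With something like this in place, EvaluatePreprocessorExpression could return
PreprocessorExpressionEvaluator.Evaluate(expression, _definitions) instead of
throwing. A separate ANTLR expression grammar, as suggested earlier, would be
the other obvious route.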

Sam

From: chris king [mailto:kingces95 at gmail.com] 
Sent: Thursday, July 28, 2011 7:05 PM
To: Sam Harwell
Cc: antlr-interest at antlr.org
Subject: Re: Have I found an Antlr CSharp3 lexer bug if...

Sam, thanks so much for taking the time to look at that. If I could, let me
try and explain what I'm trying to do and tell me if you think it's
possible. For my own edification, I'm trying to implement a C# grammar. I'd
like to implement the pre-processor at the moment. Implementations I've seen
generally use only a lexer, with some type of trick to maintain a stack
(e.g. for nested ifdefs and simple if/elif expressions). I figure why not
use a parser to maintain the stack -- isn't that the reason for existence
for parsers anyway? So that's what I'm trying to do -- use a lexer and
parser to implement the pre-processor. 

The big difficulty is changing the lexer rules depending on whether I'm in an
#ifdef block that is active or not. I figured with ANTLR I'd be able to
compute if the #ifdef block is active and then throw a switch to either
parse tokens and hand those tokens off to the C# parser or consume and
ignore all input up to the next pre-processor instruction thereby disabling
that chunk of code. If I can do this then I could put the pre-processor and
parser in the same file and construct the AST in one pass! Would that be
cool? And clean? And maybe worth making a goal for ANTLR to be able to do?
:)

To be a bit more concrete: Here is the production for matching newline at
the end of pre-processor instructions. The idea would be to enable
PP_SKIPPED_CHARACTERS only if inside a disabling #ifdef block which would
consume all characters till the next pre-processing instruction.

pp_new_line
  : SINGLE_LINE_COMMENT? ((NEW_LINE! PP_SKIPPED_CHARACTERS) | EOF!)
  ;

Here is what I was hoping would work as PP_SKIPPED_CHARACTERS. Unfortunately
I don't seem to understand how to flip lexer rules on and off well enough to
make this work...

PP_SKIPPED_CHARACTERS
  : { IfDefedOut }? ( ~(F_NEW_LINE_CHARACTER | F_POUND_SIGN) F_INPUT_CHARACTER* F_NEW_LINE )*
  ;

I hope that is enough to give you an idea of what I'm trying to do. This
approach just seems so elegant to me (by which I mean almost all declarative
-- no need to sprinkle procedural logic in among my productions to maintain
a stack or whatever) that I'd hope it would be doable in ANTLR. What do you
think? Is it a worthy goal? Does it feel possible to you? If not, is it a
goal worth trying to achieve?

Thanks,
Chris

On Thu, Jul 28, 2011 at 2:37 PM, Sam Harwell <sharwell at pixelminegames.com>
wrote:

Hi Chris,

Lookahead prediction occurs before predicates are evaluated. If fixed
lookahead uniquely determines the alternative with a semantic predicate,
the predicate will not be evaluated as part of the decision process. I'm
guessing (but not 100% sure) that if you use a gated semantic predicate, the
rule will not be entered:

PP_SKIPPED_CHARACTERS
  : {false}? => ( ~(F_NEW_LINE_CHARACTER | '#') F_INPUT_CHARACTER* F_NEW_LINE )*
  ;
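
To make the difference concrete, here is a deliberately simplified model in
C#. It is not what ANTLR generates; PredicateDemo and both method names are
made up for illustration, and the single-character lookahead test stands in
for the real prediction logic. It only shows where the predicate gets checked
in each case.

```csharp
using System;

// Simplified model only; not ANTLR-generated code.
public static class PredicateDemo
{
    // Plain {p}? predicate: lookahead alone selects the rule, and the
    // predicate is only validated after the rule has been entered.
    public static string WithValidatingPredicate(char lookahead, bool predicate)
    {
        if (lookahead != '#' && lookahead != '\n')
        {
            if (!predicate)
                throw new InvalidOperationException("Failed predicate in PP_SKIPPED_CHARACTERS");
            return "PP_SKIPPED_CHARACTERS";
        }
        return "other";
    }

    // Gated {p}? => predicate: the predicate is part of the prediction, so
    // a false predicate means the rule is never chosen in the first place.
    public static string WithGatedPredicate(char lookahead, bool predicate)
    {
        if (predicate && lookahead != '#' && lookahead != '\n')
            return "PP_SKIPPED_CHARACTERS";
        return "other";
    }
}
```

In this model, WithValidatingPredicate('x', false) throws, which matches the
FailedPredicateException you are seeing, while WithGatedPredicate('x', false)
simply falls through to another rule.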

Also, a word of warning: this lexer rule can match a zero-length character
span, which could result in an infinite loop. You should always ensure that
every path through any lexer rule that's not marked "fragment" will consume
at least 1 character. There's also a bug with certain exceptions in the
lexer that can cause infinite loops - this has been resolved for release 3.4
but I haven't released it yet.
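
As a toy illustration of the zero-length problem (again, not ANTLR code: the
class and method names are hypothetical, and '\n' stands in for the full set of
new-line characters), consider a hand-written matcher for a ( ... )* rule and a
token loop around it:

```csharp
using System;

// Toy illustration only; this is not how the ANTLR runtime is written.
public static class ZeroLengthTokenDemo
{
    // Mimics ( ~('\n' | '#') ~'\n'* '\n' )* : matches zero or more whole
    // lines that do not start with '#', returning the index past the match.
    public static int MatchSkippedLines(string input, int start)
    {
        int pos = start;
        while (pos < input.Length && input[pos] != '\n' && input[pos] != '#')
        {
            int newline = input.IndexOf('\n', pos);
            if (newline < 0)
                break;          // no terminating newline, so this line does not match
            pos = newline + 1;
        }
        return pos;
    }

    // Counts tokens, refusing to continue when the rule matched nothing.
    public static int CountTokens(string input)
    {
        int count = 0;
        int pos = 0;
        while (pos < input.Length)
        {
            int end = MatchSkippedLines(input, pos);
            if (end == pos)
                throw new InvalidOperationException(
                    "Zero-length token at offset " + pos + "; without this check the loop would never advance.");
            pos = end;
            count++;
        }
        return count;
    }
}
```

Because the rule body is wrapped in ( ... )*, it succeeds without consuming
anything when the next character is '#', and a token loop with no guard would
keep producing the same empty token. Requiring at least one iteration, or at
least one character on every path, avoids that.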

Sam

From: chris king [mailto:kingces95 at gmail.com] 
Sent: Thursday, July 28, 2011 4:19 PM
To: antlr-interest at antlr.org; Sam Harwell
Subject: Have I found an Antlr CSharp3 lexer bug if...

Have I found an Antlr lexer CSharp3 bug if I can alter program execution
(exception instead of no exception) by introducing a lexer production with a
predicate that is always false? For example

PP_SKIPPED_CHARACTERS
  : { false }? ( ~(F_NEW_LINE_CHARACTER | '#') F_INPUT_CHARACTER* F_NEW_LINE )*
  ;

I would think that such a production should always be ignored because its
predicate is always false, and therefore it would never alter program
execution. Yet I'm seeing a change in the execution of my program: it enters
this function and throws a FailedPredicateException. I wouldn't have expected
this function ever to be executed, because the predicate is always false.

     [GrammarRule("PP_SKIPPED_CHARACTERS")]
     private void mPP_SKIPPED_CHARACTERS()
     {
          EnterRule_PP_SKIPPED_CHARACTERS();
          EnterRule("PP_SKIPPED_CHARACTERS", 31);
          TraceIn("PP_SKIPPED_CHARACTERS", 31);
          try
          {
              int _type = PP_SKIPPED_CHARACTERS;
              int _channel = DefaultTokenChannel;
              // CSharp\\CSharpPreProcessor.g:197:3: ({...}? (~ ( F_NEW_LINE_CHARACTER | F_POUND_SIGN ) ( F_INPUT_CHARACTER )
              DebugEnterAlt(1);
              // CSharp\\CSharpPreProcessor.g:197:5: {...}? (~ ( F_NEW_LINE_CHARACTER | F_POUND_SIGN ) ( F_INPUT_CHARACTER )
              {
              DebugLocation(197, 5);
              if (!(( false )))
              {
                   throw new FailedPredicateException(input, "PP_SKIPPED_CHARACTERS", " False() ");
              }

Sam, I'm on an all CSharp stack v3.3.1.7705. I'm using your VS plugin (which
is wonderful) and build integration to generate the lexer/parser (also
wonderful) and then running on top of your CSharp port of the runtime. If
you think this is a bug and you'd like to have a look at the repro please
let me know. The project is open source up on CodePlex. 

Thanks,
Chris