[antlr-interest] Have I found an Antlr CSharp3 lexer bug if...

Fri Jul 29 09:59:52 PDT 2011

Yes - it is not too difficult, but it takes some thinking through. I have
a commercially available C# 3.x lexer/parser/tree walker that includes
pre-processing, if you are interested in that.

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org
> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Sam Harwell
> Sent: Thursday, July 28, 2011 7:26 PM
> To: 'chris king'
> Cc: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Have I found an Antlr CSharp3 lexer bug if...
>
> Fortunately the C# preprocessor is extremely basic, so the task
> shouldn't be hard at all. To start with, it's important to understand
> that the preprocessor *must* be implemented with the lexer, because the
> following is valid:
>
> #if false
> @"
> #endif
>
> If the @" is tokenized as ordinary C# code, the lexer will consume the
> #endif as part of the verbatim string and mess everything up. Here's
> what I would do:
>
>
>
>   * Implement a basic lexer rule to consume the characters following
>     the #directive, up to but not including a single line comment
>     marker //
>   * Use a separate expression grammar to parse preprocessor
>     expressions.
>   * Set a flag in the lexer if the next code is excluded code.
>   * Override NextToken for the lexer, and if the flag is set to true,
>     call out to a rule other than mTokens (a basic implementation of
>     lexer modes).
>
>
>
> When I release version 3.4 of the runtime, the Lexer class will have a
> new method, ParseNextToken, which can be overridden to help perform
> this task. I haven't tested the following, but it's what I would start
> with if I wanted to make a C# preprocessor.
>
>
>
> fragment PP_DEFINE:;
> fragment PP_UNDEF:;
> fragment PP_IF:;
> fragment PP_ELSE:;
> fragment PP_ENDIF:;
>
> PP_TOKEN
>         :       {input.CharPositionInLine == 0}? =>
>                 WS? '#' WS?
>                 (       'define' {$type=PP_DEFINE;}
>                 |       'undef' {$type=PP_UNDEF;}
>                 |       'if'    {$type=PP_IF;}
>                 |       'else'  {$type=PP_ELSE;}
>                 |       'endif' {$type=PP_ENDIF;}
>                 )
>                 ~('\r' | '\n' | '/')*
>         ;
>
> fragment
> EXCLUDED_CODE
>         :       PP_TOKEN
>         |       (       WS
>                 |       ~(' ' | '\t' | '#')+
>                 )
>                 {state.type = EXCLUDED_CODE; state.channel = Hidden;}
>         ;
>
> WS
>         :       (' ' | '\t')+
>         ;
>
>
>
>
>
>
>
>
>
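> using System;
> using System.Collections.Generic;
> using System.Text.RegularExpressions;
> using Antlr.Runtime;
>
> partial class CSharpLexer {
>
> // Defined symbols; "true" is always present.
> private readonly HashSet<string> _definitions =
>     new HashSet<string>(new string[] { "true" });
>
> // One entry per open #if, on top of a base entry for the file itself.
> private readonly Stack<IncludedCodeState> _includedCode =
>     new Stack<IncludedCodeState>(new IncludedCodeState[] {
>         new IncludedCodeState(true, true) });
>
> // Set once a token on the default channel has been returned.
> private bool _foundToken = false;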
>
>
>
> public override IToken NextToken() {
>     // Directive tokens are handled here and never returned to the caller.
>     while (true) {
>         IToken token = base.NextToken();
>
>         switch (token.Type) {
>         case PP_DEFINE:
>             if (_includedCode.Peek().IsIncluded)
>             {
>                 if (_foundToken)
>                     throw new RecognitionException(
>                         "Cannot define/undefine preprocessor symbols after first token in file");
>
>                 string name = token.Text;
>                 name = name.Substring(name.IndexOf("define") + 6).Trim();
>                 if (name == "true" || !Regex.IsMatch(name, @"^[A-Za-z_][\w]*$"))
>                     throw new RecognitionException("Expected identifier in preprocessor.");
>
>                 _definitions.Add(name);
>             }
>
>             continue;
>
>         case PP_UNDEF:
>             if (_includedCode.Peek().IsIncluded)
>             {
>                 if (_foundToken)
>                     throw new RecognitionException(
>                         "Cannot define/undefine preprocessor symbols after first token in file");
>
>                 string name = token.Text;
>                 name = name.Substring(name.IndexOf("undef") + 5).Trim();
>                 if (name == "true" || !Regex.IsMatch(name, @"^[A-Za-z_][\w]*$"))
>                     throw new RecognitionException("Expected identifier in preprocessor.");
>
>                 _definitions.Remove(name);
>             }
>
>             continue;
>
>         case PP_IF:
>             {
>                 string expression = token.Text;
>                 expression = expression.Substring(expression.IndexOf("if") + 2).Trim();
>                 _includedCode.Push(new IncludedCodeState(
>                     EvaluatePreprocessorExpression(expression), false));
>             }
>             continue;
>
>         case PP_ENDIF:
>             // The base entry must stay on the stack.
>             if (_includedCode.Count == 1)
>                 throw new RecognitionException("Mismatched #endif in preprocessor.");
>             _includedCode.Pop();
>             continue;
>
>         case PP_ELSE:
>             if (_includedCode.Peek().FoundElseDirective)
>                 throw new RecognitionException("Mismatched #else in preprocessor.");
>             _includedCode.Push(_includedCode.Pop().ElseState);
>             continue;
>
>         default:
>             if (token.Channel == TokenChannels.Default)
>                 _foundToken = true;
>             return token;
>         }
>     }
> }
>
>
>
> private bool? EvaluatePreprocessorExpression(string expression) {
>     // Inside excluded code the expression is not evaluated; null keeps
>     // both the #if and the #else branch excluded.
>     if (!_includedCode.Peek().IsIncluded)
>         return null;
>     throw new NotImplementedException(
>         "Use a very simple expression parser here to parse and evaluate the Boolean expression.");
> }
>
> protected override void ParseNextToken() {
>     if (!_includedCode.Peek().IsIncluded)
>         mEXCLUDED_CODE();
>     else
>         base.ParseNextToken();
> }
>
>
>
> public struct IncludedCodeState {
>     public readonly bool FoundElseDirective;
>     private readonly bool? _isIncluded;
>
>     public IncludedCodeState(bool? isIncluded, bool foundElseDirective) {
>         _isIncluded = isIncluded;
>         FoundElseDirective = foundElseDirective;
>     }
>
>     public bool IsIncluded { get { return _isIncluded ?? false; } }
>
>     public IncludedCodeState ElseState {
>         get {
>             if (_isIncluded == null)
>                 return new IncludedCodeState(_isIncluded, true);
>             return new IncludedCodeState(!_isIncluded, true);
>         }
>     }
> }
> }
>
>
>
> Sam
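
The EvaluatePreprocessorExpression method above is left unimplemented. What
follows is only a minimal sketch of one way to fill it in, assuming a small
hand-written recursive-descent evaluator instead of the separate expression
grammar suggested earlier. The class name PreprocessorExpressionEvaluator
and all of its members are hypothetical and are not part of the ANTLR
runtime. It handles ||, &&, ==, !=, !, parentheses, true, false and symbol
names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the ANTLR runtime).
public sealed class PreprocessorExpressionEvaluator
{
    private readonly ICollection<string> _definitions;
    private string _text = "";
    private int _pos;

    public PreprocessorExpressionEvaluator(ICollection<string> definitions)
    {
        _definitions = definitions;
    }

    public bool Evaluate(string expression)
    {
        _text = expression;
        _pos = 0;
        bool result = ParseOr();
        SkipWhitespace();
        if (_pos != _text.Length)
            throw new FormatException("Unexpected input at offset " + _pos);
        return result;
    }

    // or : and ('||' and)*
    private bool ParseOr()
    {
        bool value = ParseAnd();
        while (TryMatch("||")) { bool right = ParseAnd(); value = value || right; }
        return value;
    }

    // and : equality ('&&' equality)*
    private bool ParseAnd()
    {
        bool value = ParseEquality();
        while (TryMatch("&&")) { bool right = ParseEquality(); value = value && right; }
        return value;
    }

    // equality : unary (('==' | '!=') unary)*
    private bool ParseEquality()
    {
        bool value = ParseUnary();
        while (true)
        {
            if (TryMatch("==")) { bool right = ParseUnary(); value = value == right; }
            else if (TryMatch("!=")) { bool right = ParseUnary(); value = value != right; }
            else return value;
        }
    }

    // unary : '!' unary | '(' or ')' | symbol
    private bool ParseUnary()
    {
        if (TryMatch("!")) return !ParseUnary();
        if (TryMatch("("))
        {
            bool inner = ParseOr();
            if (!TryMatch(")"))
                throw new FormatException("Expected ')' at offset " + _pos);
            return inner;
        }
        int start = _pos;
        if (_pos < _text.Length && (char.IsLetter(_text[_pos]) || _text[_pos] == '_'))
        {
            _pos++;
            while (_pos < _text.Length &&
                   (char.IsLetterOrDigit(_text[_pos]) || _text[_pos] == '_'))
                _pos++;
        }
        if (start == _pos)
            throw new FormatException("Expected symbol at offset " + _pos);
        string name = _text.Substring(start, _pos - start);
        if (name == "false") return false;
        return name == "true" || _definitions.Contains(name);
    }

    // Skips whitespace, then consumes op if it is next in the input.
    private bool TryMatch(string op)
    {
        SkipWhitespace();
        if (string.CompareOrdinal(_text, _pos, op, 0, op.Length) != 0) return false;
        _pos += op.Length;
        return true;
    }

    private void SkipWhitespace()
    {
        while (_pos < _text.Length && char.IsWhiteSpace(_text[_pos])) _pos++;
    }
}
```

Under these assumptions, the lexer could construct the evaluator with its
_definitions set and return the result of Evaluate(expression) instead of
throwing NotImplementedException. The text captured by PP_TOKEN may still
carry trailing whitespace, which the sketch skips.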
>
>
>
> From: chris king [mailto:kingces95 at gmail.com]
> Sent: Thursday, July 28, 2011 7:05 PM
> To: Sam Harwell
> Cc: antlr-interest at antlr.org
> Subject: Re: Have I found an Antlr CSharp3 lexer bug if...
>
>
>
> Sam, thanks so much for taking the time to look at that. If I could,
> let me try to explain what I'm trying to do, and you can tell me if you
> think it's possible. For my own edification, I'm trying to implement a
> C# grammar, and at the moment I'd like to implement the pre-processor.
> Implementations I've seen generally use only a lexer plus some type of
> trick to maintain a stack (e.g. for nested ifdefs and simple if/elif
> expressions). I figure why not use a parser to maintain the stack --
> isn't that the reason parsers exist anyway? So that's what I'm trying
> to do -- use a lexer and parser to implement the pre-processor.
>
>
>
> The big difficulty is changing the lexer rules depending on whether I'm
> in an #if block that is active or not. I figured with ANTLR I'd be able
> to compute whether the #if block is active and then throw a switch to
> either lex tokens and hand them off to the C# parser, or consume and
> ignore all input up to the next pre-processor instruction, thereby
> disabling that chunk of code. If I can do this then I could put the
> pre-processor and parser in the same file and construct the AST in one
> pass! Would that be cool? And clean? And maybe worth making a goal for
> ANTLR to be able to do? :)
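
The switch described in the paragraph above can be illustrated without
ANTLR. This is a rough sketch, assuming a line-based view of the source
and conditions that are a single symbol name; a real C# lexer works on
characters and must evaluate full expressions. ConditionalFilter and its
members are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, line-based illustration of the active/inactive switch.
public static class ConditionalFilter
{
    private struct Frame
    {
        public bool ParentActive; // was the enclosing region active?
        public bool Active;       // is the current branch active?
        public bool SawElse;      // has #else already been seen?
    }

    // Returns only the lines that sit in active regions.
    public static List<string> Filter(IEnumerable<string> lines,
                                      ICollection<string> defined)
    {
        var output = new List<string>();
        var stack = new Stack<Frame>();
        bool active = true;

        foreach (string line in lines)
        {
            string trimmed = line.Trim();
            if (trimmed.StartsWith("#if ", StringComparison.Ordinal))
            {
                // A nested #if is only active if its parent is active too.
                string symbol = trimmed.Substring(4).Trim();
                bool branch = active && defined.Contains(symbol);
                stack.Push(new Frame { ParentActive = active, Active = branch });
                active = branch;
            }
            else if (trimmed == "#else")
            {
                if (stack.Count == 0 || stack.Peek().SawElse)
                    throw new FormatException("Mismatched #else");
                Frame frame = stack.Pop();
                frame.Active = frame.ParentActive && !frame.Active;
                frame.SawElse = true;
                stack.Push(frame);
                active = frame.Active;
            }
            else if (trimmed == "#endif")
            {
                if (stack.Count == 0)
                    throw new FormatException("Mismatched #endif");
                active = stack.Pop().ParentActive;
            }
            else if (active)
            {
                output.Add(line);
            }
            // Inactive lines are dropped without being examined further.
        }

        if (stack.Count != 0)
            throw new FormatException("Missing #endif");
        return output;
    }
}
```

Because inactive lines are never inspected, a stray @" inside a disabled
block cannot swallow the #endif that follows it.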
>
>
>
> To be a bit more concrete: here is the production for matching the
> newline at the end of pre-processor instructions. The idea would be to
> enable PP_SKIPPED_CHARACTERS only inside a disabled #if block, where it
> would consume all characters up to the next pre-processing instruction.
>
>
>
> pp_new_line
>   : SINGLE_LINE_COMMENT? ((NEW_LINE! PP_SKIPPED_CHARACTERS) | EOF!)
>   ;
>
>
>
> Here is what I was hoping would work as PP_SKIPPED_CHARACTERS.
> Unfortunately I don't seem to understand how to flip lexer rules on and
> off well enough to make this work...
>
>
>
> PP_SKIPPED_CHARACTERS
>   : { IfDefedOut }? ( ~(F_NEW_LINE_CHARACTER | F_POUND_SIGN) F_INPUT_CHARACTER* F_NEW_LINE )*
>   ;
>
>
>
> I hope that is enough to give you an idea of what I'm trying to do.
> This approach just seems so elegant to me (by which I mean almost all
> declarative -- no need to sprinkle procedural logic in among my
> productions to maintain a stack or whatever) that I'd hope it would be
> doable in ANTLR. What do you think? Is it a worthy goal? Does it feel
> possible to you? If not, is it a goal worth trying to achieve?
>
>
>
> Thanks,
> Chris
>
>
>
>
>
>
>
> On Thu, Jul 28, 2011 at 2:37 PM, Sam Harwell
> <sharwell at pixelminegames.com> wrote:
>
> Hi Chris,
>
>
>
> Lookahead prediction occurs before predicates are evaluated. If fixed
> lookahead uniquely determines the alternative with a semantic
> predicate, the predicate will not be evaluated as part of the decision
> process; it is only checked after the rule has been entered, which is
> where the FailedPredicateException comes from. I'm guessing (but not
> 100% sure) that if you use a gated semantic predicate, the lexer will
> not enter the rule:
>
>
>
> PP_SKIPPED_CHARACTERS
>   : {false}? => ( ~(F_NEW_LINE_CHARACTER | '#') F_INPUT_CHARACTER* F_NEW_LINE )*
>   ;
>
>
>
> Also, a word of warning: this lexer rule can match a zero-length
> character span, which could result in an infinite loop. You should
> always ensure that every path through any lexer rule that's not marked
> "fragment" will consume at least 1 character. There's also a bug with
> certain exceptions in the lexer that can cause infinite loops - this
> has been resolved for release 3.4 but I haven't released it yet.
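
The zero-length warning above can be illustrated with a toy token loop.
This is only a sketch under simplifying assumptions: a "rule" is modelled
as a function that returns how many characters it consumes, and the loop
throws instead of spinning forever. ProgressGuardDemo and its members are
hypothetical names and do not reflect how the ANTLR runtime is written.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical toy lexer loop (not the ANTLR runtime).
public static class ProgressGuardDemo
{
    // Returns the number of characters the rule consumes at pos.
    public delegate int Matcher(string input, int pos);

    public static List<string> Tokenize(string input, Matcher rule)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < input.Length)
        {
            int length = rule(input, pos);
            // Without this check a zero-length match would repeat forever,
            // because pos never advances.
            if (length <= 0)
                throw new InvalidOperationException(
                    "Rule matched zero characters at offset " + pos);
            tokens.Add(input.Substring(pos, length));
            pos += length;
        }
        return tokens;
    }

    // Like (~'#')* : matches nothing when the next character is '#'.
    public static int ZeroOrMoreNonHash(string input, int pos)
    {
        int end = pos;
        while (end < input.Length && input[end] != '#') end++;
        return end - pos;
    }

    // Every path consumes at least one character: a lone '#' is its own token.
    public static int AtLeastOne(string input, int pos)
    {
        int length = ZeroOrMoreNonHash(input, pos);
        return length > 0 ? length : 1;
    }
}
```

With the input "ab#cd", the AtLeastOne rule yields the tokens "ab", "#"
and "cd", while the star-style rule stalls at the '#' and trips the guard.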
>
>
>
> Sam
>
>
>
> From: chris king [mailto:kingces95 at gmail.com]
> Sent: Thursday, July 28, 2011 4:19 PM
> To: antlr-interest at antlr.org; Sam Harwell
> Subject: Have I found an Antlr CSharp3 lexer bug if...
>
>
>
> Have I found an Antlr CSharp3 lexer bug if I can alter program
> execution (exception instead of no exception) by introducing a lexer
> production with a predicate that is always false? For example:
>
>
>
> PP_SKIPPED_CHARACTERS
>   : { false }? ( ~(F_NEW_LINE_CHARACTER | '#') F_INPUT_CHARACTER* F_NEW_LINE )*
>   ;
>
>
>
> I would think that such a production should always be ignored because
> its predicate is always false, and therefore it would never alter
> program execution. Yet I'm seeing a change in the execution of my
> program: it enters this function and throws a FailedPredicateException.
> I wouldn't have expected this function ever to be executed, because the
> predicate is always false.
>
>
>
>      [GrammarRule("PP_SKIPPED_CHARACTERS")]
>      private void mPP_SKIPPED_CHARACTERS()
>      {
>           EnterRule_PP_SKIPPED_CHARACTERS();
>           EnterRule("PP_SKIPPED_CHARACTERS", 31);
>           TraceIn("PP_SKIPPED_CHARACTERS", 31);
>           try
>           {
>               int _type = PP_SKIPPED_CHARACTERS;
>               int _channel = DefaultTokenChannel;
>               // CSharp\\CSharpPreProcessor.g:197:3: ({...}? (~ ( F_NEW_LINE_CHARACTER | F_POUND_SIGN ) ( F_INPUT_CHARACTER )
>               DebugEnterAlt(1);
>               // CSharp\\CSharpPreProcessor.g:197:5: {...}? (~ ( F_NEW_LINE_CHARACTER | F_POUND_SIGN ) ( F_INPUT_CHARACTER )
>               {
>               DebugLocation(197, 5);
>               if (!(( false )))
>               {
>                    throw new FailedPredicateException(input, "PP_SKIPPED_CHARACTERS", " False() ");
>               }
>
>
>
> Sam, I'm on an all CSharp stack v3.3.1.7705. I'm using your VS plugin
> (which is wonderful) and build integration to generate the lexer/parser
> (also wonderful) and then running on top of your CSharp port of the
> runtime. If you think this is a bug and you'd like to have a look at
> the repro please let me know. The project is open source up on
> CodePlex.
>
>
>
> Thanks,
> Chris
>
>
>
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address