[antlr-interest] Have I found an Antlr CSharp3 lexer bug if...

Thu Jul 28 23:42:10 PDT 2011

Sam, thanks again for taking the time to put that together for me. I owe you
a beer. And if I had a deadline to meet I'd copy paste your code and move
on. But what did Terence say in his book? Why spend 5 minutes doing
something procedurally when you can spend 5 weeks doing it declaratively?
For better or for worse, I think I suffer from the same affliction. :) So
please allow me to play Devils Advocate for a second and try and propose a
declarative solution using both lexer and parser (and possibly requiring new
ANTLR declarations). And finally I'll hop up on my soap box and suggest that
such a solution is a worthy goal of ANTLR.

Basically the idea is to take what you've done procedurally, generalize it,
and expose it declaratively. To do this I'd introducing declarative syntax
which would allow lexer rules to be partitioned and then allow those
partitions to be toggled on and off based on context known to the parser.
Given that, I think the pre-processor could be done without procedural
logic. The partitioning syntax could be borrowed from C# custom attribute
syntax. So

// No declaration so in default partition
VERBATIM_STRING_LITERAL
  : '@"' F_VERBATIM_STRING_LITERAL_CHARACTER* '"'
  ;

[Partition("PP")] // Part of pre-processor partition
DEFINE: F_POUND_SIGN 'define';

[Partition("IfDefedOut")] // Only member of if-defed-out partition
PP_SKIPPED_CHARACTERS
  : ( ~(F_NEW_LINE_CHARACTER | F_POUND_SIGN) F_INPUT_CHARACTER* F_NEW_LINE
)*
  ;

Putting it together this is how it could solve the verbatim string problem:

#if false
@"
#endif

We start by partition our lexer rules into three groups. First (1) would be
pre-processor, second (2) C# grammer (including the verbatim string rule),
and third (3) would be the rule to "consume all text till the next pragma"
rule. After establishing the partitions we need to add logic to toggle them
on and off. We'd add that logic to the #if/#elif/#else parser rules. That
logic would detect that the #if false expression trivially evaluates to
false and would disable the (2) C# partition and activate the third (3)
"consume all text till the next pragma" partition thereby deactivating the
verbatim rule and allowing all text up to the #endif to be consumed.

Basically all this is repeating exactly what you've done procedurally. You
already established these groups in your solution. The first (1) is the set
of rules in the lexer, the second (2) is implemented separately and handed
tokens as they are produced by the pre-processor, and the third (3) is
implemented procedurally with the regexs. And you switch to that third (3)
partition and disable the second (2) at the point where you intercept the
the input stream. So both approaches are basically doing the same thing.

Finally, why would ANTLR want do do this? Well, because enabling this
scenario is right up ANTLRs design-philosophy-alley! ANTLR makes common
compiler building tasks simple. And it seems to me that pre-processors fall
into this bucket of
common-compiler-problems-that-could-have-simpler-solutions. And that the C#
pre-processor is so extremely basic, as you say, makes it ripe as a first
goal of a declarative solution. I know pre-processors have always been done
in the lexer but I got the feeling reading Terence's book that he was more
than happy to take a second look at common compiler problems and solve 'em
in different ways by introducing previously unheard of concepts (at least to
me): Lexers and parsers in one file. Semantic and syntactic
predicates. Infinite look ahead with back tracking. And my favorite example
of ANTLR simplifying the declarative syntax for a common parsing
problem is DELIMITED_COMMENT:
'/*' .* '*/'; Wow. Why wasn't it that easy with lex/yacc?! Maybe the time
has come for us to take a look at the idea that the pre-processor needs to
be done in the lexer and see if we can simplify that too!

So, what do you think? Can it be done?

And thanks again for your time and the VS tools!

Thanks,
Chris

On Thu, Jul 28, 2011 at 7:26 PM, Sam Harwell <sharwell at pixelminegames.com>wrote:

> Fortunately the C# preprocessor is extremely basic, so the task shouldn’t
> be hard at all. To start with, it’s important to understand that the
> preprocessor **must** be implemented with the lexer, because the following
> is valid:****
>
> ** **
>
> #if false****
>
> @"****
>
> #endif****
>
> ** **
>
> If the @" is processed by the lexer, it will consume the #endif as part of
> the verbatim string and mess everything up. Here’s what I would do:****
>
> ** **
>
> **·         **Implement a basic lexer rule to consume the characters
> following the #directive, up to but not including a single line comment
> marker //****
>
> **·         **Use a separate expression grammar to parse preprocessor
> expressions.****
>
> **·         **Set a flag in the lexer if the next code is excluded code.**
> **
>
> **·         **Override NextToken for the lexer, and if the flag is set to
> true, call out to a rule other than mTokens (a basic implementation of lexer
> modes).****
>
> ** **
>
> When I release version 3.4 of the runtime, the Lexer class has a new method
> ParseNextToken which can be overridden to help perform this task. I haven’t
> tested the following, but it’s what I would start with if I wanted to make a
> C# preprocessor.****
>
> ** **
>
> fragment PP_DEFINE:;****
>
> fragment PP_UNDEF:;****
>
> fragment PP_IF:;****
>
> fragment PP_ELSE:;****
>
> fragment PP_ENDIF:;****
>
> ** **
>
> PP_TOKEN****
>
>         :       {input.CharPositionInLine == 0}? =>****
>
>                 WS? '#' WS?****
>
>                 (       'define' {$type=PP_DEFINE;}****
>
>                 |       'undef' {$type=PP_UNDEF;}****
>
>                 |       'if'    {$type=PP_IF;}****
>
>                 |       'else'  {$type=PP_ELSE;}****
>
>                 |       'endif' {$type=PP_ENDIF;}****
>
>                 )****
>
>                 ~('\r' | '\n' | '/')*****
>
>         ;****
>
> ** **
>
> fragment****
>
> EXCLUDED_CODE****
>
>         :       PP_TOKEN****
>
>         |       (       WS****
>
>                 |       ~(' ' | '\t' | '#')+****
>
>                 )****
>
>                 {state.type = EXCLUDED_CODE; state.channel = Hidden;}****
>
>         ;****
>
> ** **
>
> WS****
>
>         :       (' ' | '\t')+****
>
>         ;****
>
> ** **
>
> ** **
>
> ** **
>
> ** **
>
> partial class CSharpLexer {****
>
> ** **
>
> private readonly HashSet<string> _definitions = new HashSet<string>(new
> string[] { "true" });****
>
> private readonly Stack<IncludedCodeState> _includedCode = new Stack<
> IncludedCodeState>(new IncludedCodeState[] { new IncludedCodeState(true,
> true) });****
>
> private bool _foundToken = false;****
>
> ** **
>
> public override IToken NextToken() {****
>
>     while (true) {****
>
>         IToken token = base.NextToken();****
>
> ** **
>
>         switch (token.Type) {****
>
>         case PP_DEFINE:****
>
>             if (_includedCode.Peek().IsIncluded)****
>
>             {****
>
>                 if (_foundToken)****
>
>                     throw new RecognitionException("Cannot define/undefine
> preprocessor symbols after first token in file");****
>
> ** **
>
>                 string name = token.Text;****
>
>                 name = name.Substring(name.IndexOf("define") + 6).Trim();*
> ***
>
>                 if (name == "true" || !Regex.IsMatch(name,
> @"^[A-Za-z_][\w]*$"))****
>
>                     throw new RecognitionException("Expected identifier in
> preprocessor.");****
>
> ** **
>
>                 _definitions.Add(name);****
>
>             }****
>
> ** **
>
>             continue;****
>
> ** **
>
>         case PP_UNDEF:****
>
>             if (_includedCode.Peek().IsIncluded)****
>
>             {****
>
>                 if (_foundToken)****
>
>                     throw new RecognitionException("Cannot define/undefine
> preprocessor symbols after first token in file");****
>
> ** **
>
>                 string name = token.Text;****
>
>                 name = name.Substring(name.IndexOf("undef") + 5).Trim();**
> **
>
>                 if (name == "true" || !Regex.IsMatch(name,
> @"^[A-Za-z_][\w]*$"))****
>
>                     throw new RecognitionException("Expected identifier in
> preprocessor.");****
>
> ** **
>
>                 _definitions.Remove(name);****
>
>             }****
>
> ** **
>
>             continue;****
>
> ** **
>
>         case PP_IF:****
>
>             {****
>
>                 string expression = token.Text;****
>
>                 expression = expression.Substring(expression.IndexOf("if")
> + 2).Trim();****
>
>                 _includedCode.Push(new IncludedCodeState(EvaluatePreprocessorExpression(expression),
> false));****
>
>             }****
>
>             continue;****
>
> ** **
>
>         case PP_ENDIF:****
>
>             if (_includedCode.Count == 1)****
>
>                 throw new RecognitionException("Mismatched #endif in
> preprocessor.");****
>
>             _includedCode.Pop();****
>
>             continue;****
>
> ** **
>
>         case PP_ELSE:****
>
>             if (_includedCode.Peek().FoundElseDirective)****
>
>                 throw new RecognitionException("Mismatched #else in
> preprocessor.");****
>
>             _includedCode.Push(_includedCode.Pop().ElseState);****
>
>             continue;****
>
> ** **
>
>         default:****
>
>             if (token.Channel == TokenChannels.Default)****
>
>                 _foundToken = true;****
>
>             return token;****
>
>         }****
>
>     }****
>
> }****
>
> ** **
>
> private bool? EvaluatePreprocessorExpression(string expression) {****
>
>     if (!_includedCode.Peek().IsIncluded)****
>
>         return null;****
>
>     throw new NotImplementedException("Use a very simple expression parser
> here to parse evaluate the Boolean expression.");****
>
> }****
>
> ** **
>
> protected override void ParseNextToken() {****
>
>     if (!_includedCode.Peek().IsIncluded)****
>
>         mEXCLUDED_CODE();****
>
>     else****
>
>         base.ParseNextToken();****
>
> }****
>
> ** **
>
> public struct IncludedCodeState {****
>
>     public readonly bool FoundElseDirective;****
>
>     private readonly bool? _isIncluded;****
>
> ** **
>
>     public IncludedCodeState(bool? isIncluded, bool foundElseDirective) {*
> ***
>
>         _isIncluded = isIncluded;****
>
>         FoundElseDirective = foundElseDirective;****
>
>     }****
>
> ** **
>
>     public bool IsIncluded { get { return _isIncluded ?? false; } }****
>
> ** **
>
>     public IncludedCodeState ElseState {****
>
>         get {****
>
>             if (_isIncluded == null)****
>
>                 return new IncludedCodeState(_isIncluded, true);****
>
>             return new IncludedCodeState(!_isIncluded, true);****
>
>         }****
>
>     }****
>
> }****
>
> }****
>
> ** **
>
> Sam****
>
> ** **
>
> *From:* chris king [mailto:kingces95 at gmail.com]
> *Sent:* Thursday, July 28, 2011 7:05 PM
> *To:* Sam Harwell
> *Cc:* antlr-interest at antlr.org
> *Subject:* Re: Have I found an Antlr CSharp3 lexer bug if...****
>
> ** **
>
> Sam, thanks so much for taking the time to look at that. If I could, let me
> try and explain what I'm trying to do and tell me if you think it's
> possible. For my own edification, I'm trying to implement a C# grammar. I'd
> like to implement the pre-processor at the moment. Implementations I've seen
> generally using only a lexer and use some type of trick to maintain a stack
> (e.g. for nested ifdefs and simple if/elif expressions). I figure why not
> use a parser to maintain the stack -- isn't that the reason
> for existence for parsers anyway? So that's what I'm trying to do -- use a
> lexer and parser to implement the pre-processor. ****
>
> ** **
>
> The big difficulty is changing the lexer rules depending on whether I'm in
> a #if def block that is active or not. I figured with ANTLR I'd be able to
> compute if the #ifdef block is active and then throw a switch to either
> parse tokens and hand those tokens off to the C# parser or consume and
> ignore all input up to the next pre-processor instruction thereby disabling
> that chunk of code. If I can do this then I could put the pre-processor and
> parser in the same file and construct the AST in one pass! Would that be
> cool? And clean? And maybe worth making a goal for ANTLR to be able to do?
> :)****
>
> ** **
>
> To be a bit more concrete: Here is the production for matching newline at
> the end of pre-processor instructions. The idea would be to enable
> PP_SKIPPED_CHARACTERS only if inside a disabling #ifdef block which would
> consume all characters till the next pre-processing instruction.****
>
> ** **
>
> pp_new_line****
>
>   : SINGLE_LINE_COMMENT? ((NEW_LINE! PP_SKIPPED_CHARACTERS) | EOF!)****
>
>   ;****
>
> ** **
>
> Here is what I was hoping would work as PP_SKIPPED_CHARACTERS. Unfortunately I
> don't seem to understand how to flip lexer rules on and off well enough to
> make this work...****
>
> ** **
>
> PP_SKIPPED_CHARACTERS****
>
>   : { IfDefedOut }? ( ~(F_NEW_LINE_CHARACTER | F_POUND_SIGN)
> F_INPUT_CHARACTER* F_NEW_LINE )*****
>
>   ;****
>
> ** **
>
> I hope that is enough to give you an idea of what I'm trying to do. This
> approach just seems so elegant to me (by which I mean almost all declarative
> -- no need to sprinkle procedural logic in among my productions to maintain
> a stack or whatever) that I'd hope that it would be do able in ANTLR. What
> do you think? Is it a worthy goal? Does it feel possible to you? If not, is
> a goal worth trying to achieve?****
>
> ** **
>
> Thanks,
> Chris****
>
> ** **
>
> ** **
>
> ** **
>
> On Thu, Jul 28, 2011 at 2:37 PM, Sam Harwell <sharwell at pixelminegames.com>
> wrote:****
>
> Hi Chris,****
>
>  ****
>
> Lookahead prediction occurs before predicates are evaluated. If fixed
> lookahead uniquely determines the alternative with a  semantic predicate,
> the predicate will not be evaluated as part of the decision process. I’m
> guessing (but not 100% sure) if you use a gated semantic predicate, then it
> will not be entering the rule:****
>
>  ****
>
> PP_SKIPPED_CHARACTERS****
>
>   : {false}? => ( ~(F_NEW_LINE_CHARACTER | '#') F_INPUT_CHARACTER*
> F_NEW_LINE )*****
>
>   ;****
>
>  ****
>
> Also, a word of warning: this lexer rule can match a zero-length character
> span, which could result in an infinite loop. You should always ensure that
> every path through any lexer rule that’s not marked “fragment” will consume
> at least 1 character. There’s also a bug with certain exceptions in the
> lexer that can cause infinite loops – this has been resolved for release 3.4
> but I haven’t released it yet.****
>
>  ****
>
> Sam****
>
>  ****
>
> *From:* chris king [mailto:kingces95 at gmail.com]
> *Sent:* Thursday, July 28, 2011 4:19 PM
> *To:* antlr-interest at antlr.org; Sam Harwell
> *Subject:* Have I found an Antlr CSharp3 lexer bug if...****
>
>  ****
>
> Have I found an Antlr lexer CSharp3 bug if I can alter program execution
> (exception instead of no exception) by introducing a lexer production with a
> predicate that is always false? For example****
>
>  ****
>
> PP_SKIPPED_CHARACTERS****
>
>   : { false }? ( ~(F_NEW_LINE_CHARACTER | '#') F_INPUT_CHARACTER*
> F_NEW_LINE )*****
>
>   ;****
>
>  ****
>
> I would think that such a production should always be ignored because it's
> predicate is always false and therefore would never alter program execution.
> Yet I'm seeing a change in the execution of my program. I'm seeing it enter
> this function and throw a FailedPredicateException. I wouldn't have expected
> that this function should ever even have been executed because the predicate
> is always false.****
>
>  ****
>
>      [GrammarRule("PP_SKIPPED_CHARACTERS")]****
>
>      private void mPP_SKIPPED_CHARACTERS()****
>
>      {****
>
>           EnterRule_PP_SKIPPED_CHARACTERS();****
>
>           EnterRule("PP_SKIPPED_CHARACTERS", 31);****
>
>           TraceIn("PP_SKIPPED_CHARACTERS", 31);****
>
>           try****
>
>           {****
>
>               int _type = PP_SKIPPED_CHARACTERS;****
>
>               int _channel = DefaultTokenChannel;****
>
>               // CSharp\\CSharpPreProcessor.g:197:3: ({...}? (~ (
> F_NEW_LINE_CHARACTER | F_POUND_SIGN ) ( F_INPUT_CHARACTER )****
>
>               DebugEnterAlt(1);****
>
>               // CSharp\\CSharpPreProcessor.g:197:5: {...}? (~ (
> F_NEW_LINE_CHARACTER | F_POUND_SIGN ) ( F_INPUT_CHARACTER )****
>
>               {****
>
>               DebugLocation(197, 5);****
>
>               if (!(( false )))****
>
>               {****
>
>                    throw new FailedPredicateException(input,
> "PP_SKIPPED_CHARACTERS", " False() ");****
>
>               }****
>
>  ****
>
> Sam, I'm on an all CSharp stack v3.3.1.7705. I'm using your VS plugin
> (which is wonderful) and build integration to generate the lexer/parser
> (also wonderful) and then running on top of your CSharp port of the runtime.
> If you think this is a bug and you'd like to have a look at the repro please
> let me know. The project is open source up on CodePlex. ****
>
>  ****
>
> Thanks,
> Chris****
>
> ** **
>