[antlr-interest] Have I found an Antlr CSharp3 lexer bug if...

Gary R. Van Sickle g.r.vansickle at att.net
Fri Jul 29 06:00:31 PDT 2011


Hi Chris,

What you're describing sounds a lot like the "Start Condition" feature 
implemented in lex (flex has it as well).

Quoting 
<http://flex.sourceforge.net/manual/Start-Conditions.html#Start-Conditions>:

"
10 Start Conditions
flex provides a mechanism for conditionally activating rules. Any rule whose 
pattern is prefixed with `<sc>' will only be active when the scanner is in 
the start condition named sc. For example,


         <STRING>[^"]*        { /* eat up the string body ... */
                     ...
                     }


will be active only when the scanner is in the STRING start condition, and



         <INITIAL,STRING,QUOTE>\.        { /* handle an escape ... */
                     ...
                     }


will be active only when the current start condition is either INITIAL, 
STRING, or QUOTE.
"

I say "sounds a lot like" vs. "is" because you refer to "...based on context 
known to the parser", which I read as the parser state feeds back into the 
lexer and affects the lexing.  I think in a flex/bison setup, bison does 
have the ability to alter the flex start condition (and supports this sort 
of parser-to-lexer communication in general), but ANTLR has historically 
religiously avoided this, primarily (as I understand it) due to complexities 
involving getting LL(*) to work with a non-static token stream.

I think what you're suggesting[1] though is entirely doable in the lexer, 
and doesn't require feedback from the parser.  If ANTLR doesn't have 
something like start conditions, you could relatively easily write a flex 
preprocessor to pre-tokenize your input.

-- 
Gary R. Van Sickle
[1] DISCLAIMER: It's early, I'm tired, and I haven't been closely following 
this thread, so I may very well have misunderstood what your suggestion is 
;-).

-----Original Message----- 
From: chris king
Sent: Friday, July 29, 2011 1:42 AM
To: Sam Harwell
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Have I found an Antlr CSharp3 lexer bug if...

Sam, thanks again for taking the time to put that together for me. I owe you
a beer. And if I had a deadline to meet I'd copy paste your code and move
on. But what did Terence say in his book? Why spend 5 minutes doing
something procedurally when you can spend 5 weeks doing it declaratively?
For better or for worse, I think I suffer from the same affliction. :) So
please allow me to play Devils Advocate for a second and try and propose a
declarative solution using both lexer and parser (and possibly requiring new
ANTLR declarations). And finally I'll hop up on my soap box and suggest that
such a solution is a worthy goal of ANTLR.

Basically the idea is to take what you've done procedurally, generalize it,
and expose it declaratively. To do this I'd introducing declarative syntax
which would allow lexer rules to be partitioned and then allow those
partitions to be toggled on and off based on context known to the parser.
Given that, I think the pre-processor could be done without procedural
logic. The partitioning syntax could be borrowed from C# custom attribute
syntax. So

// No declaration so in default partition
VERBATIM_STRING_LITERAL
  : '@"' F_VERBATIM_STRING_LITERAL_CHARACTER* '"'
  ;


[Partition("PP")] // Part of pre-processor partition
DEFINE: F_POUND_SIGN 'define';

[Partition("IfDefedOut")] // Only member of if-defed-out partition
PP_SKIPPED_CHARACTERS
  : ( ~(F_NEW_LINE_CHARACTER | F_POUND_SIGN) F_INPUT_CHARACTER* F_NEW_LINE
)*
  ;

Putting it together this is how it could solve the verbatim string problem:

#if false
@"
#endif


We start by partition our lexer rules into three groups. First (1) would be
pre-processor, second (2) C# grammer (including the verbatim string rule),
and third (3) would be the rule to "consume all text till the next pragma"
rule. After establishing the partitions we need to add logic to toggle them
on and off. We'd add that logic to the #if/#elif/#else parser rules. That
logic would detect that the #if false expression trivially evaluates to
false and would disable the (2) C# partition and activate the third (3)
"consume all text till the next pragma" partition thereby deactivating the
verbatim rule and allowing all text up to the #endif to be consumed.

Basically all this is repeating exactly what you've done procedurally. You
already established these groups in your solution. The first (1) is the set
of rules in the lexer, the second (2) is implemented separately and handed
tokens as they are produced by the pre-processor, and the third (3) is
implemented procedurally with the regexs. And you switch to that third (3)
partition and disable the second (2) at the point where you intercept the
the input stream. So both approaches are basically doing the same thing.


Finally, why would ANTLR want do do this? Well, because enabling this
scenario is right up ANTLRs design-philosophy-alley! ANTLR makes common
compiler building tasks simple. And it seems to me that pre-processors fall
into this bucket of
common-compiler-problems-that-could-have-simpler-solutions. And that the C#
pre-processor is so extremely basic, as you say, makes it ripe as a first
goal of a declarative solution. I know pre-processors have always been done
in the lexer but I got the feeling reading Terence's book that he was more
than happy to take a second look at common compiler problems and solve 'em
in different ways by introducing previously unheard of concepts (at least to
me): Lexers and parsers in one file. Semantic and syntactic
predicates. Infinite look ahead with back tracking. And my favorite example
of ANTLR simplifying the declarative syntax for a common parsing
problem is DELIMITED_COMMENT:
'/*' .* '*/'; Wow. Why wasn't it that easy with lex/yacc?! Maybe the time
has come for us to take a look at the idea that the pre-processor needs to
be done in the lexer and see if we can simplify that too!

So, what do you think? Can it be done?

And thanks again for your time and the VS tools!

Thanks,
Chris



On Thu, Jul 28, 2011 at 7:26 PM, Sam Harwell 
<sharwell at pixelminegames.com>wrote:

> Fortunately the C# preprocessor is extremely basic, so the task shouldn’t
> be hard at all. To start with, it’s important to understand that the
> preprocessor **must** be implemented with the lexer, because the following
> is valid:****
>
> ** **
>
> #if false****
>
> @"****
>
> #endif****
>
> ** **
>
> If the @" is processed by the lexer, it will consume the #endif as part of
> the verbatim string and mess everything up. Here’s what I would do:****
>
> ** **
>
> **·         **Implement a basic lexer rule to consume the characters
> following the #directive, up to but not including a single line comment
> marker //****
>
> **·         **Use a separate expression grammar to parse preprocessor
> expressions.****
>
> **·         **Set a flag in the lexer if the next code is excluded code.**
> **
>
> **·         **Override NextToken for the lexer, and if the flag is set to
> true, call out to a rule other than mTokens (a basic implementation of 
> lexer
> modes).****
>
> ** **
>
> When I release version 3.4 of the runtime, the Lexer class has a new 
> method
> ParseNextToken which can be overridden to help perform this task. I haven’t
> tested the following, but it’s what I would start with if I wanted to make 
> a
> C# preprocessor.****
>
> ** **
>
> fragment PP_DEFINE:;****
>
> fragment PP_UNDEF:;****
>
> fragment PP_IF:;****
>
> fragment PP_ELSE:;****
>
> fragment PP_ENDIF:;****
>
> ** **
>
> PP_TOKEN****
>
>         :       {input.CharPositionInLine == 0}? =>****
>
>                 WS? '#' WS?****
>
>                 (       'define' {$type=PP_DEFINE;}****
>
>                 |       'undef' {$type=PP_UNDEF;}****
>
>                 |       'if'    {$type=PP_IF;}****
>
>                 |       'else'  {$type=PP_ELSE;}****
>
>                 |       'endif' {$type=PP_ENDIF;}****
>
>                 )****
>
>                 ~('\r' | '\n' | '/')*****
>
>         ;****
>
> ** **
>
> fragment****
>
> EXCLUDED_CODE****
>
>         :       PP_TOKEN****
>
>         |       (       WS****
>
>                 |       ~(' ' | '\t' | '#')+****
>
>                 )****
>
>                 {state.type = EXCLUDED_CODE; state.channel = Hidden;}****
>
>         ;****
>
> ** **
>
> WS****
>
>         :       (' ' | '\t')+****
>
>         ;****
>
> ** **
>
> ** **
>
> ** **
>
> ** **
>
> partial class CSharpLexer {****
>
> ** **
>
> private readonly HashSet<string> _definitions = new HashSet<string>(new
> string[] { "true" });****
>
> private readonly Stack<IncludedCodeState> _includedCode = new Stack<
> IncludedCodeState>(new IncludedCodeState[] { new IncludedCodeState(true,
> true) });****
>
> private bool _foundToken = false;****
>
> ** **
>
> public override IToken NextToken() {****
>
>     while (true) {****
>
>         IToken token = base.NextToken();****
>
> ** **
>
>         switch (token.Type) {****
>
>         case PP_DEFINE:****
>
>             if (_includedCode.Peek().IsIncluded)****
>
>             {****
>
>                 if (_foundToken)****
>
>                     throw new RecognitionException("Cannot define/undefine
> preprocessor symbols after first token in file");****
>
> ** **
>
>                 string name = token.Text;****
>
>                 name = name.Substring(name.IndexOf("define") + 6).Trim();*
> ***
>
>                 if (name == "true" || !Regex.IsMatch(name,
> @"^[A-Za-z_][\w]*$"))****
>
>                     throw new RecognitionException("Expected identifier in
> preprocessor.");****
>
> ** **
>
>                 _definitions.Add(name);****
>
>             }****
>
> ** **
>
>             continue;****
>
> ** **
>
>         case PP_UNDEF:****
>
>             if (_includedCode.Peek().IsIncluded)****
>
>             {****
>
>                 if (_foundToken)****
>
>                     throw new RecognitionException("Cannot define/undefine
> preprocessor symbols after first token in file");****
>
> ** **
>
>                 string name = token.Text;****
>
>                 name = name.Substring(name.IndexOf("undef") + 5).Trim();**
> **
>
>                 if (name == "true" || !Regex.IsMatch(name,
> @"^[A-Za-z_][\w]*$"))****
>
>                     throw new RecognitionException("Expected identifier in
> preprocessor.");****
>
> ** **
>
>                 _definitions.Remove(name);****
>
>             }****
>
> ** **
>
>             continue;****
>
> ** **
>
>         case PP_IF:****
>
>             {****
>
>                 string expression = token.Text;****
>
>                 expression = expression.Substring(expression.IndexOf("if")
> + 2).Trim();****
>
>                 _includedCode.Push(new 
> IncludedCodeState(EvaluatePreprocessorExpression(expression),
> false));****
>
>             }****
>
>             continue;****
>
> ** **
>
>         case PP_ENDIF:****
>
>             if (_includedCode.Count == 1)****
>
>                 throw new RecognitionException("Mismatched #endif in
> preprocessor.");****
>
>             _includedCode.Pop();****
>
>             continue;****
>
> ** **
>
>         case PP_ELSE:****
>
>             if (_includedCode.Peek().FoundElseDirective)****
>
>                 throw new RecognitionException("Mismatched #else in
> preprocessor.");****
>
>             _includedCode.Push(_includedCode.Pop().ElseState);****
>
>             continue;****
>
> ** **
>
>         default:****
>
>             if (token.Channel == TokenChannels.Default)****
>
>                 _foundToken = true;****
>
>             return token;****
>
>         }****
>
>     }****
>
> }****
>
> ** **
>
> private bool? EvaluatePreprocessorExpression(string expression) {****
>
>     if (!_includedCode.Peek().IsIncluded)****
>
>         return null;****
>
>     throw new NotImplementedException("Use a very simple expression parser
> here to parse evaluate the Boolean expression.");****
>
> }****
>
> ** **
>
> protected override void ParseNextToken() {****
>
>     if (!_includedCode.Peek().IsIncluded)****
>
>         mEXCLUDED_CODE();****
>
>     else****
>
>         base.ParseNextToken();****
>
> }****
>
> ** **
>
> public struct IncludedCodeState {****
>
>     public readonly bool FoundElseDirective;****
>
>     private readonly bool? _isIncluded;****
>
> ** **
>
>     public IncludedCodeState(bool? isIncluded, bool foundElseDirective) {*
> ***
>
>         _isIncluded = isIncluded;****
>
>         FoundElseDirective = foundElseDirective;****
>
>     }****
>
> ** **
>
>     public bool IsIncluded { get { return _isIncluded ?? false; } }****
>
> ** **
>
>     public IncludedCodeState ElseState {****
>
>         get {****
>
>             if (_isIncluded == null)****
>
>                 return new IncludedCodeState(_isIncluded, true);****
>
>             return new IncludedCodeState(!_isIncluded, true);****
>
>         }****
>
>     }****
>
> }****
>
> }****
>
> ** **
>
> Sam****
>
> ** **
>
> *From:* chris king [mailto:kingces95 at gmail.com]
> *Sent:* Thursday, July 28, 2011 7:05 PM
> *To:* Sam Harwell
> *Cc:* antlr-interest at antlr.org
> *Subject:* Re: Have I found an Antlr CSharp3 lexer bug if...****
>
> ** **
>
> Sam, thanks so much for taking the time to look at that. If I could, let 
> me
> try and explain what I'm trying to do and tell me if you think it's
> possible. For my own edification, I'm trying to implement a C# grammar. 
> I'd
> like to implement the pre-processor at the moment. Implementations I've 
> seen
> generally using only a lexer and use some type of trick to maintain a 
> stack
> (e.g. for nested ifdefs and simple if/elif expressions). I figure why not
> use a parser to maintain the stack -- isn't that the reason
> for existence for parsers anyway? So that's what I'm trying to do -- use a
> lexer and parser to implement the pre-processor. ****
>
> ** **
>
> The big difficulty is changing the lexer rules depending on whether I'm in
> a #if def block that is active or not. I figured with ANTLR I'd be able to
> compute if the #ifdef block is active and then throw a switch to either
> parse tokens and hand those tokens off to the C# parser or consume and
> ignore all input up to the next pre-processor instruction thereby 
> disabling
> that chunk of code. If I can do this then I could put the pre-processor 
> and
> parser in the same file and construct the AST in one pass! Would that be
> cool? And clean? And maybe worth making a goal for ANTLR to be able to do?
> :)****
>
> ** **
>
> To be a bit more concrete: Here is the production for matching newline at
> the end of pre-processor instructions. The idea would be to enable
> PP_SKIPPED_CHARACTERS only if inside a disabling #ifdef block which would
> consume all characters till the next pre-processing instruction.****
>
> ** **
>
> pp_new_line****
>
>   : SINGLE_LINE_COMMENT? ((NEW_LINE! PP_SKIPPED_CHARACTERS) | EOF!)****
>
>   ;****
>
> ** **
>
> Here is what I was hoping would work as PP_SKIPPED_CHARACTERS. 
> Unfortunately I
> don't seem to understand how to flip lexer rules on and off well enough to
> make this work...****
>
> ** **
>
> PP_SKIPPED_CHARACTERS****
>
>   : { IfDefedOut }? ( ~(F_NEW_LINE_CHARACTER | F_POUND_SIGN)
> F_INPUT_CHARACTER* F_NEW_LINE )*****
>
>   ;****
>
> ** **
>
> I hope that is enough to give you an idea of what I'm trying to do. This
> approach just seems so elegant to me (by which I mean almost all 
> declarative
> -- no need to sprinkle procedural logic in among my productions to 
> maintain
> a stack or whatever) that I'd hope that it would be do able in ANTLR. What
> do you think? Is it a worthy goal? Does it feel possible to you? If not, 
> is
> a goal worth trying to achieve?****
>
> ** **
>
> Thanks,
> Chris****
>
> ** **
>
> ** **
>
> ** **
>
> On Thu, Jul 28, 2011 at 2:37 PM, Sam Harwell <sharwell at pixelminegames.com>
> wrote:****
>
> Hi Chris,****
>
>  ****
>
> Lookahead prediction occurs before predicates are evaluated. If fixed
> lookahead uniquely determines the alternative with a  semantic predicate,
> the predicate will not be evaluated as part of the decision process. I’m
> guessing (but not 100% sure) if you use a gated semantic predicate, then 
> it
> will not be entering the rule:****
>
>  ****
>
> PP_SKIPPED_CHARACTERS****
>
>   : {false}? => ( ~(F_NEW_LINE_CHARACTER | '#') F_INPUT_CHARACTER*
> F_NEW_LINE )*****
>
>   ;****
>
>  ****
>
> Also, a word of warning: this lexer rule can match a zero-length character
> span, which could result in an infinite loop. You should always ensure 
> that
> every path through any lexer rule that’s not marked “fragment” will 
> consume
> at least 1 character. There’s also a bug with certain exceptions in the
> lexer that can cause infinite loops – this has been resolved for release 
> 3.4
> but I haven’t released it yet.****
>
>  ****
>
> Sam****
>
>  ****
>
> *From:* chris king [mailto:kingces95 at gmail.com]
> *Sent:* Thursday, July 28, 2011 4:19 PM
> *To:* antlr-interest at antlr.org; Sam Harwell
> *Subject:* Have I found an Antlr CSharp3 lexer bug if...****
>
>  ****
>
> Have I found an Antlr lexer CSharp3 bug if I can alter program execution
> (exception instead of no exception) by introducing a lexer production with 
> a
> predicate that is always false? For example****
>
>  ****
>
> PP_SKIPPED_CHARACTERS****
>
>   : { false }? ( ~(F_NEW_LINE_CHARACTER | '#') F_INPUT_CHARACTER*
> F_NEW_LINE )*****
>
>   ;****
>
>  ****
>
> I would think that such a production should always be ignored because it's
> predicate is always false and therefore would never alter program 
> execution.
> Yet I'm seeing a change in the execution of my program. I'm seeing it 
> enter
> this function and throw a FailedPredicateException. I wouldn't have 
> expected
> that this function should ever even have been executed because the 
> predicate
> is always false.****
>
>  ****
>
>      [GrammarRule("PP_SKIPPED_CHARACTERS")]****
>
>      private void mPP_SKIPPED_CHARACTERS()****
>
>      {****
>
>           EnterRule_PP_SKIPPED_CHARACTERS();****
>
>           EnterRule("PP_SKIPPED_CHARACTERS", 31);****
>
>           TraceIn("PP_SKIPPED_CHARACTERS", 31);****
>
>           try****
>
>           {****
>
>               int _type = PP_SKIPPED_CHARACTERS;****
>
>               int _channel = DefaultTokenChannel;****
>
>               // CSharp\\CSharpPreProcessor.g:197:3: ({...}? (~ (
> F_NEW_LINE_CHARACTER | F_POUND_SIGN ) ( F_INPUT_CHARACTER )****
>
>               DebugEnterAlt(1);****
>
>               // CSharp\\CSharpPreProcessor.g:197:5: {...}? (~ (
> F_NEW_LINE_CHARACTER | F_POUND_SIGN ) ( F_INPUT_CHARACTER )****
>
>               {****
>
>               DebugLocation(197, 5);****
>
>               if (!(( false )))****
>
>               {****
>
>                    throw new FailedPredicateException(input,
> "PP_SKIPPED_CHARACTERS", " False() ");****
>
>               }****
>
>  ****
>
> Sam, I'm on an all CSharp stack v3.3.1.7705. I'm using your VS plugin
> (which is wonderful) and build integration to generate the lexer/parser
> (also wonderful) and then running on top of your CSharp port of the 
> runtime.
> If you think this is a bug and you'd like to have a look at the repro 
> please
> let me know. The project is open source up on CodePlex. ****
>
>  ****
>
> Thanks,
> Chris****
>
> ** **
>

List: http://www.antlr.org/mailman/listinfo/antlr-interest
Unsubscribe: 
http://www.antlr.org/mailman/options/antlr-interest/your-email-address




More information about the antlr-interest mailing list