[antlr-interest] Have I found an Antlr CSharp3 lexer bug if...
Sam Harwell
sharwell at pixelminegames.com
Thu Jul 28 19:26:28 PDT 2011
Fortunately the C# preprocessor is extremely basic, so the task shouldn't be
hard at all. To start with, it's important to understand that the
preprocessor *must* be implemented with the lexer, because the following is
valid:
#if false
@"
#endif
If the @" is processed by the lexer, it will consume the #endif as part of
the verbatim string and mess everything up. Here's what I would do:
. Implement a basic lexer rule to consume the characters following
the #directive, up to but not including a single line comment marker //
. Use a separate expression grammar to parse preprocessor
expressions.
. Set a flag in the lexer if the next code is excluded code.
. Override NextToken for the lexer, and if the flag is set to true,
call out to a rule other than mTokens (a basic implementation of lexer
modes).
When I release version 3.4 of the runtime, the Lexer class has a new method
ParseNextToken which can be overridden to help perform this task. I haven't
tested the following, but it's what I would start with if I wanted to make a
C# preprocessor.
fragment PP_DEFINE:;
fragment PP_UNDEF:;
fragment PP_IF:;
fragment PP_ELSE:;
fragment PP_ENDIF:;
PP_TOKEN
: {input.CharPositionInLine == 0}? =>
WS? '#' WS?
( 'define' {$type=PP_DEFINE;}
| 'undef' {$type=PP_UNDEF;}
| 'if' {$type=PP_IF;}
| 'else' {$type=PP_ELSE;}
| 'endif' {$type=PP_ENDIF;}
)
~('\r' | '\n' | '/')*
;
fragment
EXCLUDED_CODE
: PP_TOKEN
| ( WS
| ~(' ' | '\t' | '#')+
)
{state.type = EXCLUDED_CODE; state.channel = Hidden;}
;
WS
: (' ' | '\t')+
;
partial class CSharpLexer {
private readonly HashSet<string> _definitions = new HashSet<string>(new
string[] { "true" });
private readonly Stack<IncludedCodeState> _includedCode = new
Stack<IncludedCodeState>(new IncludedCodeState[] { new
IncludedCodeState(true, true) });
private bool _foundToken = false;
public override IToken NextToken() {
while (true) {
IToken token = base.NextToken();
switch (token.Type) {
case PP_DEFINE:
if (_includedCode.Peek().IsIncluded)
{
if (_foundToken)
throw new RecognitionException("Cannot define/undefine
preprocessor symbols after first token in file");
string name = token.Text;
name = name.Substring(name.IndexOf("define") + 6).Trim();
if (name == "true" || !Regex.IsMatch(name,
@"^[A-Za-z_][\w]*$"))
throw new RecognitionException("Expected identifier in
preprocessor.");
_definitions.Add(name);
}
continue;
case PP_UNDEF:
if (_includedCode.Peek().IsIncluded)
{
if (_foundToken)
throw new RecognitionException("Cannot define/undefine
preprocessor symbols after first token in file");
string name = token.Text;
name = name.Substring(name.IndexOf("undef") + 5).Trim();
if (name == "true" || !Regex.IsMatch(name,
@"^[A-Za-z_][\w]*$"))
throw new RecognitionException("Expected identifier in
preprocessor.");
_definitions.Remove(name);
}
continue;
case PP_IF:
{
string expression = token.Text;
expression = expression.Substring(expression.IndexOf("if") +
2).Trim();
_includedCode.Push(new
IncludedCodeState(EvaluatePreprocessorExpression(expression), false));
}
continue;
case PP_ENDIF:
if (_includedCode.Count == 1)
throw new RecognitionException("Mismatched #endif in
preprocessor.");
_includedCode.Pop();
continue;
case PP_ELSE:
if (_includedCode.Peek().FoundElseDirective)
throw new RecognitionException("Mismatched #else in
preprocessor.");
_includedCode.Push(_includedCode.Pop().ElseState);
continue;
default:
if (token.Channel == TokenChannels.Default)
_foundToken = true;
return token;
}
}
}
private bool? EvaluatePreprocessorExpression(string expression) {
if (!_includedCode.Peek().IsIncluded)
return null;
throw new NotImplementedException("Use a very simple expression parser
here to parse evaluate the Boolean expression.");
}
protected override void ParseNextToken() {
if (!_includedCode.Peek().IsIncluded)
mEXCLUDED_CODE();
else
base.ParseNextToken();
}
public struct IncludedCodeState {
public readonly bool FoundElseDirective;
private readonly bool? _isIncluded;
public IncludedCodeState(bool? isIncluded, bool foundElseDirective) {
_isIncluded = isIncluded;
FoundElseDirective = foundElseDirective;
}
public bool IsIncluded { get { return _isIncluded ?? false; } }
public IncludedCodeState ElseState {
get {
if (_isIncluded == null)
return new IncludedCodeState(_isIncluded, true);
return new IncludedCodeState(!_isIncluded, true);
}
}
}
}
Sam
From: chris king [mailto:kingces95 at gmail.com]
Sent: Thursday, July 28, 2011 7:05 PM
To: Sam Harwell
Cc: antlr-interest at antlr.org
Subject: Re: Have I found an Antlr CSharp3 lexer bug if...
Sam, thanks so much for taking the time to look at that. If I could, let me
try and explain what I'm trying to do and tell me if you think it's
possible. For my own edification, I'm trying to implement a C# grammar. I'd
like to implement the pre-processor at the moment. Implementations I've seen
generally using only a lexer and use some type of trick to maintain a stack
(e.g. for nested ifdefs and simple if/elif expressions). I figure why not
use a parser to maintain the stack -- isn't that the reason for existence
for parsers anyway? So that's what I'm trying to do -- use a lexer and
parser to implement the pre-processor.
The big difficulty is changing the lexer rules depending on whether I'm in a
#if def block that is active or not. I figured with ANTLR I'd be able to
compute if the #ifdef block is active and then throw a switch to either
parse tokens and hand those tokens off to the C# parser or consume and
ignore all input up to the next pre-processor instruction thereby disabling
that chunk of code. If I can do this then I could put the pre-processor and
parser in the same file and construct the AST in one pass! Would that be
cool? And clean? And maybe worth making a goal for ANTLR to be able to do?
:)
To be a bit more concrete: Here is the production for matching newline at
the end of pre-processor instructions. The idea would be to enable
PP_SKIPPED_CHARACTERS only if inside a disabling #ifdef block which would
consume all characters till the next pre-processing instruction.
pp_new_line
: SINGLE_LINE_COMMENT? ((NEW_LINE! PP_SKIPPED_CHARACTERS) | EOF!)
;
Here is what I was hoping would work as PP_SKIPPED_CHARACTERS. Unfortunately
I don't seem to understand how to flip lexer rules on and off well enough to
make this work...
PP_SKIPPED_CHARACTERS
: { IfDefedOut }? ( ~(F_NEW_LINE_CHARACTER | F_POUND_SIGN)
F_INPUT_CHARACTER* F_NEW_LINE )*
;
I hope that is enough to give you an idea of what I'm trying to do. This
approach just seems so elegant to me (by which I mean almost all declarative
-- no need to sprinkle procedural logic in among my productions to maintain
a stack or whatever) that I'd hope that it would be do able in ANTLR. What
do you think? Is it a worthy goal? Does it feel possible to you? If not, is
a goal worth trying to achieve?
Thanks,
Chris
On Thu, Jul 28, 2011 at 2:37 PM, Sam Harwell <sharwell at pixelminegames.com>
wrote:
Hi Chris,
Lookahead prediction occurs before predicates are evaluated. If fixed
lookahead uniquely determines the alternative with a semantic predicate,
the predicate will not be evaluated as part of the decision process. I'm
guessing (but not 100% sure) if you use a gated semantic predicate, then it
will not be entering the rule:
PP_SKIPPED_CHARACTERS
: {false}? => ( ~(F_NEW_LINE_CHARACTER | '#') F_INPUT_CHARACTER*
F_NEW_LINE )*
;
Also, a word of warning: this lexer rule can match a zero-length character
span, which could result in an infinite loop. You should always ensure that
every path through any lexer rule that's not marked "fragment" will consume
at least 1 character. There's also a bug with certain exceptions in the
lexer that can cause infinite loops - this has been resolved for release 3.4
but I haven't released it yet.
Sam
From: chris king [mailto:kingces95 at gmail.com]
Sent: Thursday, July 28, 2011 4:19 PM
To: antlr-interest at antlr.org; Sam Harwell
Subject: Have I found an Antlr CSharp3 lexer bug if...
Have I found an Antlr lexer CSharp3 bug if I can alter program execution
(exception instead of no exception) by introducing a lexer production with a
predicate that is always false? For example
PP_SKIPPED_CHARACTERS
: { false }? ( ~(F_NEW_LINE_CHARACTER | '#') F_INPUT_CHARACTER* F_NEW_LINE
)*
;
I would think that such a production should always be ignored because it's
predicate is always false and therefore would never alter program execution.
Yet I'm seeing a change in the execution of my program. I'm seeing it enter
this function and throw a FailedPredicateException. I wouldn't have expected
that this function should ever even have been executed because the predicate
is always false.
[GrammarRule("PP_SKIPPED_CHARACTERS")]
private void mPP_SKIPPED_CHARACTERS()
{
EnterRule_PP_SKIPPED_CHARACTERS();
EnterRule("PP_SKIPPED_CHARACTERS", 31);
TraceIn("PP_SKIPPED_CHARACTERS", 31);
try
{
int _type = PP_SKIPPED_CHARACTERS;
int _channel = DefaultTokenChannel;
// CSharp\\CSharpPreProcessor.g:197:3: ({...}? (~ (
F_NEW_LINE_CHARACTER | F_POUND_SIGN ) ( F_INPUT_CHARACTER )
DebugEnterAlt(1);
// CSharp\\CSharpPreProcessor.g:197:5: {...}? (~ (
F_NEW_LINE_CHARACTER | F_POUND_SIGN ) ( F_INPUT_CHARACTER )
{
DebugLocation(197, 5);
if (!(( false )))
{
throw new FailedPredicateException(input,
"PP_SKIPPED_CHARACTERS", " False() ");
}
Sam, I'm on an all CSharp stack v3.3.1.7705. I'm using your VS plugin (which
is wonderful) and build integration to generate the lexer/parser (also
wonderful) and then running on top of your CSharp port of the runtime. If
you think this is a bug and you'd like to have a look at the repro please
let me know. The project is open source up on CodePlex.
Thanks,
Chris
More information about the antlr-interest
mailing list