[antlr-interest] Have I found an Antlr CSharp3 lexer bug if...

Fri Jul 29 08:00:38 PDT 2011

For something simple like the C# preprocessor, it's easy to simply use lexer
modes. I'm pretty happy with the method I posted because it's allowed me to
integrate multiple lexers into a single "big" lexer almost as easily as it's
been to create lexer modes. For example, the following uses 4 lexers (HTML
text, HTML tags, PHP code, PHP doc comments):

http://visualstudiogallery.msdn.microsoft.com/2a10ba81-26c5-47d9-939b-6bcc7b
bec251/showImage/53676

The following uses 3 lexers (group, template text, template expression):

http://visualstudiogallery.msdn.microsoft.com/5ca30e58-96b4-4edf-b95e-3030da
f474ff/showImage/53651

Sam

From: chris king [mailto:kingces95 at gmail.com] 
Sent: Friday, July 29, 2011 1:42 AM
To: Sam Harwell
Cc: antlr-interest at antlr.org
Subject: Re: Have I found an Antlr CSharp3 lexer bug if...

Sam, thanks again for taking the time to put that together for me. I owe you
a beer. And if I had a deadline to meet I'd copy paste your code and move
on. But what did Terence say in his book? Why spend 5 minutes doing
something procedurally when you can spend 5 weeks doing it declaratively?
For better or for worse, I think I suffer from the same affliction. :) So
please allow me to play Devils Advocate for a second and try and propose a
declarative solution using both lexer and parser (and possibly requiring new
ANTLR declarations). And finally I'll hop up on my soap box and suggest that
such a solution is a worthy goal of ANTLR.

Basically the idea is to take what you've done procedurally, generalize it,
and expose it declaratively. To do this I'd introducing declarative syntax
which would allow lexer rules to be partitioned and then allow those
partitions to be toggled on and off based on context known to the parser.
Given that, I think the pre-processor could be done without procedural
logic. The partitioning syntax could be borrowed from C# custom attribute
syntax. So

// No declaration so in default partition

VERBATIM_STRING_LITERAL

  : '@"' F_VERBATIM_STRING_LITERAL_CHARACTER* '"'

  ;

[Partition("PP")] // Part of pre-processor partition

DEFINE: F_POUND_SIGN 'define';

[Partition("IfDefedOut")] // Only member of if-defed-out partition

PP_SKIPPED_CHARACTERS

  : ( ~(F_NEW_LINE_CHARACTER | F_POUND_SIGN) F_INPUT_CHARACTER* F_NEW_LINE
)*

  ;

Putting it together this is how it could solve the verbatim string problem:

#if false

@"

#endif

We start by partition our lexer rules into three groups. First (1) would be
pre-processor, second (2) C# grammer (including the verbatim string rule),
and third (3) would be the rule to "consume all text till the next pragma"
rule. After establishing the partitions we need to add logic to toggle them
on and off. We'd add that logic to the #if/#elif/#else parser rules. That
logic would detect that the #if false expression trivially evaluates to
false and would disable the (2) C# partition and activate the third (3)
"consume all text till the next pragma" partition thereby deactivating the
verbatim rule and allowing all text up to the #endif to be consumed.

Basically all this is repeating exactly what you've done procedurally. You
already established these groups in your solution. The first (1) is the set
of rules in the lexer, the second (2) is implemented separately and handed
tokens as they are produced by the pre-processor, and the third (3) is
implemented procedurally with the regexs. And you switch to that third (3)
partition and disable the second (2) at the point where you intercept the
the input stream. So both approaches are basically doing the same thing. 

Finally, why would ANTLR want do do this? Well, because enabling this
scenario is right up ANTLRs design-philosophy-alley! ANTLR makes common
compiler building tasks simple. And it seems to me that pre-processors fall
into this bucket of
common-compiler-problems-that-could-have-simpler-solutions. And that the C#
pre-processor is so extremely basic, as you say, makes it ripe as a first
goal of a declarative solution. I know pre-processors have always been done
in the lexer but I got the feeling reading Terence's book that he was more
than happy to take a second look at common compiler problems and solve 'em
in different ways by introducing previously unheard of concepts (at least to
me): Lexers and parsers in one file. Semantic and syntactic predicates.
Infinite look ahead with back tracking. And my favorite example of ANTLR
simplifying the declarative syntax for a common parsing problem is
DELIMITED_COMMENT: '/*' .* '*/'; Wow. Why wasn't it that easy with
lex/yacc?! Maybe the time has come for us to take a look at the idea that
the pre-processor needs to be done in the lexer and see if we can simplify
that too!

So, what do you think? Can it be done? 

And thanks again for your time and the VS tools!

Thanks,
Chris