[antlr-interest] C# parser grammar problem

Terence Parr parrt at cs.usfca.edu
Tue Mar 6 10:44:50 PST 2007


Hi.  That line in the code indicates a malformed \uxxxx char ref.  Do
you see one in your code?
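
A minimal sketch of how a short \u escape can end in this kind of exception, assuming the unescaping code reads the four hex digits with String.substring. The class and method below are hypothetical illustrations, not the actual code in Grammar.java:

```java
final class UnicodeEscapeDemo {
    // Hypothetical decoder: replaces each \uXXXX with the character it names.
    static String unescape(String literal) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i);
            if (c == '\\' && i + 1 < literal.length() && literal.charAt(i + 1) == 'u') {
                // Assumes four hex digits follow; substring throws if the
                // literal ends before index i + 6.
                String hex = literal.substring(i + 2, i + 6);
                out.append((char) Integer.parseInt(hex, 16));
                i += 6;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Well-formed escape: decodes to a single character.
        System.out.println(unescape("\\u0041"));
        try {
            // Malformed escape: "\u" with no hex digits after it.
            unescape("\\u");
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("malformed escape: " + e.getMessage());
        }
    }
}
```

Under that assumption, any literal that contains \u followed by fewer than four hex digits would fail in substring rather than produce a readable error message.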

Ter
On Mar 6, 2007, at 9:33 AM, Johannes Luber wrote:

> Hello,
>
> I've converted all the rules in chapter 9 of the Ecma334-PDF, so I
> wanted to check whether I have written the grammar correctly so far. The
> grammar check is successful, but I still can't generate the corresponding
> Java files. The console spits out the following exception:
>
> java.lang.StringIndexOutOfBoundsException: String index out of range: 7
> 	at java.lang.String.substring(Unknown Source)
> 	at org.antlr.tool.Grammar.getUnescapedStringFromGrammarStringLiteral(Grammar.java:1432)
> 	at org.antlr.tool.ANTLRLexer.mCHAR_LITERAL(ANTLRLexer.java:957)
> 	at org.antlr.tool.ANTLRLexer.nextToken(ANTLRLexer.java:215)
> 	at antlr.TokenStreamRewriteEngine.nextToken(TokenStreamRewriteEngine.java:161)
> 	at antlr.TokenBuffer.fill(TokenBuffer.java:69)
> 	at antlr.TokenBuffer.LA(TokenBuffer.java:80)
> 	at antlr.LLkParser.LA(LLkParser.java:52)
> 	at org.antlr.tool.ANTLRParser.ruleScopeSpec(ANTLRParser.java:1509)
> 	at org.antlr.tool.ANTLRParser.rule(ANTLRParser.java:1310)
> 	at org.antlr.tool.ANTLRParser.rules(ANTLRParser.java:702)
> 	at org.antlr.tool.ANTLRParser.grammar(ANTLRParser.java:392)
> 	at org.antlr.tool.Grammar.setGrammarContent(Grammar.java:507)
> 	at org.antlr.tool.Grammar.setGrammarContent(Grammar.java:484)
> 	at org.antlr.works.grammar.EngineGrammar.createNewGrammar(Unknown Source)
> 	at org.antlr.works.grammar.EngineGrammar.createCombinedGrammar(Unknown Source)
> 	at org.antlr.works.grammar.EngineGrammar.createGrammars(Unknown Source)
> 	at org.antlr.works.grammar.EngineGrammar.getParserGrammar(Unknown Source)
> 	at org.antlr.works.generate.CodeGenerate.getGrammarLanguage(Unknown Source)
> 	at org.antlr.works.menu.MenuGenerate.isKnownLanguage(Unknown Source)
> 	at org.antlr.works.menu.MenuGenerate.checkLanguage(Unknown Source)
> 	at org.antlr.works.menu.MenuGenerate.generateCodeProcessContinued(Unknown Source)
> 	at org.antlr.works.menu.MenuGenerate.checkGrammarDidEnd(Unknown Source)
> 	at org.antlr.works.grammar.CheckGrammar.run(Unknown Source)
> 	at java.lang.Thread.run(Unknown Source)
>
> I have no idea where my mistake could lie. I hope that someone can shed
> some light on this. The grammar is attached to the email.
>
> Thanks in advance,
> Johannes Luber
> /* By Johannes Luber, 2007. All rights reserved.
>
> Converted original grammar in Ecma 334 into ANTLR syntax, removed left recursion and
> collapsed rules like A:B?, B: C+ into A: C*.
>
> TBD: Convert rules containing only token references like 'd' or 'a' |'b' into lexer rules (ALL_UPPER_CASE)
>
> */
>
> grammar CSharp3;
>
> // Grammar Ambiguities described in §9.2.3 in Ecma 334
>
> // Intrinsic Datatypes: object, string, bool, char, decimal, sbyte, short,
> // int, long, byte, ushort, uint, ulong, float, double
>
> options {
> 	language=CSharp;
> 	output=template;
> 	//namespace	= "CSharpML.CSharpParser";
> }
>
> @header {
>
> }
>
> input
> 	:	input_section*
> 	;
>
>
> input_section
> 	:	input_element* NEW_LINE
> 	|	pp_directive
> 	;
>
>
> input_element
> 	:	whitespace
> 	|	comment
> 	|	token
> 	;
>
> whitespace
> 	:	WHITESPACE_CHARACTER*
> 	;
>
> fragment WHITESPACE_CHARACTER
> 	:	UNICODE_CLASS_Zs
> 	|	'\u0009' // Horizontal tab character
> 	|	'\u000B' // Vertical tab character
> 	|	'\u000C' // Form feed character
> 	;
>
> NEW_LINE
> 	:	'\u000D' // Carriage return character
> 	|	'\u000A' // Line feed character
> 	|	'\u000D\u000A' // Carriage return character followed by line feed character
> 	|	'\u0085' // Next line character
> 	|	'\u2028' // Line separator character
> 	|	'\u2029' // Paragraph separator character
> 	;
> 	
> comment
> 	:	single_line_comment
> 	|	delimited_comment
> 	;
> 	
> single_line_comment
> 	:	'//' INPUT_CHARACTER*
> 	;
> 	
> 	
> fragment INPUT_CHARACTER
> 	:	~NEW_LINE_CHARACTER // Any Unicode character except a new_line_character
> 	;
>
> NEW_LINE_CHARACTER
> 	:	'\u000D' // Carriage return character
> 	|	'\u000A' // Line feed character
> 	|	'\u0085' // Next line character
> 	|	'\u2028' // Line separator character
> 	|	'\u2029' // Paragraph separator character
> 	;
> 	
> delimited_comment
> 	:	'/*' DELIMITED_COMMENT_SECTION* ASTERISKS '/'
> 	;
> 		
> fragment DELIMITED_COMMENT_SECTION
> 	:	NOT_ASTERISK
> 	|	ASTERISKS NOT_SLASH
> 	;
> 	
> fragment ASTERISKS
> 	:	('*') ('*')*
> 	;
> 	
> fragment NOT_ASTERISK
> 	:	~'*' // Any Unicode character except *
> 	;
> 	
> fragment NOT_SLASH
> 	:	~'/' // Any Unicode character except /
> 	;
> 	
> fragment UNICODE_CLASS_Zs // Any character with Unicode class Zs (18 characters known)
> 	:	'\u0020' // SPACE
> 	|	'\u00A0' // NO_BREAK SPACE
> 	|	'\u1680' // OGHAM SPACE MARK
> 	|	'\u180E' // MONGOLIAN VOWEL SEPARATOR
> 	|	'\u2000' // EN QUAD
> 	|	'\u2001' // EM QUAD
> 	|	'\u2002' // EN SPACE
> 	|	'\u2003' // EM SPACE
> 	|	'\u2004' // THREE_PER_EM SPACE
> 	|	'\u2005' // FOUR_PER_EM SPACE
> 	|	'\u2006' // SIX_PER_EM SPACE
> 	|	'\u2008' // PUNCTUATION SPACE
> 	|	'\u2009' // THIN SPACE
> 	|	'\u200A' // HAIR SPACE
> 	|	'\u202F' // NARROW NO_BREAK SPACE
> 	|	'\u3000' // IDEOGRAPHIC SPACE
> 	|	'\u205F' // MEDIUM MATHEMATICAL SPACE
> 	;
>
> // TBD: Inclusion of all uppercase letter characters. Replace this rule with the one in UnicodeClassLu.g.
> fragment UNICODE_CLASS_Lu
> 	:	'\u0041'..'\u005A' // LATIN CAPITAL LETTER A_Z
> 	|	'\u00C0'..'\u00DE' // ACCENTED CAPITAL LETTERS
> 	;
>
> // TBD: Inclusion of all lowercase letter characters. Replace this rule with the one in UnicodeClassLl.g.
> fragment UNICODE_CLASS_Ll
> 	:	'\u0061'..'\u007A' // LATIN SMALL LETTER a_z
> 	;
>
> // TBD: Inclusion of all titlecase letter characters. Replace this rule with the one in UnicodeClassLt.g.
> fragment UNICODE_CLASS_Lt
> 	:	'\u01C5' // LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
> 	|	'\u01C8' // LATIN CAPITAL LETTER L WITH SMALL LETTER J
> 	|	'\u01CB' // LATIN CAPITAL LETTER N WITH SMALL LETTER J
> 	|	'\u01F2' // LATIN CAPITAL LETTER D WITH SMALL LETTER Z
> 	;
>
> // TBD: Inclusion of all modifier letter characters. Replace this rule with the one in UnicodeClassLm.g.
> fragment UNICODE_CLASS_Lm
> 	:	'\u02B0'..'\u02EE' // MODIFIER LETTERS
> 	;
>
> // TBD: Inclusion of all other letter characters. Replace this rule with the one in UnicodeClassLo.g.
> fragment UNICODE_CLASS_Lo
> 	:	'\u01BB' // LATIN LETTER TWO WITH STROKE
> 	|	'\u01C0' // LATIN LETTER DENTAL CLICK
> 	|	'\u01C1' // LATIN LETTER LATERAL CLICK
> 	|	'\u01C2' // LATIN LETTER ALVEOLAR CLICK
> 	|	'\u01C3' // LATIN LETTER RETROFLEX CLICK
> 	|	'\u0294' // LATIN LETTER GLOTTAL STOP
> 	;
>
> // TBD: Inclusion of all letter number characters. Replace this rule with the one in UnicodeClassNl.g.
> fragment UNICODE_CLASS_Nl
> 	:	'\u16EE' // RUNIC ARLAUG SYMBOL
> 	|	'\u16EF' // RUNIC TVIMADUR SYMBOL
> 	|	'\u16F0' // RUNIC BELGTHOR SYMBOL
> 	|	'\u2160' // ROMAN NUMERAL ONE
> 	|	'\u2161' // ROMAN NUMERAL TWO
> 	|	'\u2162' // ROMAN NUMERAL THREE
> 	|	'\u2163' // ROMAN NUMERAL FOUR
> 	|	'\u2164' // ROMAN NUMERAL FIVE
> 	|	'\u2165' // ROMAN NUMERAL SIX
> 	|	'\u2166' // ROMAN NUMERAL SEVEN
> 	|	'\u2167' // ROMAN NUMERAL EIGHT
> 	|	'\u2168' // ROMAN NUMERAL NINE
> 	|	'\u2169' // ROMAN NUMERAL TEN
> 	|	'\u216A' // ROMAN NUMERAL ELEVEN
> 	|	'\u216B' // ROMAN NUMERAL TWELVE
> 	|	'\u216C' // ROMAN NUMERAL FIFTY
> 	|	'\u216D' // ROMAN NUMERAL ONE HUNDRED
> 	|	'\u216E' // ROMAN NUMERAL FIVE HUNDRED
> 	|	'\u216F' // ROMAN NUMERAL ONE THOUSAND
> 	;
>
> // TBD: Inclusion of all non-spacing mark characters. Replace this rule with the one in UnicodeClassMn.g.
> fragment UNICODE_CLASS_Mn
> 	:	'\u0300' // COMBINING GRAVE ACCENT
> 	|	'\u0301' // COMBINING ACUTE ACCENT
> 	|	'\u0302' // COMBINING CIRCUMFLEX ACCENT
> 	|	'\u0303' // COMBINING TILDE
> 	|	'\u0304' // COMBINING MACRON
> 	|	'\u0305' // COMBINING OVERLINE
> 	|	'\u0306' // COMBINING BREVE
> 	|	'\u0307' // COMBINING DOT ABOVE
> 	|	'\u0308' // COMBINING DIAERESIS
> 	|	'\u0309' // COMBINING HOOK ABOVE
> 	|	'\u030A' // COMBINING RING ABOVE
> 	|	'\u030B' // COMBINING DOUBLE ACUTE ACCENT
> 	|	'\u030C' // COMBINING CARON
> 	|	'\u030D' // COMBINING VERTICAL LINE ABOVE
> 	|	'\u030E' // COMBINING DOUBLE VERTICAL LINE ABOVE
> 	|	'\u030F' // COMBINING DOUBLE GRAVE ACCENT
> 	|	'\u0310' // COMBINING CANDRABINDU
> 	;
>
> // TBD: Inclusion of all spacing combining mark characters. Replace this rule with the one in UnicodeClassMc.g.
> fragment UNICODE_CLASS_Mc
> 	:	'\u0903' // DEVANAGARI SIGN VISARGA
> 	|	'\u093E' // DEVANAGARI VOWEL SIGN AA
> 	|	'\u093F' // DEVANAGARI VOWEL SIGN I
> 	|	'\u0940' // DEVANAGARI VOWEL SIGN II
> 	|	'\u0949' // DEVANAGARI VOWEL SIGN CANDRA O
> 	|	'\u094A' // DEVANAGARI VOWEL SIGN SHORT O
> 	|	'\u094B' // DEVANAGARI VOWEL SIGN O
> 	|	'\u094C' // DEVANAGARI VOWEL SIGN AU
> 	;
>
> // TBD: Inclusion of all format characters. Replace this rule with the one in UnicodeClassCf.g.
> fragment UNICODE_CLASS_Cf
> 	:	'\u00AD' // SOFT HYPHEN
> 	|	'\u0600' // ARABIC NUMBER SIGN
> 	|	'\u0601' // ARABIC SIGN SANAH
> 	|	'\u0602' // ARABIC FOOTNOTE MARKER
> 	|	'\u0603' // ARABIC SIGN SAFHA
> 	|	'\u06DD' // ARABIC END OF AYAH
> 	;
>
> // This definition contains all known characters
> fragment UNICODE_CLASS_Pc
> 	:	'\u005F' // LOW LINE
> 	|	'\u203F' // UNDERTIE
> 	|	'\u2040' // CHARACTER TIE
> 	|	'\u2054' // INVERTED UNDERTIE
> 	|	'\uFE33' // PRESENTATION FORM FOR VERTICAL LOW LINE
> 	|	'\uFE34' // PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
> 	|	'\uFE4D' // DASHED LOW LINE
> 	|	'\uFE4E' // CENTRELINE LOW LINE
> 	|	'\uFE4F' // WAVY LOW LINE
> 	|	'\uFF3F' // FULLWIDTH LOW LINE
> 	;
>
> // TBD: Inclusion of all decimal digit number characters. Replace this rule with the one in UnicodeClassNd.g.
> fragment UNICODE_CLASS_Nd
> 	:	'\u0030' // DIGIT ZERO
> 	|	'\u0031' // DIGIT ONE
> 	|	'\u0032' // DIGIT TWO
> 	|	'\u0033' // DIGIT THREE
> 	|	'\u0034' // DIGIT FOUR
> 	|	'\u0035' // DIGIT FIVE
> 	|	'\u0036' // DIGIT SIX
> 	|	'\u0037' // DIGIT SEVEN
> 	|	'\u0038' // DIGIT EIGHT
> 	|	'\u0039' // DIGIT NINE
> 	;
>
> token
> 	:	identifier
> 	|	KEYWORD[true] // Use all keywords
> 	|	integer_literal
> 	|	real_literal
> 	|	character_literal
> 	|	string_literal
> 	|	OPREATER_OR_PUNCTUATOR
> 	;
>
> identifier
> 	:	available_identifier
> 	|	'@' identifier_or_keyword[true]
> 	;
> 	
> fragment available_identifier
> 	:	identifier_or_keyword[false] // An identifier_or_keyword that is not a keyword
> 	;
>
> // The boolean allowKeywords determines whether identifier_or_keyword may actually include keywords in the current context.
> fragment identifier_or_keyword[bool allowKeywords]
> 	:	identifier_start_character identifier_part_character*
> 	;
> 	
> fragment identifier_start_character
> 	:	letter_character
> 	|	'_' // (the underscore character U+005F)
> 	;
> 	
> fragment identifier_part_character
> 	:	letter_character
> 	|	decimal_digit_character
> 	|	connecting_character
> 	|	combining_character
> 	|	formatting_character
> 	;
> 	
> fragment letter_character
> 	:	UNICODE_CLASS_Lu // A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl
> 	|	UNICODE_CLASS_Ll
> 	|	UNICODE_CLASS_Lt
> 	|	UNICODE_CLASS_Lm
> 	|	UNICODE_CLASS_Lo
> 	|	UNICODE_CLASS_Nl
> 	|	unicode_escape_sequence["LAndNl"] // An encoded character of classes Lu, Ll, Lt, Lm, Lo, or Nl
> 	;
>
> fragment combining_character
> 	:	UNICODE_CLASS_Mn // A Unicode character of classes Mn or Mc
> 	|	UNICODE_CLASS_Mc
> 	|	unicode_escape_sequence["MnAndMc"] // An encoded character of classes Mn or Mc
> 	;
> 	
> fragment decimal_digit_character
> 	:	UNICODE_CLASS_Nd // A Unicode character of the class Nd
> 	|	unicode_escape_sequence["Nd"] // An encoded character of classes Nd
> 	;
> 	
> fragment connecting_character
> 	:	UNICODE_CLASS_Pc // A Unicode character of the class Pc
> 	|	unicode_escape_sequence["Pc"] // An encoded character of classes Pc
> 	;
> 	
> fragment formatting_character
> 	:	UNICODE_CLASS_Cf // A Unicode character of the class Cf
> 	|	unicode_escape_sequence["Cf"] // An encoded character of classes Cf
> 	;
> 	
> // Allowed unicodeClasses values are "LAndNl", "MnAndMc", "Nd", "Pc", "Cf" and "SingleCharacter".
> // The classes restrict the possible Unicode values according to the Unicode standard.
> // "SingleCharacter" allows every value between U+0000 and U+FFFF inclusive.
> // Detect if '\' is followed by a character not of this group: ', ", \, 0, a, b, f, n, r, t, u, U, x, v
> fragment unicode_escape_sequence[string unicodeClasses]
> 	:	'\u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
> 	|	'\U' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT  
> HEX_DIGIT HEX_DIGIT
> 	;
>
> // This boolean allows the exclusion of the keywords 'true' and 'false'
> KEYWORD[bool useBooleanKeywords]
> 	:	'abstract'
> 	|	'as'
> 	|	'base'
> 	|	'bool'
> 	|	'break'
> 	|	'byte'
> 	|	'case'
> 	|	'catch'
> 	|	'char'
> 	|	'checked'
> 	|	'class'
> 	|	'const'
> 	|	'continue'
> 	|	'decimal'
> 	|	'default'
> 	|	'delegate'
> 	|	'do'
> 	|	'double'
> 	|	'else'
> 	|	'enum'
> 	|	'event'
> 	|	'explicit'
> 	|	'extern'
> 	|	'false'
> 	|	'finally'
> 	|	'fixed'
> 	|	'float'
> 	|	'for'
> 	|	'foreach'
> 	|	'goto'
> 	|	'if'
> 	|	'implicit'
> 	|	'in'
> 	|	'int'
> 	|	'interface'
> 	|	'internal'
> 	|	'is'
> 	|	'lock'
> 	|	'long'
> 	|	'namespace'
> 	|	'new'
> 	|	'null'
> 	|	'object'
> 	|	'operator'
> 	|	'out'
> 	|	'override'
> 	|	'params'
> 	|	'private'
> 	|	'protected'
> 	|	'public'
> 	|	'readonly'
> 	|	'ref'
> 	|	'return'
> 	|	'sbyte'
> 	|	'sealed'
> 	|	'short'
> 	|	'sizeof'
> 	|	'stackalloc'
> 	|	'static'
> 	|	'string'
> 	|	'struct'
> 	|	'switch'
> 	|	'this'
> 	|	'throw'
> 	|	'true'
> 	|	'try'
> 	|	'typeof'
> 	|	'uint'
> 	|	'ulong'
> 	|	'unchecked'
> 	|	'unsafe'
> 	|	'ushort'
> 	|	'using'
> 	|	'virtual'
> 	|	'void'
> 	|	'volatile'
> 	|	'while'
> 	;
> 	
> BOOLEAN_LITERAL
> 	:	'true'
> 	|	'false'
> 	;
> 	
> integer_literal
> 	:	decimal_integer_literal
> 	|	hexadecimal_integer_literal
> 	;
> 	
> fragment decimal_integer_literal
> 	:	DECIMAL_DIGIT+   INTEGER_TYPE_SUFFIX?
> 	;
>
> fragment DECIMAL_DIGIT
> 	: '0'..'9'
> 	;
> 	
> fragment INTEGER_TYPE_SUFFIX
> 	:	'U'
> 	|	'u'
> 	|	'L'
> 	|	'l'
> 	|	'UL'
> 	|	'Ul'
> 	|	'uL'
> 	|	'ul'
> 	|	'LU'
> 	|	'Lu'
> 	|	'lU'
> 	|	'lu'
> 	;
> 	
> fragment hexadecimal_integer_literal
> 	:	'0x'   HEX_DIGIT+   INTEGER_TYPE_SUFFIX?
> 	|	'0X'   HEX_DIGIT+   INTEGER_TYPE_SUFFIX?
> 	;
> 	
> fragment HEX_DIGIT
> 	:	'0'..'9'
> 	|	'A'..'F'
> 	|	'a'..'f'
> 	;
> 	
> real_literal
> 	:	DECIMAL_DIGIT+ '.' DECIMAL_DIGIT+ exponent_part? REAL_TYPE_SUFFIX?
> 	|	'.' DECIMAL_DIGIT+ exponent_part? REAL_TYPE_SUFFIX?
> 	|	DECIMAL_DIGIT+ exponent_part REAL_TYPE_SUFFIX?
> 	|	DECIMAL_DIGIT+ REAL_TYPE_SUFFIX
> 	;
>
> fragment exponent_part
> 	:	'e' SIGN? DECIMAL_DIGIT+
> 	|	'E' SIGN? DECIMAL_DIGIT+
> 	;
>
> fragment SIGN
> 	:	'+'
> 	|	'-'
> 	;
> 	
> fragment REAL_TYPE_SUFFIX
> 	:	'F'
> 	|	'f'
> 	|	'D'
> 	|	'd'
> 	|	'M'
> 	|	'm'
> 	;
>
> character_literal
> 	:	''' character '''
> 	;
> 	
> fragment character
> 	:	SINGLE_CHARACTER
> 	|	SIMPLE_ESCAPE_SEQUENCE
> 	|	hexadecimal_escape_sequence
> 	|	unicode_escape_sequence
> 	;
> 	
> fragment SINGLE_CHARACTER
> 	:	~(''' | '\' | NEW_LINE_CHARACTER )
> 	;
>
> // Detect if '\' is followed by a character not of this group: ', ", \, 0, a, b, f, n, r, t, u, U, x, v
> fragment SIMPLE_ESCAPE_SEQUENCE
> 	:	'\''
> 	|	'\"'
> 	|	'\\'
> 	|	'\0'
> 	|	'\a'
> 	|	'\b'
> 	|	'\f'
> 	|	'\n'
> 	|	'\r'
> 	|	'\t'
> 	|	'\v'
> 	;
> 	
> // Detect if '\' is followed by a character not of this group: ', ", \, 0, a, b, f, n, r, t, u, U, x, v
> fragment hexadecimal_escape_sequence
> 	:	'\x' HEX_DIGIT HEX_DIGIT? HEX_DIGIT? HEX_DIGIT?
> 	;
> 	
> string_literal
> 	:	regular_string_literal
> 	|	verbatim_string_literal
> 	;
> 	
> regular_string_literal
> 	:	'"' regular_string_literal_character* '"'
> 	;
>
> fragment regular_string_literal_character
> 	:	SINGLE_REGULAR_STRING_LITERAL_CHARACTER
> 	|	SIMPLE_ESCAPE_SEQUENCE
> 	|	hexadecimal_escape_sequence
> 	|	unicode_escape_sequence
> 	;
>
> fragment SINGLE_REGULAR_STRING_LITERAL_CHARACTER
> 	:	~( '"' | '\' | NEW_LINE_CHARACTER )
> 	;
> 	
> verbatim_string_literal
> 	:	'@"' verbatim_string_literal_character* '"'
> 	;
>
> fragment verbatim_string_literal_character
> 	:	SINGLE_VERBATIM_STRING_LITERAL_CHARACTER
> 	|	QUOTE_ESCAPE_SEQUENCE
> 	;
> 	
> fragment SINGLE_VERBATIM_STRING_LITERAL_CHARACTER
> 	:	~'"'
> 	;
> 	
> fragment QUOTE_ESCAPE_SEQUENCE
> 	:	'""'
> 	;
>
> NULL_LITERAL
> 	:	'null'
> 	;
> 	
> OPREATER_OR_PUNCTUATOR
> 	:	'{'
> 	|	'}'
> 	|	'['
> 	|	']'
> 	|	'('
> 	|	')'
> 	|	'.'
> 	|	','
> 	|	':'
> 	|	';'
> 	|	'+'
> 	|	'-'
> 	|	'*'
> 	|	'/'
> 	|	'%'
> 	|	'&'
> 	|	'|'
> 	|	'^'
> 	|	'!'
> 	|	'~'
> 	|	'='
> 	|	'<'
> 	|	'>'
> 	|	'?'
> 	|	'??'
> 	|	'::'
> 	|	'++'
> 	|	'--'
> 	|	'&&'
> 	|	'||'
> 	|	'->'
> 	|	'=='
> 	|	'!='
> 	|	'<='
> 	|	'>='
> 	|	'+='
> 	|	'-='
> 	|	'*='
> 	|	'/='
> 	|	'%='
> 	|	'&='
> 	|	'|='
> 	|	'^='
> 	|	'<<'
> 	|	'<<='
> 	;
> 	
> fragment right_shift
> 	:	'>' '>'
> 	;
> 	
> fragment right_shift_assignment
> 	:	'>' '>='
> 	;
>
> // The compiler has to tell if some preprocessor directives are missing or out of order (regions and conditionals)
> pp_directive
> 	:	pp_declaration
> 	|	pp_conditional
> 	|	pp_line
> 	|	pp_diagnostic
> 	|	pp_region
> 	|	pp_pragma
> 	;
> 	
> conditional_symbol
> 	:	identifier
> 	|	KEYWORD[false] // Any keyword except 'true' or 'false'
> 	;
>
> pp_expression
> 	:	whitespace? pp_or_expression whitespace?
> 	;
> 	
> pp_or_expression
> 	:	(pp_and_expression) (whitespace? '||' whitespace? pp_and_expression)*
> 	;
>
> pp_and_expression
> 	:	(pp_equality_expression) (whitespace? '&&' whitespace?  
> pp_equality_expression)*
> 	;
>
> pp_equality_expression
> 	:	(pp_unary_expression) (whitespace? '==' whitespace?  
> pp_unary_expression | whitespace? '!=' whitespace?  
> pp_unary_expression)*
> 	;
>
> pp_unary_expression
> 	:	pp_primary_expression
> 	|	'!' whitespace? pp_unary_expression
> 	;
>
> pp_primary_expression
> 	:	'true'
> 	|	'false'
> 	|	conditional_symbol
> 	|	'(' whitespace? pp_expression whitespace? ')'
> 	;
>
> /*
> The processing of a #define directive causes the given conditional  
> compilation symbol to become defined,
> starting with the source line that follows the directive. Likewise,  
> the processing of a #undef directive
> causes the given conditional compilation symbol to become  
> undefined, starting with the source line that
> follows the directive.
>
> Any #define and #undef directives in a source file shall occur  
> before the first token (§9.4) in the source
> file; otherwise a compile-time error occurs. In intuitive terms,  
> #define and #undef directives shall
> precede any “real code” in the source file.
> */
> pp_declaration
> 	:	whitespace? '#' whitespace? 'define' whitespace  
> conditional_symbol pp_new_line
> 	|	whitespace? '#' whitespace? 'undef' whitespace  
> conditional_symbol pp_new_line
> 	;
>
> pp_new_line
> 	:	whitespace? single_line_comment? NEW_LINE
> 	;
>
> /*
> A pp-conditional selects at most one of the contained conditional-sections for normal lexical processing:
>
> - The pp-expressions of the #if and #elif directives are evaluated  
> in order until one yields true. If an
> expression yields true, the conditional-section of the  
> corresponding directive is selected.
> - If all pp-expressions yield false, and if a #else directive is  
> present, the conditional-section of the
> #else directive is selected.
> - Otherwise, no conditional-section is selected.
>
> The selected conditional-section, if any, is processed as a normal  
> input-section: the source code contained in
> the section shall adhere to the lexical grammar; tokens are  
> generated from the source code in the section; and
> pre-processing directives in the section have the prescribed effects.
>
> The remaining conditional-sections, if any, are processed as  
> skipped-sections: except for pre-processing
> directives, the source code in the section need not adhere to the  
> lexical grammar; no tokens are generated
> from the source code in the section; and pre-processing directives  
> in the section shall be lexically correct but
> are not otherwise processed. Within a conditional-section that is  
> being processed as a skipped-section, any
> nested conditional-sections (contained in nested #if...#endif and  
> #region...#endregion constructs) are
> also processed as skipped-sections.
> */
> pp_conditional
> 	:	pp_if_section pp_elif_section* pp_else_section? pp_endif
> 	;
>
> pp_if_section
> 	:	whitespace? '#' whitespace? 'if' whitespace pp_expression  
> pp_new_line conditional_section?
> 	;
>
> pp_elif_section
> 	:	whitespace? '#' whitespace? 'elif' whitespace pp_expression  
> pp_new_line conditional_section?
> 	;
>
> pp_else_section
> 	:	whitespace? '#' whitespace? 'else' pp_new_line conditional_section?
> 	;
>
> pp_endif
> 	:	whitespace? '#' whitespace? 'endif' pp_new_line
> 	;
>
> conditional_section
> 	:	input_section
> 	|	skipped_section+
> 	;
>
> 	
> skipped_section
> 	:	whitespace? skipped_characters? NEW_LINE
> 	|	pp_directive
> 	;
>
> skipped_characters
> 	:	NOT_NUMBER_SIGN  INPUT_CHARACTER*
> 	;
>
> NOT_NUMBER_SIGN
> 	:	~'#' // Any input_character except #
> 	;
> 	
> pp_diagnostic
> 	:	whitespace? '#' whitespace? 'error' pp_message
> 	|	whitespace? '#' whitespace? 'warning' pp_message
> 	;
>
> pp_message
> 	:	NEW_LINE
> 	|	whitespace INPUT_CHARACTER* NEW_LINE
> 	;
>
> /*
> No semantic meaning is attached to a region; regions are intended  
> for use by the programmer or by
> automated tools to mark a section of source code. The message  
> specified in a #region or #endregion
> directive likewise has no semantic meaning; it merely serves to  
> identify the region. Matching
> #region and #endregion directives can have different pp-messages.
>
> The lexical processing of a region:
>
> #region
> ...
> #endregion
>
> corresponds exactly to the lexical processing of a conditional  
> compilation directive of the form:
>
> #if true
> ...
> #endif
> */
> pp_region
> 	:	pp_start_region conditional_section? pp_end_region
> 	;
>
> pp_start_region
> 	:	whitespace? '#' whitespace? 'region' pp_message
> 	;
>
> pp_end_region
> 	:	whitespace? '#' whitespace? 'endregion' pp_message
> 	;
>
> /*
> When no #line directives are present, the compiler reports true line numbers and source file names in its
> output. When processing a #line directive that includes a line-indicator that is not identifier-or-keyword,
> the compiler treats the line after the directive as having the given line number (and file name, if specified).
>
> A #line directive in which the line-indicator is an identifier-or-keyword whose value equals default
> (using equality as specified in §9.4.2) reverses the effect of all preceding #line directives. The compiler
> reports true line information for subsequent lines, precisely as if no #line directives had been processed.
>
> The purpose of a line-indicator with an identifier-or-keyword whose value does not equal default is
> implementation-defined. An implementation that does not recognize such an identifier-or-keyword in a
> line-indicator shall issue a warning.
> */
> pp_line
> 	:	whitespace? '#' whitespace? 'line' whitespace line_indicator  
> pp_new_line
> 	;
>
> line_indicator
> 	:	DECIMAL_DIGIT+ whitespace file_name
>  	|	DECIMAL_DIGIT+
>  	|	identifier_or_keyword
>  	;
>
> file_name
> 	:	'"' FILE_NAME_CHARACTER+ '"'
> 	;
>
> FILE_NAME_CHARACTER
> 	:	~( '"' | NEW_LINE_CHARACTER ) // Any character except " (U+0022), and new_line_character
> 	;
>
> pp_pragma
> 	:	whitespace? '#' whitespace? 'pragma' pp_pragma_text
> 	;
> 	
> pp_pragma_text
> 	:	NEW_LINE
> 	|	whitespace INPUT_CHARACTER* NEW_LINE
> 	;
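
The literals in the grammar above can also be checked mechanically for a \u that is not followed by four hex digits. The following is only a sketch: the class name, method name, and regular expression are illustrative assumptions, not part of ANTLR or ANTLRWorks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class MalformedEscapeFinder {
    // A backslash followed by 'u' that is NOT followed by four hex digits.
    private static final Pattern BAD_ESCAPE =
            Pattern.compile("\\\\u(?![0-9A-Fa-f]{4})");

    // Returns the 1-based numbers of the lines that contain a suspicious \u.
    static List<Integer> findLines(String grammarText) {
        List<Integer> hits = new ArrayList<>();
        String[] rows = grammarText.split("\n", -1);
        for (int n = 0; n < rows.length; n++) {
            if (BAD_ESCAPE.matcher(rows[n]).find()) {
                hits.add(n + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Two sample rules: the first literal is well formed, the second is not.
        String sample = "A : '\\u0041' ;\n"
                      + "B : '\\u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT ;\n";
        System.out.println("suspicious lines: " + findLines(sample));
    }
}
```

Run over the attached grammar, a check like this would flag the '\u' literal in unicode_escape_sequence. It does not distinguish an escaped backslash followed by u, so a hit is a hint to look at the line, not proof of an error.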


