[antlr-interest] antlr-interest Digest, Vol 30, Issue 33

Wed May 16 07:53:31 PDT 2007

Mark Venbrux wrote:
>     Subject: Re: [antlr-interest] Separating text from comment in a source file
>     Mark Venbrux wrote:
>     > Hi,
>     >
>     > I would like to separate text from comment in a source file. In the end
>     > I would like to keep the source code as is, and to process the comments.
>     > I tried the following grammar, but it doesn't work as expected.
>     > Warning: [17:01:59] warning(201): d:\temp\antlr.g:26:40: The following
>     > alternatives are unreachable: 1
>     > This is about the TEXT rule.
>     >
>     > Comments are picked up, but text is skipped. What is wrong here?
>     >
>     > Cheers,
>     > Mark
> 
>     Why is filter=true? With your grammar you don't actually skip anything
>     but match all text into some tokens. If this isn't enough, then you need
>     to change TEXT into something like that: 
> 
> 
> Thanks for the quick response.
> OK I cleaned up my grammar. In AWorks I now get:
> line 1:0 required (...)+ loop did not match anything at character 'l'
> line 1:1 required (...)+ loop did not match anything at character 'i'
> line 1:2 required (...)+ loop did not match anything at character 'n'
> line 1:3 required (...)+ loop did not match anything at character 'e'
> ....
> with an early exit exception. I also fiddled with syntactic predicates
> but this doesn't work either.
> Should be simple enough though? This is the grammar:
> 
>> grammar MiCoGen;
>>    
>>  file:  (
>>                 ML_COMMENT  {System.out.print("CS: "+$ML_COMMENT.text);}
>>         |       TEXT        {System.out.print("TE: "+$TEXT.text);}
>>         )+
>>         ;
>>    
>>    
>>  ML_COMMENT : '/*' (options {greedy=false;} : .)+ '*/' ;
>>    
>>  TEXT :  (options {greedy=false;} : ~COMMENT_STARTER )+;
>>    
>>  COMMENT_STARTER : '/*';

I've tried to work around the exception (I have no idea why it actually
happens; maybe someone else can shed light on it) with another grammar
based on C#. While I could define the comments exactly, I didn't manage
to define the text correctly. Below is the important part of the C#
grammar, with some extra rules you may find useful. Maybe you can find
the solution from there.

Best regards,
Johannes Luber

input
	:	input_section* EOF
	;

input_section
	:	input_element* NEW_LINE
	|	pp_directive
	;

input_element
	:	whitespace
	|	comment
	|	token
	;

whitespace
	:	WHITESPACE_CHARACTERS
	;

fragment WHITESPACE_CHARACTERS
	:	WHITESPACE_CHARACTER+
	;

fragment WHITESPACE_CHARACTER
	:	UNICODE_CLASS_Zs
	|	'\u0009' // Horizontal tab character
	|	'\u000B' // Vertical tab character
	|	'\u000C' // Form feed character
	;

NEW_LINE
	:	'\u000D' // Carriage return character
	|	'\u000A' // Line feed character
	|	'\r\n'	 // Carriage return character followed by line feed character ('\u000D\u000A' doesn't work in Java)
	|	'\u0085' // Next line character
	|	'\u2028' // Line separator character
	|	'\u2029' // Paragraph separator character
	;

comment
	:	single_line_comment
	|	delimited_comment
	;

single_line_comment
	:	'//' INPUT_CHARACTER*
	;

fragment INPUT_CHARACTER
	:	~NEW_LINE_CHARACTER // Any Unicode character except a new_line_character
	;

NEW_LINE_CHARACTER
	:	'\u000D' // Carriage return character
	|	'\u000A' // Line feed character
	|	'\u0085' // Next line character
	|	'\u2028' // Line separator character
	|	'\u2029' // Paragraph separator character
	;

delimited_comment
	:	'/*' DELIMITED_COMMENT_SECTION* ASTERISKS '/'
	;

fragment DELIMITED_COMMENT_SECTION
	:	NOT_ASTERISK
	|	ASTERISKS NOT_SLASH
	;

fragment ASTERISKS
	:	('*') ('*')*
	;

fragment NOT_ASTERISK
	:	~'*' // Any Unicode character except *
	;

fragment NOT_SLASH
	:	~'/' // Any Unicode character except /
	;
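
For comparison, the delimited_comment rule above can be written as an
ordinary Java regular expression. This is only a sketch, not part of the
grammar; the class and method names are made up for illustration. One
deliberate difference: after a run of asterisks the pattern excludes both
'/' and '*', because with a plain "not slash" class a backtracking regex
engine can run past the first "*/" on input such as "/* a **/ b */".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimitedCommentRegex {
    // '/*' ( NOT_ASTERISK | ASTERISKS <neither '/' nor '*'> )* ASTERISKS '/'
    // Assumption: the second alternative is tightened compared to NOT_SLASH.
    static final Pattern DELIMITED_COMMENT =
        Pattern.compile("/\\*(?:[^*]|\\*+[^/*])*\\*+/");

    /** Returns every delimited comment found in src, in order of appearance. */
    public static List<String> findComments(String src) {
        List<String> found = new ArrayList<>();
        Matcher m = DELIMITED_COMMENT.matcher(src);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        for (String c : findComments("a /* x **/ b /* y */ c")) {
            System.out.println(c);
        }
    }
}
```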

fragment UNICODE_CLASS_Zs // Any character with Unicode class Zs (18 characters known)
	:	'\u0020' // SPACE
	|	'\u00A0' // NO_BREAK SPACE
	|	'\u1680' // OGHAM SPACE MARK
	|	'\u180E' // MONGOLIAN VOWEL SEPARATOR
	|	'\u2000' // EN QUAD
	|	'\u2001' // EM QUAD
	|	'\u2002' // EN SPACE
	|	'\u2003' // EM SPACE
	|	'\u2004' // THREE_PER_EM SPACE
	|	'\u2005' // FOUR_PER_EM SPACE
	|	'\u2006' // SIX_PER_EM SPACE
	|	'\u2007' // FIGURE SPACE
	|	'\u2008' // PUNCTUATION SPACE
	|	'\u2009' // THIN SPACE
	|	'\u200A' // HAIR SPACE
	|	'\u202F' // NARROW NO_BREAK SPACE
	|	'\u3000' // IDEOGRAPHIC SPACE
	|	'\u205F' // MEDIUM MATHEMATICAL SPACE
	;
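
A hand-written list like the one above can be cross-checked against the
Unicode tables that ship with the JDK. The sketch below is one way to do
that, assuming a Java runtime; the class and method names are
illustrative only. Note that the result depends on the Unicode version of
the running JDK, so it may not agree exactly with the list in the grammar.

```java
public class ZsCheck {
    /** True if the running JDK classifies the code point as Zs (space separator). */
    public static boolean isSpaceSeparator(int codePoint) {
        return Character.getType(codePoint) == Character.SPACE_SEPARATOR;
    }

    public static void main(String[] args) {
        // Print every BMP code point the JDK reports as Zs, in grammar-literal form.
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (isSpaceSeparator(cp)) {
                System.out.printf("'\\u%04X'%n", cp);
            }
        }
    }
}
```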

token
	:	identifier
	|	keyword
	|	integer_literal
	|	real_literal
	|	character_literal
	|	string_literal
	|	OPERATOR_OR_PUNCTUATOR
	;
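
Since the goal of the thread is to keep the source text as is and pick
out the comments, here is a minimal sketch of that split in plain Java,
without ANTLR. It is not a fix for either grammar above. The class and
member names are made up, and it makes simplifying assumptions: only
/* ... */ comments are recognised, a "/*" inside a string literal is
treated as a comment start, and an unterminated comment runs to the end
of the input.

```java
import java.util.ArrayList;
import java.util.List;

public class CommentSplitter {
    /** One piece of the input: either a comment or ordinary text. */
    public static final class Segment {
        public final boolean isComment;
        public final String text;

        Segment(boolean isComment, String text) {
            this.isComment = isComment;
            this.text = text;
        }
    }

    /** Splits src into alternating text and comment segments, losing no characters. */
    public static List<Segment> split(String src) {
        List<Segment> out = new ArrayList<>();
        int pos = 0;
        while (pos < src.length()) {
            int start = src.indexOf("/*", pos);
            if (start < 0) {
                // No further comment: the rest is plain text.
                out.add(new Segment(false, src.substring(pos)));
                break;
            }
            if (start > pos) {
                out.add(new Segment(false, src.substring(pos, start)));
            }
            // Search for the terminator after the opening "/*".
            int end = src.indexOf("*/", start + 2);
            // Assumption: an unterminated comment extends to end of input.
            int stop = (end < 0) ? src.length() : end + 2;
            out.add(new Segment(true, src.substring(start, stop)));
            pos = stop;
        }
        return out;
    }

    public static void main(String[] args) {
        for (Segment s : split("int a; /* first */ int b; /* second */")) {
            // Same prefixes as the actions in the MiCoGen grammar.
            System.out.println((s.isComment ? "CS: " : "TE: ") + s.text);
        }
    }
}
```

Concatenating the segments in order gives back the original input, which
is the property the TEXT rule was meant to provide.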