[antlr-interest] antlr-interest Digest, Vol 30, Issue 33
Johannes Luber
jaluber at gmx.de
Wed May 16 07:53:31 PDT 2007
Mark Venbrux wrote:
> Subject: Re: [antlr-interest] Separating text from comment in a source file
> Mark Venbrux wrote:
> > Hi,
> >
> > I would like to separate text from comment in a source file. In the end
> > I would like to keep the source code as is, and to process the comments.
> > I tried the following grammar, but it doesn't work as expected.
> > Warning: [17:01:59] warning(201): d:\temp\antlr.g:26:40: The following
> > alternatives are unreachable: 1
> > This is about the TEXT rule.
> >
> > Comments are picked up, but text is skipped. What is wrong here?
> >
> > Cheers,
> > Mark
>
> Why is filter=true? With your grammar you don't actually skip anything
> but match all text into some tokens. If this isn't enough, then you need
> to change TEXT into something like that:
>
>
> Thanks for the quick response.
> OK I cleaned up my grammar. In AWorks I now get:
> line 1:0 required (...)+ loop did not match anything at character 'l'
> line 1:1 required (...)+ loop did not match anything at character 'i'
> line 1:2 required (...)+ loop did not match anything at character 'n'
> line 1:3 required (...)+ loop did not match anything at character 'e'
> ....
> with an early exit exception. I also fiddled with syntactic predicates
> but this doesn't work either.
> Should be simple enough though? This is the grammar:
>
>> grammar MiCoGen;
>>
>> file: (
>> ML_COMMENT {System.out.print("CS: "+$ML_COMMENT.text);}
>> | TEXT {System.out.print("TE: "+$TEXT.text);}
>> )+
>> ;
>>
>>
>> ML_COMMENT : '/*' (options {greedy=false;} : .)+ '*/' ;
>>
>> TEXT : (options {greedy=false;} : ~COMMENT_STARTER )+;
>>
>> COMMENT_STARTER : '/*';
I've tried to circumvent the exception (I have no idea why it actually
happens; maybe someone else can shed light on it) with another grammar
based on C#. While I could define the comments exactly, I didn't manage
to define the text correctly. Below is the important part of the C#
grammar, with some extra rules you may find useful. Maybe you can find
the solution from there.
Best regards,
Johannes Luber
input
: input_section* EOF
;
input_section
: input_element* NEW_LINE
| pp_directive
;
input_element
: whitespace
| comment
| token
;
whitespace
: WHITESPACE_CHARACTERS
;
fragment WHITESPACE_CHARACTERS
: WHITESPACE_CHARACTER+
;
fragment WHITESPACE_CHARACTER
: UNICODE_CLASS_Zs
| '\u0009' // Horizontal tab character
| '\u000B' // Vertical tab character
| '\u000C' // Form feed character
;
NEW_LINE
: '\u000D' // Carriage return character
| '\u000A' // Line feed character
| '\r\n' // Carriage return character followed by line feed character ('\u000D\u000A' doesn't work in Java)
| '\u0085' // Next line character
| '\u2028' // Line separator character
| '\u2029' // Paragraph separator character
;
comment
: single_line_comment
| delimited_comment
;
single_line_comment
: '//' INPUT_CHARACTER*
;
fragment INPUT_CHARACTER
: ~NEW_LINE_CHARACTER // Any Unicode character except a new_line_character
;
NEW_LINE_CHARACTER
: '\u000D' // Carriage return character
| '\u000A' // Line feed character
| '\u0085' // Next line character
| '\u2028' // Line separator character
| '\u2029' // Paragraph separator character
;
delimited_comment
: '/*' DELIMITED_COMMENT_SECTION* ASTERISKS '/'
;
fragment DELIMITED_COMMENT_SECTION
: NOT_ASTERISK
| ASTERISKS NOT_SLASH
;
fragment ASTERISKS
: ('*') ('*')*
;
fragment NOT_ASTERISK
: ~'*' // Any Unicode character except *
;
fragment NOT_SLASH
: ~'/' // Any Unicode character except /
;
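As a cross-check on the delimited_comment rules above, the same shape can be written as a regular expression. The Java sketch below is only an illustration: the class and method names are made up, and it is not part of the grammar. It also makes one deliberate change, which is an assumption on my part: after a run of asterisks it accepts only a character that is neither '*' nor '/', because read literally, ASTERISKS NOT_SLASH could split a run such as `**/` and carry on past the end of the comment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the grammar in this mail.
final class DelimitedCommentCheck {
    // '/*' ( not-asterisk | asterisks not-asterisk-or-slash )* asterisks '/'
    static final Pattern DELIMITED_COMMENT =
            Pattern.compile("/\\*(?:[^*]|\\*+[^*/])*\\*+/");

    /** Returns the first delimited comment in the input, or null if there is none. */
    static String firstComment(String input) {
        Matcher m = DELIMITED_COMMENT.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The comment ends at the first "*/", even after a run of asterisks.
        System.out.println(firstComment("int x; /* a **/ int y; */")); // prints "/* a **/"
        System.out.println(firstComment("/*/ not closed"));            // prints "null"
    }
}
```

Negated character classes in java.util.regex also match line terminators, so the pattern covers comments that span several lines without any extra flag.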
fragment UNICODE_CLASS_Zs // Any character with Unicode class Zs (18 characters known)
: '\u0020' // SPACE
| '\u00A0' // NO_BREAK SPACE
| '\u1680' // OGHAM SPACE MARK
| '\u180E' // MONGOLIAN VOWEL SEPARATOR
| '\u2000' // EN QUAD
| '\u2001' // EM QUAD
| '\u2002' // EN SPACE
| '\u2003' // EM SPACE
| '\u2004' // THREE_PER_EM SPACE
| '\u2005' // FOUR_PER_EM SPACE
| '\u2006' // SIX_PER_EM SPACE
| '\u2007' // FIGURE SPACE
| '\u2008' // PUNCTUATION SPACE
| '\u2009' // THIN SPACE
| '\u200A' // HAIR SPACE
| '\u202F' // NARROW NO_BREAK SPACE
| '\u3000' // IDEOGRAPHIC SPACE
| '\u205F' // MEDIUM MATHEMATICAL SPACE
;
token
: identifier
| keyword
| integer_literal
| real_literal
| character_literal
| string_literal
| OPERATOR_OR_PUNCTUATOR
;
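For comparison, here is what the original MiCoGen grammar is trying to achieve, written by hand in plain Java instead of ANTLR. This is a minimal sketch under stated assumptions, not a fix for the grammar: the class name CommentSplitter and the method split are hypothetical, only `/* ... */` comments are recognised, and a `/*` inside a string literal or a `//` comment is not treated specially. The "TE: " and "CS: " prefixes are the ones used in the actions of the file rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the intended TEXT / ML_COMMENT split.
final class CommentSplitter {
    /** Splits the input into "TE: ..." and "CS: ..." segments, in source order. */
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int start = input.indexOf("/*", pos);
            if (start < 0) {
                // No further comment: everything that is left is text.
                parts.add("TE: " + input.substring(pos));
                break;
            }
            if (start > pos) {
                // Text between the previous position and the comment start.
                parts.add("TE: " + input.substring(pos, start));
            }
            // Search for the terminator after the opening "/*", so "/*/" is not a comment.
            int end = input.indexOf("*/", start + 2);
            if (end < 0) {
                // Assumption: an unterminated comment runs to the end of the input.
                parts.add("CS: " + input.substring(start));
                break;
            }
            parts.add("CS: " + input.substring(start, end + 2));
            pos = end + 2;
        }
        return parts;
    }

    public static void main(String[] args) {
        // Prints "TE: a=1;", "CS: /* note */" and "TE: b=2;" on separate lines.
        for (String part : split("a=1;/* note */b=2;")) {
            System.out.println(part);
        }
    }
}
```

The point of the sketch is that text is simply everything up to the next `/*`, which is what the TEXT rule has to express one character at a time.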