[antlr-interest] Comments parser and non-alphanum characters

Cor Geboers cg0601 at hotmail.com
Mon Apr 19 01:45:04 PDT 2010


Hi, I have a problem with a parser that needs to handle a comment in a command language. The command language (CL) embeds its commands in an HTML comment pair: '<!--' command '-->'. I can parse most commands, except for the REM command, which is a remark whose text should be ignored.
I wrote a small test grammar that reproduces the problem:

grammar Remarks;

options {
  language = Java;
}

rule: commandLine+ ;

commandLine
    :   '<!--' command '-->'
    ;

command
    :   breakCommand 
    |   remarkCommand
    ;
    
remarkCommand
    :   REM (.)*
    ;
    
breakCommand
    :   BREAK
    ;
    
WS
    :   (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; }
    ;

REM
    :   '#' ('R'|'r') ('E'|'e') ('M'|'m')
    ;
    
BREAK
    :   '#' ('B'|'b')('R'|'r')('E'|'e')('A'|'a')('K'|'k');

IDENT : ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'9')*;

A sample command file might look like this:

<!-- #rem some comment -->
<!--        #break -->
<!-- #rem some comment with $AAA &*&^, A9a 5eee and 99922 and .<><> -->

The parser recognizes the rem commands and the break command, but some characters are lost. It also splits the comment text into other tokens (IDENT in this case). Ideally I would like to get all characters back as one piece, but I tried several constructs without success.
The last line is parsed even worse: all "special" characters such as $, &, etc. produce errors and do not show up in the tokens. The messages look like this:

line 3:28 no viable alternative at character '$'
line 3:33 no viable alternative at character '&'
line 3:34 no viable alternative at character '*'
line 3:35 no viable alternative at character '&'
line 3:36 no viable alternative at character '^'
line 3:37 no viable alternative at character ','
line 3:43 no viable alternative at character '5'
line 3:52 no viable alternative at character '9'
line 3:53 no viable alternative at character '9'

How can I define the comment so that all of its characters are either ignored or returned as a single rule or token? This should happen only inside a comment. I looked at other grammars for comments, such as C with /* */, and they seem to do about the same thing.
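
To make the result I am after concrete, here is a minimal plain-Java sketch (no ANTLR involved) that produces it for the sample file. It is only an illustration of the desired output, not a proposed fix for the grammar. The class name RemarkSketch, the scan method and the "REM:" prefix are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, only meant to show the desired result.
class RemarkSketch {
    // '<!--', optional whitespace, '#', the command name, then everything
    // up to the first '-->' (reluctant match), including any special characters.
    private static final Pattern COMMAND =
        Pattern.compile("<!--\\s*#(\\w+)(.*?)-->", Pattern.DOTALL);

    /** Returns one entry per command: the command name, or "REM:" plus the raw remark text. */
    static List<String> scan(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = COMMAND.matcher(input);
        while (m.find()) {
            String name = m.group(1).toUpperCase(Locale.ROOT);
            if (name.equals("REM")) {
                // The whole remark comes back as one piece, untouched.
                result.add("REM:" + m.group(2).trim());
            } else {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input =
              "<!-- #rem some comment -->\n"
            + "<!--        #break -->\n"
            + "<!-- #rem some comment with $AAA &*&^, A9a 5eee and 99922 and .<><> -->\n";
        for (String entry : scan(input)) {
            System.out.println(entry);
        }
    }
}
```

For the sample file this gives three entries: the first remark as "REM:some comment", then "BREAK", then the last remark with all of its special characters kept in a single string. That is the behaviour I would like from the ANTLR grammar.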
 		 	   		  

