[antlr-interest] Comments parser and non-alphanum characters

Mon Apr 19 06:08:29 PDT 2010

If you have control of the language, I'd change it to make it easier...

If you don't, that's much harder.  I'd parse it in two passes: one
that handles each <!-- ... --> block as a single token, and one that is
fed the text inside the <!-- ... --> and parses it.
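
To make the two-pass idea concrete, here is a rough sketch in plain Java. It is not ANTLR code: regular expressions stand in for both passes, and the class name TwoPassCommands and its parse method are made up for illustration. The first pass grabs each <!-- ... --> block as one opaque chunk; the second pass looks only at the text inside and splits off the command keyword, keeping whatever follows #rem verbatim. It assumes only the two commands from the question (#rem and #break) and silently ignores anything outside the blocks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of ANTLR.
final class TwoPassCommands {

    // Pass 1: each <!-- ... --> block is one opaque chunk.
    // DOTALL lets a block span lines; the reluctant .*? stops at the first -->.
    private static final Pattern BLOCK =
        Pattern.compile("<!--(.*?)-->", Pattern.DOTALL);

    // Pass 2: the command keyword, then everything after it, untouched.
    private static final Pattern COMMAND =
        Pattern.compile("\\s*#(rem|break)\\b(.*)",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one {name, text} pair per command block, in input order.
    static List<String[]> parse(String input) {
        List<String[]> commands = new ArrayList<>();
        Matcher block = BLOCK.matcher(input);
        while (block.find()) {
            String body = block.group(1);
            Matcher cmd = COMMAND.matcher(body);
            if (!cmd.matches()) {
                throw new IllegalArgumentException(
                    "Unknown command: " + body.trim());
            }
            String name = cmd.group(1).toLowerCase();
            // For #rem this is the whole remark, special characters included.
            String text = cmd.group(2).trim();
            commands.add(new String[] { name, text });
        }
        return commands;
    }
}
```

With ANTLR itself the same split would presumably be a lexer rule that matches the whole block as one token, plus a second small grammar (or hand-written code like the above) run over that token's text.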

That's been my plan for handling similar issues in a Wiki-like
language.  The only other way to handle it (that I know of) is with a
lot of error handling.  The fact that you're mixing two things, one
that is totally regular and structured and one that is free-form
text, inside the same area is a problem.  There's a reason every
language I know of has an explicit comment that is totally
unstructured other than the delimiters.

HTH,
Kirby

On Mon, Apr 19, 2010 at 3:45 AM, Cor Geboers <cg0601 at hotmail.com> wrote:
>
> Hi, I have a problem with a parser which needs to interpret a comment in a command language. The CL uses commands inside an HTML command pair: '<!--' command '-->' and I can parse most commands, except for the REM command which is a comment remark and should be ignored.
> I wrote a small test grammar, which shows the problem more or less:
>
> grammar Remarks;
>
> options {
>  language = Java;
> }
>
> rule: commandLine+ ;
>
> commandLine
>    :   '<!--' command '-->'
>    ;
>
> command
>    :   breakCommand
>    |   remarkCommand
>    ;
>
> remarkCommand
>    :   REM (.)*
>    ;
>
> breakCommand
>    :   BREAK
>    ;
>
> WS
>    :   (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; }
>    ;
>
> REM
>    :   '#' ('R'|'r') ('E'|'e') ('M'|'m')
>    ;
>
> BREAK
>    :   '#' ('B'|'b')('R'|'r')('E'|'e')('A'|'a')('K'|'k');
>
> IDENT : ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'9')*;
>
> A sample command file might look like this:
>
> <!-- #rem some comment -->
> <!--        #break -->
> <!-- #rem some comment with $AAA &*&^, A9a 5eee and 99922 and .<><> -->
>
> The parser recognizes the rem commands and the break command, but some characters are lost. It also divides the "comment" text into other tokens (IDENT in this case). Ideally I would like to get all characters back as one part, but I tried several constructs without any result.
> The last line is parsed even worse: all "special" characters like $, &, etc. generate errors and do not show up in any token. The errors/warnings generated look like this:
>
> line 3:28 no viable alternative at character '$'
> line 3:33 no viable alternative at character '&'
> line 3:34 no viable alternative at character '*'
> line 3:35 no viable alternative at character '&'
> line 3:36 no viable alternative at character '^'
> line 3:37 no viable alternative at character ','
> line 3:43 no viable alternative at character '5'
> line 3:52 no viable alternative at character '9'
> line 3:53 no viable alternative at character '9'
>
> How can I define the comment so that all characters are either ignored or returned as one rule or token? It should do so only when inside a comment. I looked at other grammars for comments, like C with /* */, and see they do about the same.