[antlr-interest] xml grammar

Oliver Zeigermann oliver.zeigermann at gmail.com
Tue Nov 15 10:51:16 PST 2005


Hi Torsten!

You will need something like lexer modes, which you can simulate using
semantic predicates. In one mode, the text mode, you match character
data; in the other, the tag mode, you match names and all those special
characters. You will have to switch to tag mode upon seeing the LT
and switch back to text mode upon GT.
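
To illustrate the mode switching, here is a minimal hand-written Java
sketch rather than actual ANTLR grammar syntax. The class TwoModeLexer,
the tokenize method and the tagMode flag are hypothetical names; the
token names are borrowed from the grammar quoted below. Dropping
whitespace inside tags is a simplification of mine. In an ANTLR lexer
the same flag would presumably be tested by semantic predicates on the
rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written sketch of a two-mode lexer. Class, method and flag names
// are hypothetical; token names follow the grammar quoted below.
final class TwoModeLexer {

    static List<String> tokenize(String in) {
        List<String> out = new ArrayList<>();
        boolean tagMode = false;               // false: text mode, true: tag mode
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (!tagMode) {
                if (c == '<') {                // LT switches to tag mode
                    out.add("LT");
                    tagMode = true;
                    i++;
                } else {                       // text runs up to the next '<'
                    int start = i;
                    while (i < in.length() && in.charAt(i) != '<') i++;
                    out.add("CHARACTERS(" + in.substring(start, i) + ")");
                }
            } else if (c == '>') {             // GT switches back to text mode
                out.add("GT");
                tagMode = false;
                i++;
            } else if (c == '/') { out.add("SLASH"); i++; }
            else if (c == '=') { out.add("EQ"); i++; }
            else if (c == ':') { out.add("COLON"); i++; }
            else if (Character.isWhitespace(c)) { i++; }   // simplification: WS is dropped
            else if (c == '"') {
                int start = ++i;               // skip the opening quote
                while (i < in.length() && in.charAt(i) != '"') {
                    if (in.charAt(i) == '\\') i++;         // skip the escaped character
                    i++;
                }
                if (i >= in.length()) {
                    throw new IllegalArgumentException("unterminated quoted literal");
                }
                out.add("QLITERAL(" + in.substring(start, i) + ")");
                i++;                           // skip the closing quote
            } else if (isNameChar(c)) {
                int start = i;
                while (i < in.length() && isNameChar(in.charAt(i))) i++;
                out.add("LITERAL(" + in.substring(start, i) + ")");
            } else {
                throw new IllegalArgumentException("unexpected '" + c + "' at " + i);
            }
        }
        return out;
    }

    // Same character set as the LITERAL rule in the quoted grammar.
    private static boolean isNameChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == '_' || c == '-';
    }
}
```

Calling TwoModeLexer.tokenize on a small document shows that the same
letters come out as LITERAL inside a tag and as CHARACTERS outside of
it, which is exactly the distinction a single-mode lexer cannot make.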

Oliver

P.S.: There actually is an existing XML grammar in the examples that
come with ANTLR. It is a lexer-only solution, though.

2005/11/15, Torsten Curdt <tcurdt at vafer.org>:
> I have to cope with a pre-XML-standard format, so I cannot use one
> of the popular parsers. I am surprised I cannot find an ANTLR
> XML grammar on the net.
>
> I gave it a try and I've run into some problems with the
> lexer, as it cannot distinguish between the literal
> that identifies a tag name and the characters inside a tag.
>
> From the context of the token it should be obvious what it
> is.
>
> Here is what I've come up with so far. Of course PIs, comments
> and DOCTYPE declarations are still missing...
>
> Does anyone have some insight into how to solve that?
>
> --------------
>
> header {
>      package my.package;
>      }
>
> class MyParser extends Parser;
>
> options {
>         k=2;
> }
>
> parse
>    :
>    ( tag )+
>    ;
>
> tag
>    : LT tag1:LITERAL (COLON tag2:LITERAL)? (WS)*
>      (attr1:LITERAL (COLON attr2:LITERAL)? EQ value1:QLITERAL (WS)* {  } )*
>      { System.out.println("started " + tag1 + tag2); }
>      ((SLASH GT) | (GT tagbody LT SLASH LITERAL (COLON LITERAL)? GT))
>      { System.out.println("end " + tag1 + tag2); }
>    ;
>
> tagbody
>    : (characters)? (tag (characters)?)*
>    ;
>
> characters
>    : text:CHARACTERS { System.out.println("characters[" + text + "]"); }
>    ;
>
> class MyLexer extends Lexer;
>
> options {
>         k=2;
>         charVocabulary='\u0000'..'\uFFFE';
> }
>
> CHARACTERS:
>    (~('<'))+
>    ;
>
> LITERAL:
>    ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')+
>    ;
>
> QLITERAL:
>    '"'! (ESC | ~('\\'|'"'))* '"'!
>    ;
>
> protected
> ESC:
>    '\\' ('\\'|'t'|'n'|'r'|'"') ;
>
>
> WS : (' '|'\t'|'\r'|'\n') ;
>
> LT : '<' ;
> GT : '>' ;
> EQ : '=' ;
> COLON : ':' ;
> SLASH : '/' ;
>
>
>
>

