[antlr-interest] xml grammar

Tue Nov 15 20:26:25 PST 2005

Hey!

I'd recommend taking a peek at ANTXR, my ANTLR offshoot for XML parsing.
You'd want to look at the provided scanner for XMLPull as a starting point.

http://javadude.com/tools/antxr

The front-end uses SAX or XMLPull under the covers, but could easily take
whatever scanner you want to use to create the tokens.

This could be very helpful on the parsing end, but you're still on your own
for the scanning...
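
For the scanning side, here is a rough sketch of one way to go: a small hand-written scanner that remembers whether it is currently inside a tag, and only produces LITERAL/QLITERAL/EQ/etc. there, treating everything else as CHARACTERS. This is not part of ANTXR or ANTLR. The class name TagAwareScanner, the scan method, and the "TYPE(text)" string rendering of tokens are all made up for illustration. A real version would build ANTLR Token objects and would also need to handle comments, PIs and DOCTYPE.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner; all names here are illustrative assumptions.
class TagAwareScanner {

    // Returns the tokens rendered as "TYPE" or "TYPE(text)".
    static List<String> scan(String in) {
        List<String> out = new ArrayList<>();
        boolean insideTag = false; // the one bit of context the lexer needs
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (!insideTag) {
                if (c == '<') {
                    out.add("LT");
                    insideTag = true;
                    i++;
                } else {
                    // Outside a tag, everything up to the next '<' is character data.
                    int start = i;
                    while (i < in.length() && in.charAt(i) != '<') i++;
                    out.add("CHARACTERS(" + in.substring(start, i) + ")");
                }
            } else if (c == '>') {
                out.add("GT");
                insideTag = false;
                i++;
            } else if (c == '/') {
                out.add("SLASH");
                i++;
            } else if (c == '=') {
                out.add("EQ");
                i++;
            } else if (c == ':') {
                out.add("COLON");
                i++;
            } else if (Character.isWhitespace(c)) {
                i++; // whitespace inside a tag is skipped, not passed to the parser
            } else if (c == '"') {
                int start = ++i;
                while (i < in.length() && in.charAt(i) != '"') {
                    if (in.charAt(i) == '\\') i++; // step over the escaped character
                    i++;
                }
                // Escape sequences are kept raw in the token text.
                out.add("QLITERAL(" + in.substring(start, Math.min(i, in.length())) + ")");
                i++; // closing quote
            } else if (isNameChar(c)) {
                int start = i;
                while (i < in.length() && isNameChar(in.charAt(i))) i++;
                out.add("LITERAL(" + in.substring(start, i) + ")");
            } else {
                throw new IllegalArgumentException("unexpected '" + c + "' at offset " + i);
            }
        }
        return out;
    }

    // Same character set as the LITERAL rule in the quoted grammar.
    static boolean isNameChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == '_' || c == '-';
    }
}
```

Calling TagAwareScanner.scan on the input <a x="1">hi</a> should give "a" and "x" as LITERAL tokens and "hi" as a CHARACTERS token, because the insideTag flag decides which rules apply.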

LMK if this helps...
-- Scott 

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org 
> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Torsten Curdt
> Sent: Tuesday, November 15, 2005 1:14 PM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] xml grammar
> 
> I have to cope with a pre-XML-standard, so I cannot use one of the 
> popular parsers. I am surprised I cannot find an ANTLR XML 
> grammar on the net.
> 
> I gave it a try and I've run into some problems with the 
> lexer, as it cannot distinguish between the literal 
> that identifies a tag name and the characters inside a tag.
> 
> From the context of the token it should be obvious what it is.
> 
> Here is what I've come up with so far. Of course PIs, comments and 
> DOCTYPE declarations are still missing...
> 
> Does anyone have some insight into how to solve that?
> 
> --------------
> 
> header {
>      package my.package;
>      }
> 
> class MyParser extends Parser;
> 
> options {
> 	k=2;
> }
> 
> parse
>    :
>    ( tag )+
>    ;
> 
> tag
>    : LT tag1:LITERAL (COLON tag2:LITERAL)? (WS)*
>      (attr1:LITERAL (COLON attr2:LITERAL)? EQ value1:QLITERAL (WS)* {  } )*
>      { System.out.println("started " + tag1 + tag2); }
>      ((SLASH GT) | (GT tagbody LT SLASH LITERAL (COLON LITERAL)? GT))
>      { System.out.println("end " + tag1 + tag2); }
>    ;
> 
> tagbody
>    : (characters)? (tag (characters)?)*
>    ;
> 
> characters
>    : text:CHARACTERS { System.out.println("characters[" + text + "]"); }
>    ;
> 
> class MyLexer extends Lexer;
> 
> options {
> 	k=2;
> 	charVocabulary='\u0000'..'\uFFFE';
> }
> 
> CHARACTERS:
>    (~('<'))+
>    ;
> 
> LITERAL:
>    ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')+
>    ;
> 
> QLITERAL:
>    '"'! (ESC | ~('\\'|'"'))* '"'!
>    ;
> 
> protected
> ESC:
>    '\\' ('\\'|'t'|'n'|'r'|'"') ;
> 
> 
> WS : (' '|'\t'|'\r'|'\n') ;
> 
> LT : '<' ;
> GT : '>' ;
> EQ : '=' ;
> COLON : ':' ;
> SLASH : '/' ;
> 
>