[antlr-interest] xml grammar

Torsten Curdt tcurdt at vafer.org
Tue Nov 15 10:13:57 PST 2005


I have to cope with a format that predates the XML standard,
so I cannot use one of the popular parsers. I am surprised
that I cannot find an ANTLR XML grammar on the net.

I gave it a try and ran into a problem with the lexer: it
cannot distinguish between a literal that identifies a tag
name and the character data in a tag's body.

From the context of the token it should be obvious which
one it is.

Here is what I've come up with so far. Of course PIs, comments
and DOCTYPE declarations are still missing...

Does anyone have some insight on how to solve that?

--------------

header {
     package my.package;
     }

class MyParser extends Parser;

options {
	k=2;
}

parse
   :
   ( tag )+
   ;

tag
   : LT tag1:LITERAL (COLON tag2:LITERAL)? (WS)*
     (attr1:LITERAL (COLON attr2:LITERAL)? EQ value1:QLITERAL (WS)* {  } )*
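     // Labels are Tokens: print getText(); tag2 stays null when there is no ':' part.
     { System.out.println("started " + tag1.getText() + (tag2 != null ? tag2.getText() : "")); }
     ((SLASH GT) | (GT tagbody LT SLASH LITERAL (COLON LITERAL)? GT))
     { System.out.println("end " + tag1.getText() + (tag2 != null ? tag2.getText() : "")); }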
   ;

tagbody
   : (characters)? (tag (characters)?)*
   ;

characters
   : text:CHARACTERS { System.out.println("characters[" + text.getText() + "]"); }
   ;

class MyLexer extends Lexer;

options {
	k=2;
	charVocabulary='\u0000'..'\uFFFE';
}

CHARACTERS:
   (~('<'))+
   ;

LITERAL:
   ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')+
   ;

QLITERAL:
   '"'! (ESC | ~('\\'|'"'))* '"'!
   ;

protected
ESC:
   '\\' ('\\'|'t'|'n'|'r'|'"') ;


WS : (' '|'\t'|'\r'|'\n') ;

LT : '<' ;
GT : '>' ;
EQ : '=' ;
COLON : ':' ;
SLASH : '/' ;
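
--------------

To make "from the context" concrete, here is a minimal sketch in
plain Java (hand-written, not ANTLR output) of a tokenizer that
keeps one flag for whether it is currently between '<' and '>'.
The names ContextLexerSketch, tokenize and isNameChar are made up
for illustration; the token names are the ones from the grammar
above. Unlike the grammar, the sketch drops whitespace inside tags
instead of emitting WS. It only shows the behaviour wanted from
the lexer and is not a claim about how ANTLR should be configured.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a hand-written tokenizer, not generated by ANTLR.
class ContextLexerSketch {

    // Same character set as the LITERAL rule in the grammar.
    static boolean isNameChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == '_' || c == '-';
    }

    // Returns token descriptions such as "LT" or "LITERAL(a)".
    static List<String> tokenize(String in) {
        List<String> out = new ArrayList<>();
        boolean inTag = false; // true between '<' and the matching '>'
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (!inTag) {
                if (c == '<') {
                    out.add("LT");
                    inTag = true;
                    i++;
                } else {
                    // Outside a tag everything up to the next '<' is character data.
                    int start = i;
                    while (i < in.length() && in.charAt(i) != '<') i++;
                    out.add("CHARACTERS(" + in.substring(start, i) + ")");
                }
            } else if (c == '>') {
                out.add("GT");
                inTag = false;
                i++;
            } else if (c == '/') {
                out.add("SLASH");
                i++;
            } else if (c == '=') {
                out.add("EQ");
                i++;
            } else if (c == ':') {
                out.add("COLON");
                i++;
            } else if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                i++; // assumption: whitespace inside a tag is skipped, not emitted
            } else if (c == '"') {
                int start = ++i; // first character after the opening quote
                while (i < in.length() && in.charAt(i) != '"') {
                    if (in.charAt(i) == '\\') i++; // step over the escaped character
                    i++;
                }
                if (i >= in.length()) {
                    throw new IllegalArgumentException("unterminated quoted literal");
                }
                out.add("QLITERAL(" + in.substring(start, i) + ")");
                i++; // closing quote
            } else if (isNameChar(c)) {
                // Inside a tag a run of name characters is a LITERAL.
                int start = i;
                while (i < in.length() && isNameChar(in.charAt(i))) i++;
                out.add("LITERAL(" + in.substring(start, i) + ")");
            } else {
                throw new IllegalArgumentException("unexpected character '" + c + "' at " + i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<a:b x=\"1\">text<c/></a:b>"));
    }
}
```

With the flag in place, the same run of letters comes out as
LITERAL when it follows '<' and as CHARACTERS when it follows '>',
which is the distinction the generated lexer above cannot make.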
