[antlr-interest] xml grammar
Oliver Zeigermann
oliver.zeigermann at gmail.com
Tue Nov 15 10:51:16 PST 2005
Hi Torsten!
You will need something like lexer modes, which you can simulate using
semantic predicates. In one mode (text mode) you match character data;
in the other (tag mode) you match names and all those special
characters. Switch to tag mode upon seeing LT and switch back to text
mode upon GT.
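
To illustrate the idea only, here is a rough hand-written sketch in plain
Java rather than ANTLR grammar syntax. The class name ModalXmlLexer and
the tagMode flag are made up for this example; the token names follow the
grammar quoted below. It assumes whitespace inside tags can simply be
skipped and leaves out the escape handling in QLITERAL. In a grammar the
same flag would be tested by semantic predicates on the lexer rules.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written illustration of a two-mode lexer; not ANTLR-generated code. */
class ModalXmlLexer {
    enum Type { LT, GT, EQ, COLON, SLASH, LITERAL, QLITERAL, CHARACTERS }

    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type + "(" + text + ")"; }
    }

    static boolean isNameChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == '_' || c == '-';
    }

    static List<Token> tokenize(String in) {
        List<Token> out = new ArrayList<>();
        boolean tagMode = false; // the flag a semantic predicate would test
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (!tagMode) {
                if (c == '<') {
                    // LT switches the lexer into tag mode
                    out.add(new Token(Type.LT, "<"));
                    tagMode = true;
                    i++;
                } else {
                    // text mode: everything up to the next '<' is one token
                    int start = i;
                    while (i < in.length() && in.charAt(i) != '<') i++;
                    out.add(new Token(Type.CHARACTERS, in.substring(start, i)));
                }
                continue;
            }
            // tag mode: names, quoted values and punctuation
            if (c == '>') {
                // GT switches back to text mode
                out.add(new Token(Type.GT, ">"));
                tagMode = false;
                i++;
            } else if (c == '=') {
                out.add(new Token(Type.EQ, "="));
                i++;
            } else if (c == ':') {
                out.add(new Token(Type.COLON, ":"));
                i++;
            } else if (c == '/') {
                out.add(new Token(Type.SLASH, "/"));
                i++;
            } else if (Character.isWhitespace(c)) {
                i++; // assumption: whitespace inside a tag is skipped
            } else if (c == '"') {
                // simplified: no escape sequences inside the quotes
                int end = in.indexOf('"', i + 1);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated quote at " + i);
                }
                out.add(new Token(Type.QLITERAL, in.substring(i + 1, end)));
                i = end + 1;
            } else if (isNameChar(c)) {
                int start = i;
                while (i < in.length() && isNameChar(in.charAt(i))) i++;
                out.add(new Token(Type.LITERAL, in.substring(start, i)));
            } else {
                throw new IllegalArgumentException(
                    "unexpected character '" + c + "' at " + i);
            }
        }
        return out;
    }
}
```

With this sketch, ModalXmlLexer.tokenize("<a>hi a</a>") treats the "a"
after LT as a LITERAL, while "hi a" between GT and the next LT comes out
as a single CHARACTERS token, because the two rules are never active in
the same mode.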
Oliver
P.S.: There actually is an existing XML grammar in the examples that
come with ANTLR. It is a lexer-only solution, though.
2005/11/15, Torsten Curdt <tcurdt at vafer.org>:
> I have to cope with a pre-XML-standard, so I cannot use one
> of the popular parsers. I am surprised I cannot find an ANTLR
> XML grammar on the net.
>
> I gave it a try and ran into some problems with the
> lexer, as it cannot distinguish between the literal
> that identifies a tag name and the characters inside a tag.
>
> From the context of the token it should be obvious what it
> is.
>
> Here is what I've come up with so far. Of course PIs, comments
> and DOCTYPE declarations are still missing...
>
> Does anyone have some insights on how to solve that?
>
> --------------
>
> header {
> package my.package;
> }
>
> class MyParser extends Parser;
>
> options {
> k=2;
> }
>
> parse
> :
> ( tag )+
> ;
>
> tag
> : LT tag1:LITERAL (COLON tag2:LITERAL)? (WS)*
> (attr1:LITERAL (COLON attr2:LITERAL)? EQ value1:QLITERAL (WS)*
> { } )*
> { System.out.println("started " + tag1 + tag2); }
> ((SLASH GT) | (GT tagbody LT SLASH LITERAL (COLON LITERAL)? GT))
> { System.out.println("end " + tag1 + tag2); }
> ;
>
> tagbody
> : (characters)? (tag (characters)?)*
> ;
>
> characters
> : text:CHARACTERS { System.out.println("characters[" + text + "]"); }
> ;
>
> class MyLexer extends Lexer;
>
> options {
> k=2;
> charVocabulary='\u0000'..'\uFFFE';
> }
>
> CHARACTERS:
> (~('<'))+
> ;
>
> LITERAL:
> ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')+
> ;
>
> QLITERAL:
> '"'! (ESC | ~('\\'|'"'))* '"'!
> ;
>
> protected
> ESC:
> '\\' ('\\'|'t'|'n'|'r'|'"') ;
>
>
> WS : (' '|'\t'|'\r'|'\n') ;
>
> LT : '<' ;
> GT : '>' ;
> EQ : '=' ;
> COLON : ':' ;
> SLASH : '/' ;
>
>
>
>