[antlr-interest] Parsing XML

Lucas Ontivero lucasontivero at hotmail.com
Thu Aug 28 13:34:07 PDT 2008


Hi all,
I am building an article processor that loads technical articles from .txt files and converts them to HTML, DOC, etc. These articles contain tags such as [link][/link] and [strong][/strong]. The format is very similar to XML, so I am reusing the grammar from "Parsing XML" (http://www.antlr.org/wiki/display/ANTLR3/Parsing+XML).

The problem is that the generated ArticleProcessorLexer.cs is very large (2.08 MB). My project requires high performance because the articles can be large and my component is part of a web application that may handle several requests at the same time. I need a better way to write the PCDATA rule ( PCDATA : {!tagMode}?=> (~'[')+ ; ).

I am a newbie with ANTLR, so maybe I am confused, but is my grammar OK?

Thank you.

/* Begin Grammar ---------------------------------------------------------------------------------------------------------------------------------------------------------------/
grammar ArticleProcessor;

options{
    language=CSharp;
    output = AST;
    ASTLabelType = CommonTree;
}

@header {
using System.Collections;
}

@lexer::namespace  { ArticleProcessor.Lexer  }
@parser::namespace { ArticleProcessor.Parser }

@lexer::members  { bool tagMode = false;  }

article   :    element   |  EOF    ;

element
    : TAG_START_OPEN NAME (NAME ATTR_EQ ATTRVALUE)*  TAG_CLOSE
        (element
        | PCDATA
        )*
        TAG_END_OPEN NAME TAG_CLOSE
    ;

TAG_START_OPEN : '[' { tagMode = true; } ;
TAG_END_OPEN : '[/' { tagMode = true; } ;
TAG_CLOSE : {tagMode}?=> ']' { tagMode = false; } ;

PCDATA : {!tagMode}?=> (~'[')+ ;

NAME : {tagMode}?=> ( LETTER | '_' | ':') (NAMECHAR)* ;

ATTR_EQ : { tagMode }?=> '=' ;

ATTRVALUE : { tagMode }?=>
        ( '"' (~'"')* '"'
        | '\'' (~'\'')* '\''
        )
    ;


fragment NAMECHAR    : LETTER | DIGIT | '.' | '-' | '_' | ':'    ;

fragment DIGIT    :    '0'..'9'    ;
    
fragment LC    :    'a'..'z'    ;

fragment UC    :    'A'..'Z'    ;

fragment LETTER    : LC|UC    ;


WS  :  {tagMode}?=> (' '|'\r'|'\t'|'\u000C'|'\n')+ {$channel=HIDDEN;}    ;
 
/* End Grammar
---------------------------------------------------------------------------------------------------------------------------------------------------------------/
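
For readers who want to see what the tagMode predicates are doing, here is a minimal hand-written sketch of the same switching logic. It is written in Python only for brevity and is not the ANTLR-generated C# lexer. The token names mirror the grammar above, while the function name `tokenize` and the error handling are assumptions made for illustration.

```python
import re

# Patterns mirroring the lexer rules in the grammar above.
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")   # NAME / NAMECHAR
ATTRVALUE_RE = re.compile(r'"[^"]*"|\'[^\']*\'')       # ATTRVALUE
WS_RE = re.compile(r"[ \r\t\f\n]+")                    # WS (hidden channel)
PCDATA_RE = re.compile(r"[^\[]+")                      # PCDATA: (~'[')+


def tokenize(text):
    """Return a list of (token_type, token_text) pairs (hypothetical helper)."""
    tokens = []
    pos = 0
    tag_mode = False  # plays the role of the lexer member `tagMode`
    while pos < len(text):
        # TAG_END_OPEN and TAG_START_OPEN are not gated: '[' always opens a tag.
        if text.startswith("[/", pos):
            tokens.append(("TAG_END_OPEN", "[/"))
            pos += 2
            tag_mode = True
            continue
        if text[pos] == "[":
            tokens.append(("TAG_START_OPEN", "["))
            pos += 1
            tag_mode = True
            continue
        if not tag_mode:
            # Outside a tag, everything up to the next '[' is PCDATA.
            m = PCDATA_RE.match(text, pos)
            tokens.append(("PCDATA", m.group()))
            pos = m.end()
            continue
        # Inside a tag: only the rules gated by {tagMode}?=> can match.
        if text[pos] == "]":
            tokens.append(("TAG_CLOSE", "]"))
            pos += 1
            tag_mode = False
            continue
        if text[pos] == "=":
            tokens.append(("ATTR_EQ", "="))
            pos += 1
            continue
        m = WS_RE.match(text, pos)
        if m:
            pos = m.end()  # whitespace is skipped, like $channel=HIDDEN
            continue
        m = NAME_RE.match(text, pos)
        if m:
            tokens.append(("NAME", m.group()))
            pos = m.end()
            continue
        m = ATTRVALUE_RE.match(text, pos)
        if m:
            tokens.append(("ATTRVALUE", m.group()))
            pos = m.end()
            continue
        raise ValueError("unexpected character %r at offset %d" % (text[pos], pos))
    return tokens
```

Calling `tokenize('[link href="x"]go[/link]')` yields the tag tokens, then one PCDATA token for `go`, then the closing tag tokens. A `]` outside a tag stays inside PCDATA because only `[` ends a text run.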




