[antlr-interest] about antlr3 lexer output file size(C target)
David-Sarah Hopwood
david-sarah at jacaranda.org
Wed Dec 2 20:18:17 PST 2009
miao wrote:
> Hi all,
> I want to build an HTML parser (and lexer) with ANTLR.
> My .g file is 47 KB. With ANTLR 3.1.3 the generated lexer (C target) is
> about 7 MB, but with ANTLR 3.2 the generated lexer file grows to
> 35 MB, although the parser shrinks from 670 KB to 640 KB.
>
> Is this a bug in 3.2?
> Why does it happen, and how can I reduce the file size?
> Does my .g file perhaps have too many tokens?
> Or is it because I use fragments and predicates?
>
> Some of the lexer code is below:
> ===========================
> // only accept "HTML" between '<' and '>', but not between '"' and '"'
> HTML :{true==ctx->m_bTagMode&&false==ctx->m_bStringMode}?=>(H T M L)
> ;
> // lex ANSI, Unicode, and UTF-8 input
> fragment A
> :{ctx->m_eEncodingType==ET_UNICODE_LITTLE}?=>('a'|'A')('\u0000')
> |{ctx->m_eEncodingType==ET_UNICODE_BIG}?=>('\u0000')('a'|'A')
> |('a'|'A')
> ;
> ===========================
7 MB -> 35 MB sounds like a pretty serious regression. That said,
I think that the way you're handling encodings here is definitely
suboptimal. You're almost certainly better off converting the
input to a fixed encoding before lexing it. For the Java target that
would be UTF-16; I don't know what would be best for C.
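
A minimal sketch of that pre-conversion step in C, under some loud
assumptions: ET_UNICODE_LITTLE and ET_UNICODE_BIG are the names from the
grammar above, while ET_ANSI, the EncodingType typedef and
normalize_to_utf16() are hypothetical names made up for illustration and
are not part of the ANTLR C runtime. It treats "ANSI" as Latin-1, does
not decode UTF-8, and leaves out how the resulting buffer is handed to
the lexer's input stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ET_UNICODE_LITTLE / ET_UNICODE_BIG appear in the grammar above;
   ET_ANSI is an assumed name for the single-byte case. */
typedef enum { ET_ANSI, ET_UNICODE_LITTLE, ET_UNICODE_BIG } EncodingType;

/* Hypothetical helper: convert raw input bytes into host-order 16-bit
   code units so the lexer only ever sees one encoding.
   Returns the number of code units written, or 0 if out is too small
   (or the input is empty). A trailing odd byte in 16-bit input is
   ignored. */
size_t normalize_to_utf16(const uint8_t *in, size_t in_len,
                          EncodingType enc,
                          uint16_t *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);

    size_t n = (enc == ET_ANSI) ? in_len : in_len / 2;
    if (n > out_cap) {
        return 0;
    }

    for (size_t i = 0; i < n; i++) {
        switch (enc) {
        case ET_UNICODE_LITTLE:
            /* low byte first */
            out[i] = (uint16_t)(in[2 * i] | (in[2 * i + 1] << 8));
            break;
        case ET_UNICODE_BIG:
            /* high byte first */
            out[i] = (uint16_t)((in[2 * i] << 8) | in[2 * i + 1]);
            break;
        default:
            /* ET_ANSI: widen each byte, treating it as Latin-1 */
            out[i] = in[i];
            break;
        }
    }
    return n;
}
```

With the input normalized like this before lexing, a fragment such as A
would presumably only need its last alternative, ('a'|'A'), with no
encoding predicates on any letter.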
--
David-Sarah Hopwood ⚥ http://davidsarah.livejournal.com