[antlr-interest] about antlr3 lexer output file size(C target)

David-Sarah Hopwood david-sarah at jacaranda.org
Wed Dec 2 20:18:17 PST 2009


miao wrote:
> Hi all,
> I want to write an HTML parser (and lexer) with ANTLR.
> My .g file is 47 KB. With ANTLR 3.1.3 the generated lexer (C target) is
> about 7 MB, but with ANTLR 3.2 the generated lexer file grows to
> 35 MB, although the parser shrinks from 670 KB to 640 KB.
> 
> Is this a bug in 3.2?
> Why does it happen, and how can I reduce the file size?
> Does my .g file have too many tokens?
> Or is it because I use fragments and predicates?
> 
> Some of the lexer code is below:
> ===========================
> // Match "HTML" only inside a tag (between '<' and '>'), not inside a quoted string
> HTML		:{true==ctx->m_bTagMode&&false==ctx->m_bStringMode}?=>(H T M L)
> 		;
> // Lex ANSI, Unicode (little- and big-endian), and UTF-8 input
> fragment	A
> 			:{ctx->m_eEncodingType==ET_UNICODE_LITTLE}?=>('a'|'A')('\u0000')
> 			|{ctx->m_eEncodingType==ET_UNICODE_BIG}?=>('\u0000')('a'|'A')
> 			|('a'|'A')
> 			;
> ===========================

7 MB -> 35 MB sounds like a pretty serious regression. That said,
I think that the way you're handling encodings here is definitely
suboptimal. You're almost certainly better off converting the
input to a fixed encoding before lexing it. For the Java target that
would be UTF-16; I don't know what would be best for C.
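
As a rough illustration of converting before lexing, here is a minimal
sketch in C, assuming 16-bit code units in host byte order are an
acceptable fixed encoding for the C target as well (the reply above does
not settle that). ET_UNICODE_LITTLE and ET_UNICODE_BIG are taken from the
quoted grammar; ET_ANSI and the helper to_utf16_units are hypothetical
names introduced only for this example, and UTF-8 decoding is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encoding tags. ET_UNICODE_LITTLE and ET_UNICODE_BIG appear in the quoted
 * grammar; ET_ANSI is an assumed name for the single-byte case. */
typedef enum { ET_ANSI, ET_UNICODE_LITTLE, ET_UNICODE_BIG } EncodingType;

/* Hypothetical helper: widen or byte-swap `in` into host-order 16-bit units.
 * Returns the number of units written, or (size_t)-1 if `out` is too small
 * or a two-byte input has an odd length.
 * Single-byte input is widened byte by byte, so bytes >= 0x80 are treated
 * as Latin-1; multi-byte UTF-8 sequences are NOT decoded here. */
size_t to_utf16_units(const unsigned char *in, size_t in_len,
                      EncodingType enc, uint16_t *out, size_t out_cap)
{
    size_t n, i;

    assert(in != NULL && out != NULL);

    if (enc == ET_ANSI) {
        n = in_len;                 /* one byte per unit */
    } else {
        if (in_len % 2 != 0)
            return (size_t)-1;      /* truncated two-byte input */
        n = in_len / 2;             /* two bytes per unit */
    }
    if (n > out_cap)
        return (size_t)-1;

    for (i = 0; i < n; i++) {
        switch (enc) {
        case ET_ANSI:
            out[i] = in[i];
            break;
        case ET_UNICODE_LITTLE:     /* low byte first */
            out[i] = (uint16_t)(in[2 * i] | (in[2 * i + 1] << 8));
            break;
        case ET_UNICODE_BIG:        /* high byte first */
            out[i] = (uint16_t)((in[2 * i] << 8) | in[2 * i + 1]);
            break;
        }
    }
    return n;
}
```

With the input normalized this way, a letter fragment would no longer
need the encoding predicates and could shrink to something like
fragment A : 'a' | 'A' ; which should leave the lexer far fewer
predicated alternatives to expand.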

-- 
David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com
