[antlr-interest] Fwd: Sparql Grammar & Huge C Files

Sat Aug 20 09:03:58 PDT 2011

Begin forwarded message:

> From: Todor Dimitrov <todor.dimitrov at stud.uni-due.de>
> Subject: Re: [antlr-interest] Sparql Grammar & Huge C Files
> Date: August 20, 2011 5:52:33 PM GMT+02:00
> To: Jim Idle <jimi at temporal-wave.com>
> 
> Hi Jim,
> 
> this is an open source grammar for the Sparql language that I did not develop myself. I ran the ANTLR tool like this:
> 
> java -Xms1024m -Xmx1024m -cp antlr-3.4-complete.jar org.antlr.Tool Sparql.g
> 
> No warnings were printed, and looking at the ANTLR tool options, I don't see any switches that would enable or disable warning generation. I'm not using the SETTEXT macro and I'm not sure where I would use it. Are there any examples of it? In addition, the Sparql grammar contains only rewrite rules, so I'm not sure whether I need the SETTEXT macro at all. I've attached the grammar file for reference.
> 
> Todor
> 
> 
> On Aug 20, 2011, at 5:36 PM, Jim Idle wrote:
> 
>> The huge file size occurs because your lexer/parser is probably trying to
>> do too much or asking ANTLR to do lots of disambiguation and the complex
>> overlaps are generating huge tables. In the case of the parser, I suspect
>> that you need some single token predicates to help with keyword
>> disambiguation; have you removed ALL the warnings that ANTLR generates on
>> your grammar? If you do not remove all the warnings then this sort of
>> thing happens a lot, especially with a language as terrible as the one
>> SQL has morphed into.
>> 
>> The code only LOOKS small in Java because the generated Java uses
>> run-length-encoded strings for the table values, which it must expand at
>> runtime. The C target lays down the exact same tables, but as static data,
>> so that they are set up at compile time. Java is unable to use compile time
>> initialized tables like this until JDK 1.7, so the Java target must jump
>> through hoops to generate the tables. So in fact generating the C is a
>> better indicator of how efficient your grammar is. You can probably trace
>> the table sizes down to a few key decisions.
>> 
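The difference between the two targets described above can be sketched in plain C. This is a minimal illustration, not ANTLR output: the table contents, the (count, value) pair encoding, and the names `packed`, `unpacked_static`, `rle_expand`, and `expansion_matches_static` are all assumptions made for the example. It shows a small run-length-encoded table being expanded at run time (roughly what the Java target has to do) next to the same table written out as a static initializer (roughly what the C target emits).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical run-length-encoded table: (count, value) pairs.
 * Counts are assumed to be positive. */
static const short packed[] = { 3, -1,  2, 7,  4, 0 };
#define PACKED_PAIRS (sizeof packed / sizeof packed[0] / 2)

/* The same nine entries written out in full as a static initializer. */
static const short unpacked_static[] = { -1, -1, -1, 7, 7, 0, 0, 0, 0 };
#define UNPACKED_LEN (sizeof unpacked_static / sizeof unpacked_static[0])

/* Expand (count, value) pairs into a newly allocated array.
 * Returns NULL on allocation failure; the caller frees the result. */
short *rle_expand(const short *pairs, size_t npairs, size_t *out_len)
{
    assert(pairs != NULL && out_len != NULL);

    /* First pass: add up the run lengths to size the output. */
    size_t total = 0;
    for (size_t i = 0; i < npairs; i++)
        total += (size_t)pairs[2 * i];

    short *out = malloc((total ? total : 1) * sizeof *out);
    if (out == NULL)
        return NULL;

    /* Second pass: write each value `count` times. */
    size_t k = 0;
    for (size_t i = 0; i < npairs; i++)
        for (short n = 0; n < pairs[2 * i]; n++)
            out[k++] = pairs[2 * i + 1];

    *out_len = total;
    return out;
}

/* Returns 1 if expanding `packed` reproduces `unpacked_static`, else 0. */
int expansion_matches_static(void)
{
    size_t len = 0;
    short *expanded = rle_expand(packed, PACKED_PAIRS, &len);
    if (expanded == NULL)
        return 0;

    int same = (len == UNPACKED_LEN);
    for (size_t i = 0; same && i < len; i++)
        same = (expanded[i] == unpacked_static[i]);

    free(expanded);
    return same;
}
```

Here the packed form takes six array entries where the static form takes nine; with the long runs typical of large transition tables the gap in source size grows accordingly, while the table held in memory at run time is the same.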
>> Your set text errors are likely because you are not using the SETTEXT
>> macro correctly in some way. Also, I would avoid doing that at lex time and
>> do any manipulation only when you actually use the token in question. I
>> can't help unless I see the lexer code in question though.
>> 
>> Use the 3.4 beta C runtime - there is no difference in the release version
>> except for the API documentation that I keep trying to finish but my boat
>> keeps winking at me and making me go on the river.
>> 
>> 
>> Jim
>> 
>> 
>> 
>>> -----Original Message-----
>>> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
>>> bounces at antlr.org] On Behalf Of Todor Dimitrov
>>> Sent: Saturday, August 20, 2011 7:39 AM
>>> To: antlr-interest at antlr.org
>>> Subject: [antlr-interest] Sparql Grammar & Huge C Files
>>> 
>>> Dear *,
>>> 
>>> generating the C lexer and parser for the Sparql grammar using the
>>> options below produces huge files:
>>> 
>>> options {
>>> 	language = C;
>>> 	output = AST;
>>> 	ASTLabelType = pANTLR3_BASE_TREE;
>>> }
>>> 
>>> 2.4K Sparql.tokens
>>> 85M SparqlLexer.c <---
>>> 30K SparqlLexer.h
>>> 1.5M SparqlParser.c <---
>>> 69K SparqlParser.h
>>> 
>>> In addition, the files cannot be compiled as it seems that the
>>> generators have not been updated to reflect the API changes in the
>>> latest C runtime (or maybe it is the other way round :)). In
>>> particular, I see errors like these:
>>> 
>>> SparqlLexer.c:1214276:48: error: member reference type 'pANTLR3_STRING'
>>> (aka 'struct ANTLR3_STRING_struct *') is a
>>>     pointer; maybe you meant to use '->'?
>>>                    setText(LEXER->getText(LEXER).substring(1, LEXER->getText(LEXER).length()-1));
>>>                            ~~~~~~~~~~~~~~~~~~~~~^
>>>                                                 ->
>>> SparqlLexer.c:1214276:49: error: no member named 'substring' in 'struct
>>> ANTLR3_STRING_struct'; did you mean 'subString'?
>>>                    setText(LEXER->getText(LEXER).substring(1, LEXER->getText(LEXER).length()-1));
>>>                                                  ^~~~~~~~~
>>>                                                  subString
>>> ./antlr3string.h:179:8: note: 'subString' declared here
>>>                                       (*subString)    (struct ANTLR3_STRING_struct * string, ANTLR3_UINT32 ...
>>>                                         ^
>>> SparqlLexer.c:1214276:83: error: member reference type 'pANTLR3_STRING'
>>> (aka 'struct ANTLR3_STRING_struct *') is a
>>>     pointer; maybe you meant to use '->'?
>>>                    setText(LEXER->getText(LEXER).substring(1, LEXER->getText(LEXER).length()-1));
>>>                                                               ~~~~~~~~~~~~~~~~~~~~~^
>>>                                                                                    ->
>>> SparqlLexer.c:1214276:84: error: no member named 'length' in 'struct
>>> ANTLR3_STRING_struct'
>>>                    setText(LEXER->getText(LEXER).substring(1, LEXER->getText(LEXER).length()-1));
>>> 
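The action that fails to compile uses Java-style string methods (`substring`, `length()`), which the compiler reports as not being members of the C runtime's string struct. Its apparent intent is to drop the first and last character of the token text, for example the delimiters around a quoted token. The sketch below shows that operation in plain standard C, without any ANTLR runtime calls. `strip_delimiters` is a hypothetical helper name and not part of the ANTLR API, and how the result would be handed back to the lexer is deliberately left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated copy of `text` without its
 * first and last character. Returns NULL if `text` has fewer than two
 * characters or if allocation fails. The caller frees the result. */
char *strip_delimiters(const char *text)
{
    assert(text != NULL);

    size_t len = strlen(text);
    if (len < 2)
        return NULL;

    /* len - 2 payload characters plus one byte for the terminator. */
    char *out = malloc(len - 1);
    if (out == NULL)
        return NULL;

    memcpy(out, text + 1, len - 2);
    out[len - 2] = '\0';
    return out;
}
```

For an input such as `"abc"` including the quote characters, the helper returns `abc`; an input of just two delimiter characters yields an empty string.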
>>> 
>>> I'm using antlr 3.4, but I have also tested this with antlr 3.3.
>>> Generating the Java lexer and parser works as expected and the files
>>> are much smaller:
>>> 
>>> 2.4K Sparql.tokens
>>> 582K SparqlLexer.java
>>> 876K SparqlParser.java
>>> 
>>> Any suggestions and help are highly appreciated.
>>> 
>>> Thanks in advance,
>>> 
>>> Todor
>>> 
>>> 
>>> 
>>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>> 
>> 
>