[antlr-interest] Memory requirements of C runtime when backtracking

Jim Idle jimi at temporal-wave.com
Fri May 6 11:33:52 PDT 2011


I do it like this:


id
	: ID
	| keyword { set token type to id }
;

keyword
 : list them all in order and declare them all together in the lexer
;

But there is no way around the stupidity of SQL I am afraid. Don't put the
keywords in as 'WORD' as then they will not be declared in a contiguous
block and you will have problems identifying them in the error messages.

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Ivan Brezina
> Sent: Friday, May 06, 2011 8:10 AM
> To: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Memory requirements of C runtime when
> backtracking
>
> You're right the grammar has very large DFAs. That was the second
> problem I'm facing. In general Oracle's SQL distinguishes reserved
> words and keywords.
>
> In order to satisfy backward compatibility many newly added
> words(keywords) have special meaning and but also can be used as
> identifiers.
> for example such a query is legal:
> Select * from join JOIN join using(join);
>
> The uppercase JOIN is a keyword while other "join"s are identifiers.
> In my grammar the rule identifier is defined as identifier
>    : ID //(lexer rule for an identifier)
>    | 'A' //(1st keyword)
>    | 'AS' //(2dn keyword)
>    ...
>    | '' // (200th keyword)
>    ;
>
> I'm afraid that this is one the reasons why the tables are so huge.
> Are there any ways how to deal with something like that?
>
> The previous version of this grammar used gated semantic predicates
> like:
>
> k_join : { LA(1).gettext() == "JOIN" }? => ID
>
> is that the right way to reduce the size of DFA?
>
> Ivan
>
> Quoting Jim Idle <jimi at temporal-wave.com>:
>
> > You should not be using backtrack=true if you are short on memory,
> but
> > without seeing your grammar I cannot comment on the ram usage. It
> > might be that your grammar causes the generation of huge DFA tables.
> > Backtracking itself does not cost lots of memory though.
> >
> > jim
> >
> >> -----Original Message-----
> >> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> >> bounces at antlr.org] On Behalf Of Ivan Brezina
> >> Sent: Wednesday, May 04, 2011 5:20 AM
> >> To: antlr-interest Interest
> >> Subject: [antlr-interest] Memory requirements of C runtime when
> >> backtracking
> >>
> >> Hi all
> >> doing some unit test of Oracle SQL grammar I'm facing problems with
> >> memory requirements when parsing input having many parenthesis.
> >>
> >> In general the grammar can have three types of statements which can
> >> be enclosed in parens.
> >>
> >> 1. value expression like (A),(1+2).
> >> For an sql_expression there is set of rules using numerical operator
> >> sql_expression -> expr_add -> expr_mul -> expr_sign -> expr_pow ->
> >> expr_paren
> >>
> >> 2. logical expression like (A > 2) or (A is null) For an condition
> >> expression there is a set of rules using locical operator
> >> sql_condition
> >> -> condition_or -> condition_and -> condition_not -> condition_paren
> >>
> >> 3. a subquery. It is an sql statement enclosed in parenthesis like
> >> (SELECT * FROM dual).
> >>
> >> These 3 types can by combined in many ways.
> >> For example ((SELECT count(1) from dual) > 1)
> >>
> >> While testing if found queries - probably generated by some evil
> sick
> >> robot - whose require more than 8B of RAM to parse even they are
> >> quite short
> >>
> >> For example:
> >> SELECT *
> >> FROM TABLE1, TABLE2
> >> WHERE
> >> (
> >>   (
> >>    (
> >>     (
> >>      (
> >>       (
> >>        (
> >>         (
> >>          (
> >>   	 (
> >>   	  (
> >>   	   (
> >>   	    (
> >>   	     (
> >>   	      (
> >> 	       ( TABLE2.DT = '2' ) OR ( TABLE2.DT = '3' )
> >>    	      ) AND ( TABLE2.CODE < 9 )
> >>    	     ) AND ( TABLE2.WH = 'XXX' )
> >>    	    ) AND ( TABLE1.ID = '0000001' )
> >>    	   ) AND ( ( TABLE1.ATTR_1 IS NULL ) OR ( TABLE1._ATTR1 =
'*'
> ) )
> >>    	  ) AND ( ( TABLE1.ATTR_2 IS NULL ) OR ( TABLE1._ATTR2 =
'*'
> ) )
> >>    	 ) AND ( ( TABLE1.ATTR_3 IS NULL ) OR ( TABLE1._ATTR3 =
'*' )
> )
> >>       	) AND ( ( TABLE1.ATTR_4 IS NULL ) OR ( TABLE1._ATTR4 = '*'
> > )
> >> )
> >>         ) AND ( ( TABLE1.ATTR_5 IS NULL ) OR ( TABLE1._ATTR5 = '*' )
> )
> >>        ) AND ( ( TABLE2.TYPE IS NULL ) OR ( TABLE2.TYPE = '*' ) )
> >>       ) AND ( ( TABLE2.NBR IS NULL ) OR ( TABLE2.NBR = '*' ) )
> >>      ) AND ( ( TABLE2.STAT = '01' ) OR ( TABLE2.STAT = '*' ) )
> >>     ) AND ( ( TABLE2.ORGN IS NULL ) OR ( TABLE2.ORGN = '*' ) )
> >>    ) AND ( TABLE2.NBR = '00000000' )
> >>   ) AND ( TABLE2.PO IS NULL )
> >> )
> >>
> >> Both value expression and condition expression rules do
> backtracking.
> >> In the example above every parenthesis nesting doubles memory
> >> requirements.
> >>
> >> Are there any ways how to reduce/monitor memory requirements?
> >> What exactly is remembered when backtracking?
> >> I tried to add some syntactic predicates into value/conditional
> >> expression but it usually led to failure of other tests.
> >>
> >> thx Ivan
> >>
> >>
> >>
> >>
> >> ----------------------------------------------------------------
> >> This message was sent using IMP, the Internet Messaging Program.
> >>
> >>
> >> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> >> Unsubscribe:
> >> http://www.antlr.org/mailman/options/antlr-interest/your-
> >> email-address
> >
> > List: http://www.antlr.org/mailman/listinfo/antlr-interest
> > Unsubscribe:
> > http://www.antlr.org/mailman/options/antlr-interest/your-email-
> address
> >
>
>
>
> ----------------------------------------------------------------
> This message was sent using IMP, the Internet Messaging Program.


More information about the antlr-interest mailing list