[antlr-interest] java : code too large for try statement => use another language?

Jim Idle jimi at temporal-wave.com
Fri Jul 18 10:19:27 PDT 2008


On Fri, 2008-07-18 at 12:51 -0400, Laura Adam wrote:

> Hello,
> 
> I am implementing in Anltr through Java, a program that takes a DNA sequence as input and recognizes the different parts such as promoters, linkers...
> 
> Those parts are thousands letters long.
> I used rules like :
> 
> E0020 :
>         ("atggtgagcaagggcgaggagctgttcaccggggtggtgcccatcctggtcgagctggacggcgacgtgaacggccacaagttcagcgtgtccggcgagggcgagggcgatgccacctacggcaagctgaccctgaagttcatctgcaccaccggcaagctgcccgtgccctggcccaccctcgtgaccaccctgacctggggcgtgcagtgcttcagccgctaccccgaccacatgaagcagcacgacttcttcaagtccgccatgcccgaaggctacgtccaggagcgcaccatcttcttcaaggacgacggcaactacaagacccgcgccgaggtgaagttcgagggcgacaccctggtgaaccgcatcgagctgaagggcatcgacttcaaggaggacggcaacatcctggggcacaagctggagtacaactacatcagccacaacgtctatatcaccgccgacaagcagaagaacggcatcaaggccaacttcaagatccgccacaacatcgaggacggcagcgtgcagctcgccgaccactaccagcagaacacccccatcggcgacggccccgtgctgctgcccgacaaccactacctgagcacccagtccgccctgagcaaagaccccaacgagaagcgcgatcacatggtcctgctggagttcgtgaccgccgccgggatcactctcggcatggacgagctgtacaagtaataa");
> 
> E0022 :
>         ("atggtgagcaagggcgaggagctgttcaccggggtggtgcccatcctggtcgagctggacggcgacgtgaacggccacaagttcagcgtgtccggcgagggcgagggcgatgccacctacggcaagctgaccctgaagttcatctgcaccaccggcaagctgcccgtgccctggcccaccctcgtgaccaccctgacctggggcgtgcagtgcttcagccgctaccccgaccacatgaagcagcacgacttcttcaagtccgccatgcccgaaggctacgtccaggagcgcaccatcttcttcaaggacgacggcaactacaagacccgcgccgaggtgaagttcgagggcgacaccctggtgaaccgcatcgagctgaagggcatcgacttcaaggaggacggcaacatcctggggcacaagctggagtacaactacatcagccacaacgtctatatcaccgccgacaagcagaagaacggcatcaaggccaacttcaagatccgccacaacatcgaggacggcagcgtgcagctcgccgaccactaccagcagaacacccccatcggcgacggccccgtgctgctgcccgacaaccactacctgagcacccagtccgccctgagcaaagaccccaacgagaagcgcgatcacatggtcctgctggagttcgtgaccgccgccgggatcactctcggcatggacgagctgtacaagaggcctgctgcaaacgacgaaaactacgctttagtagcttaataa");
> 
> >If I want a 1000 lookahead characters the file.g is ok. I have no lexical nondeterminism. But when I want to use the file.g with my test_lexer.java I have that error from java: "code too large for try statement".
> >If I put a 200 lookahead characters I have lexical nondeterminism but I can compile and run the java program (I don't get the result I want of course).
> 
> --> What can I do?
> I think I could add intermediates in my rules but it would be long (I thought to make packs of 4 or 5 letters), non general, and not friendly to user. I don't really like that solution.
> Maybe I should turn to another language such as C. Will I have the same problem? 


C will just use whatever memory it has and generally the compilers will
will compile as large a program as you give it, although it might take
some time. At a guess from what you are trying to do, ANTLR isn't the
right program, or you would need to break down the logical sequences
lexically and contruct them in a parser from all the common prefixes and
so on, but the combinatorial possibilities are probably too huge.

Basically, while the algorithms can process your very long sequences, it
will generate huge amounts of code to do this. Ironically, you might be
better with flex for the lexer, as this will just go down the listed
possibilities in order and try each one to see if it works. But  unless
you are combining these tokens in to complicated syntax then you are
probably better off using awk or possibly even sed.

Finally through you could write a program to generate the lexer rules,
factoring out all the common sequences. Left factoring will reduce the
code size a lot. IN your two example tokens above, there is a large
common start sequence. Rather than clutter the examples, lets say the
common start sequence is : atggtgag" and then it differs after that,
then you can do:

fragment E0020 : 'x' ; // Place holder
E0022 : 'atggtgag'  // Common left sequence

                (
                      'cag'  // E0022
                    | 'ggta' { $type = E0020; }
                )
;

But I still think that you will end up with something fairly unworkable
unless you really can take all the input sequences and generate this
left factoring automatically, but if you are going to do that, you might
as well write a program for the recognition perhaps. 

Jim
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20080718/391feeaa/attachment-0001.html 


More information about the antlr-interest mailing list