[antlr-interest] Parsing Large Files

Kumar, Amitesh Amitesh.Kumar at standardbank.com
Thu Apr 1 07:26:44 PDT 2010


 
Hi Jim, you're correct, I'm new to ANTLR. My CSV grammar is below.

This is what I'm running:

CharStream lex = new ANTLRFileStream("Dealsall3.csv");
DealsAll3Lexer csv3Lexer = new DealsAll3Lexer(lex);

csv3Lexer.setBacktrackingLevel(0);	
CommonTokenStream tokens = new CommonTokenStream(csv3Lexer);	
tokens.discardOffChannelTokens(true);		
DealsAll3Parser csv3Parser = new DealsAll3Parser(tokens);

csv3Parser.file();
System.out.println(csv3Parser.getNumberOfSyntaxErrors());

You're right, I could fix the above by not using the ANTLRFileStream and
just using an ANTLRStringStream, chunking the file myself outside of
ANTLR.
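
A minimal sketch of that chunking, assuming one record per physical line.
RecordChunker and its parseRecord callback are hypothetical names, not
part of ANTLR; the callback stands in for building a fresh
ANTLRStringStream, lexer, CommonTokenStream and parser for each record,
so that each record's token buffer can be garbage collected before the
next one is read.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.ToIntFunction;

// Hypothetical helper (not part of ANTLR): hands one record at a time to a
// caller-supplied parse step, so only one record's tokens are alive at once.
final class RecordChunker {

    // parseRecord is an assumed callback that parses a single record and
    // returns its syntax-error count. With the ANTLR 3 runtime it could wrap
    // something like:
    //   new ANTLRStringStream(record + "\n") -> DealsAll3Lexer
    //   -> CommonTokenStream -> DealsAll3Parser -> detail()
    //   -> getNumberOfSyntaxErrors()
    static int countErrors(BufferedReader in, ToIntFunction<String> parseRecord) {
        int totalErrors = 0;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skipping blank lines is an assumption about the data
                }
                totalErrors += parseRecord.applyAsInt(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return totalErrors;
    }
}
```

With this shape the first line of the file would presumably go to the
header rule and every later line to the detail rule, instead of calling
file() once on the whole input.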

But my general issue is that not all my data is a simple CSV file; some
files will have multi-line records. Hence I didn't want to keep a record
of the tokens.
Any ideas? By the way, thanks for your reply.
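
For the multi-line case, one possible shape is to group physical lines
into logical records before handing them to a parser. This is only a
sketch: MultiLineRecords is a hypothetical helper, and the quotesBalanced
rule (a record is complete once its double quotes pair up) is purely an
assumption, since how the multi-line records are delimited depends on
the data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical helper (not part of ANTLR): joins physical lines into
// logical records using a caller-supplied completeness rule.
final class MultiLineRecords {

    // Example rule, an assumption about the data: a record is complete
    // when it contains an even number of double quotes.
    static boolean quotesBalanced(String text) {
        int quotes = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '"') {
                quotes++;
            }
        }
        return quotes % 2 == 0;
    }

    // Reads lines, accumulates them until isComplete accepts the text,
    // then passes the whole record to sink and starts a new one.
    static void forEachRecord(BufferedReader in,
                              Predicate<String> isComplete,
                              Consumer<String> sink) {
        StringBuilder pending = new StringBuilder();
        boolean open = false; // true while a record is still being collected
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (open) {
                    pending.append('\n'); // keep the line break inside the record
                }
                pending.append(line);
                open = true;
                if (isComplete.test(pending.toString())) {
                    sink.accept(pending.toString());
                    pending.setLength(0);
                    open = false;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (open) {
            // Unterminated last record: still handed over, so that whatever
            // parses it can report the error.
            sink.accept(pending.toString());
        }
    }
}
```

Only one logical record is held in memory at a time, so each can be
parsed with its own short-lived token stream. Blank lines come through
as empty records under the example rule.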

Cheers
Kumaap0



grammar DealsAll3 ; 
    
    file        :       header ( detail )* EOF ; 
    
    SEP :       WS? ( ',') WS? ; 
    
    header : 
        'IdentID,FGamma Tot,FutDeltaTot,FutGamma Tot,Barrier2,BarrierLevel,Cmp_CP,Cmp_Delivery,Cmp_Expiry,Cmp_Strike' 
        NL 
        ; 
    
    
    detail 
        : f_IdentID 
        SEP ( f_FGamma_Tot )? 
        SEP ( f_FutDeltaTot )? 
        SEP ( f_FutGamma_Tot )? 
        SEP ( f_Barrier2 )? 
        SEP ( f_Barrier_Level )? 
        SEP ( f_Cmp_CP )? 
        SEP ( f_Cmp_Delivery )? 
        SEP ( f_Cmp_Expiry )? 
        SEP ( f_Cmp_Strike )? 
        NL ; 
    
    f_IdentID           :       NUMBER ; 
    f_FGamma_Tot        :       NUMBER ; 
    f_FutDeltaTot       :       NUMBER ; 
    f_FutGamma_Tot      :       NUMBER ; 
    f_Barrier2          :       STRING ; 
    f_Barrier_Level     :       STRING ; 
    f_Cmp_CP            :       STRING ; 
    f_Cmp_Delivery      :       STRING ; 
    f_Cmp_Expiry        :       STRING ; 
    f_Cmp_Strike        :       STRING ; 
    
    DATETIME    : DATE ( SP | 'T' ) TIME ; 
    
    DATE        : 
        (       ( ( ( '0' | '1' | '2' ) '0'..'9' ) | ( '3' ( '0' | '1' ) ) ) 
                ( '-' | '/' ) 
                (       ( '01' | '02' | '03' | '04' | '05' | '06' 
                        | '07' | '08' | '09' | '10' | '11' | '12' ) 
                |       ( 'JAN' | 'FEB' | 'MAR' | 'APR' | 'MAY' | 'JUN' 
                        | 'JUL' | 'AUG' | 'SEP' | 'OCT' | 'NOV' | 'DEC' ) 
                |       ( 'Jan' | 'Feb' | 'Mar' | 'Apr' | 'May' | 'Jun' 
                        | 'Jul' | 'Aug' | 'Sep' | 'Oct' | 'Nov' | 'Dec' ) 
                ) 
                ( '-' | '/' ) 
                ( ( '0'..'9' '0'..'9' )? '0'..'9' '0'..'9' ) 
        ) 
    |   (       ( '0'..'9' '0'..'9' '0'..'9' '0'..'9' ) 
                ( '-' | '/' ) 
                ( '01' | '02' | '03' | '04' | '05' | '06' 
                | '07' | '08' | '09' | '10' | '11' | '12' ) 
                ( '-' | '/' ) 
                ( ( ( '0' | '1' | '2' ) '0'..'9' ) | ( '3' ( '0' | '1' ) ) ) 
        ) 
        ; 
    MONTH_YEAR  : 
        (       ( '01' | '02' | '03' | '04' | '05' | '06' 
                | '07' | '08' | '09' | '10' | '11' | '12' ) 
        |       ( 'JAN' | 'FEB' | 'MAR' | 'APR' | 'MAY' | 'JUN' 
                | 'JUL' | 'AUG' | 'SEP' | 'OCT' | 'NOV' | 'DEC' ) 
        |       ( 'Jan' | 'Feb' | 'Mar' | 'Apr' | 'May' | 'Jun' 
                | 'Jul' | 'Aug' | 'Sep' | 'Oct' | 'Nov' | 'Dec' ) 
        ) 
        '-' 
        ( ( '0'..'9' '0'..'9' )? '0'..'9' '0'..'9' ) 
        ; 
    MONTH_DAY   : 
        ( ( ( '0' | '1' | '2' ) '0'..'9' ) | ( '3' ( '0' | '1' ) ) ) 
        '-' 
        (       ( '01' | '02' | '03' | '04' | '05' | '06' 
                | '07' | '08' | '09' | '10' | '11' | '12' ) 
        |       ( 'JAN' | 'FEB' | 'MAR' | 'APR' | 'MAY' | 'JUN' 
                | 'JUL' | 'AUG' | 'SEP' | 'OCT' | 'NOV' | 'DEC' ) 
        |       ( 'Jan' | 'Feb' | 'Mar' | 'Apr' | 'May' | 'Jun' 
                | 'Jul' | 'Aug' | 'Sep' | 'Oct' | 'Nov' | 'Dec' ) 
        ) 
        ; 
    
    TIME        : 
        ( ( '0'..'1' '0'..'9' ) | ('2' '0'..'4' ) ) // '00' to '24' 
        ':' 
        ( '0'..'5' '0'..'9' ) // '00' to '59' 
        ':' 
        ( '0'..'5' '0'..'9' ) // '00' to '59' 
        ( ( 'Z' // UTC 
        | ( '+' | '-' ) '00' ( (':' | ' ' ) '00' )? 
        ) ? 
        ) ; 
    
    NUMBER 
        :       ( '+' | '-' )?                      // It may be signed 
        (       ( '0'..'9' )+ '.' ( '0'..'9' )*     // Decimal point with leading and trailing digits 
        |        '.' ( '0'..'9' )+                  // or it may be just a mantissa 
        |       '0'..'9'+                           // or it may be an integer 
        ) 
        ; 
    
    STRING 
        :       ('"') VALID_CHAR+ ('"')     // Must have quotes at both ends 
        |       VALID_CHAR+                 // or no quote at either end 
        ; 
    
    fragment VALID_CHAR : 
        ( 'a'..'z' | 'A'..'Z' | '0'..'9' // the alphanumeric characters 
        |       ' '     // x20 = SPACE 
        |       '!'     // x21 = EXCLAMATION MARK 
        |       '#'     // x23 = NUMBER SIGN 
        |       '$'     // x24 = DOLLAR SIGN 
        |       '%'     // x25 = PERCENT SIGN 
        |       '&'     // x26 = AMPERSAND 
        |       '('     // x28 = LEFT PARENTHESIS 
        |       ')'     // x29 = RIGHT PARENTHESIS 
        |       '*'     // x2a = ASTERISK 
        |       '+'     // x2b = PLUS SIGN 
        // SEP char ',' // x2c = COMMA 
        |       '-'     // x2d = HYPHEN-MINUS 
        |       '.'     // x2e = FULL STOP 
        |       '/'     // x2f = SOLIDUS 
        |       ':'     // x3a = COLON 
        |       ';'     // x3b = SEMICOLON 
        |       '<'     // x3c = LESS-THAN SIGN 
        |       '='     // x3d = EQUALS SIGN 
        |       '>'     // x3e = GREATER-THAN SIGN 
        |       '?'     // x3f = QUESTION MARK 
        |       '@'     // x40 = COMMERCIAL AT 
        |       '['     // x5b = LEFT SQUARE BRACKET 
        |       ']'     // x5d = RIGHT SQUARE BRACKET 
        |       '^'     // x5e = CIRCUMFLEX ACCENT 
        |       '_'     // x5f = LOW LINE 
        |       '`'     // x60 = GRAVE ACCENT 
        |       '{'     // x7b = LEFT CURLY BRACKET 
        |       '|'     // x7c = VERTICAL LINE 
        |       '}'     // x7d = RIGHT CURLY BRACKET 
        |       '~'     // x7e = TILDE 
        ) ;


-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Jim Idle
Sent: 01 April 2010 14:58
To: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Parsing Large Files

The other possibility is of course that you are trying to parse a
massive file in one lump. You probably just want to reinvoke the parser
for each deal record (break it up in the string stream).
Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest- 
> bounces at antlr.org] On Behalf Of Kumar, Amitesh
> Sent: Thursday, April 01, 2010 2:13 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] Parsing Large Files
> 
> Hi guys, what we are looking for is just parsing the file and recording
> the errors; we don't need to keep track of any tokens or an AST.
> I'm getting:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2760)
>         at java.util.Arrays.copyOf(Arrays.java:2734)
>         at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
>         at java.util.ArrayList.add(ArrayList.java:351)
>         at org.antlr.runtime.CommonTokenStream.fillBuffer(CommonTokenStream.java:116)
>         at org.antlr.runtime.CommonTokenStream.LT(CommonTokenStream.java:238)
>         at org.antlr.runtime.Parser.getCurrentInputSymbol(Parser.java:54)
>         at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:104)
>         at DealsAll2Parser.header(DealsAll2Parser.java:123)
>         at DealsAll2Parser.file(DealsAll2Parser.java:67)
>         at AntlrMain.main(AntlrMain.java:53)
> 
> I see where the error is coming from: the CommonTokenStream is keeping
> track of all past tokens. How can I make it so it doesn't? Do I have to
> create my own token stream, or is there an easy way?
> 
> Cheers
> Kumaap0
> 
> 
> 
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address



