[antlr-interest] Parsing Large Files
Kumar, Amitesh
Amitesh.Kumar at standardbank.com
Thu Apr 1 07:26:44 PDT 2010
Hi Jim, you're correct, I'm new to ANTLR. Below is my CSV grammar.
This is what I'm running:
CharStream lex = new ANTLRFileStream("Dealsall3.csv");
DealsAll3Lexer csv3Lexer = new DealsAll3Lexer(lex);
csv3Lexer.setBacktrackingLevel(0);
CommonTokenStream tokens = new CommonTokenStream(csv3Lexer);
tokens.discardOffChannelTokens(true);
DealsAll3Parser csv3Parser = new DealsAll3Parser(tokens);
csv3Parser.file();
System.out.println(csv3Parser.getNumberOfSyntaxErrors());
You're right, I could fix the above by not using ANTLRFileStream and
just using an ANTLRStringStream, chunking the file myself outside
of ANTLR.
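The chunking idea can be sketched roughly as follows. This is only an illustration, not tested against the real grammar: the `RecordParser` interface and the stand-in field-count check are hypothetical, and the ANTLR 3 wiring is shown only in a comment, on the assumption that the generated `detail` rule is invoked once per line (the header line would need the `header` rule instead).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class ChunkedParse {
    /** Hypothetical hook: parses one record and returns its syntax-error count. */
    interface RecordParser {
        int parse(String record);
    }

    /** Feeds the input to the parser one line at a time and sums the errors. */
    static int countErrors(BufferedReader in, RecordParser parser) throws IOException {
        int errors = 0;
        String line;
        while ((line = in.readLine()) != null) {
            // Only the current line is referenced here, so earlier lines
            // (and any tokens built from them) are eligible for collection.
            // With the ANTLR 3 runtime the hook would be roughly (assumption):
            //   DealsAll3Lexer lexer = new DealsAll3Lexer(new ANTLRStringStream(record));
            //   DealsAll3Parser p = new DealsAll3Parser(new CommonTokenStream(lexer));
            //   p.detail();
            //   return p.getNumberOfSyntaxErrors();
            errors += parser.parse(line + "\n");
        }
        return errors;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in parser: flags any line without exactly 10 comma-separated fields.
        RecordParser stub = record -> record.split(",", -1).length == 10 ? 0 : 1;
        String sample = "1,2,3,4,a,b,c,d,e,f\n1,2,3\n";
        BufferedReader in = new BufferedReader(new StringReader(sample));
        System.out.println(countErrors(in, stub)); // prints 1
    }
}
```

For a real file, the `StringReader` would be replaced by a `FileReader` on the CSV, so that only one line is held in memory at a time.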
But my general issue is that not all my data is a simple CSV file; some
will be multi-line records. Hence I didn't want to keep a record of the
tokens.
Any ideas? By the way, thanks for your reply.
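For the multi-line case, one possible approach is to group physical lines into logical records before handing each record to the parser. The sketch below is an assumption-laden illustration: the boundary rule (a record starts on a line whose first character is a digit) is hypothetical and would have to be replaced by whatever actually marks a record start in the data. Records are passed to a callback one at a time, so nothing accumulates across records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

public class RecordGrouper {
    /**
     * Groups physical lines into logical records and passes each finished
     * record to the sink. A line matching isRecordStart opens a new record.
     */
    static void forEachRecord(Iterable<String> lines,
                              Predicate<String> isRecordStart,
                              Consumer<String> sink) {
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            if (isRecordStart.test(line) && current.length() > 0) {
                sink.accept(current.toString()); // previous record is complete
                current.setLength(0);
            }
            current.append(line).append('\n');
        }
        if (current.length() > 0) {
            sink.accept(current.toString()); // flush the last record
        }
    }

    public static void main(String[] args) {
        // Hypothetical boundary rule: a record starts with a digit.
        Predicate<String> startsWithDigit =
                s -> !s.isEmpty() && Character.isDigit(s.charAt(0));
        List<String> lines = Arrays.asList("1,a,b", "  continued", "2,c,d");
        List<String> records = new ArrayList<>(); // collected only for the demo
        forEachRecord(lines, startsWithDigit, records::add);
        System.out.println(records.size()); // prints 2
    }
}
```

In real use the sink would invoke the parser on each record and add up the error counts instead of collecting the records in a list.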
Cheers
Kumaap0
grammar DealsAll3 ;
file : header ( detail )* EOF ;
SEP : WS? ( ',') WS? ;
header :
'IdentID,FGamma Tot,FutDeltaTot,FutGamma Tot,Barrier2,BarrierLevel,Cmp_CP,Cmp_Delivery,Cmp_Expiry,Cmp_Strike'
NL
;
detail
: f_IdentID
SEP ( f_FGamma_Tot )?
SEP ( f_FutDeltaTot )?
SEP ( f_FutGamma_Tot )?
SEP ( f_Barrier2 )?
SEP ( f_Barrier_Level )?
SEP ( f_Cmp_CP )?
SEP ( f_Cmp_Delivery )?
SEP ( f_Cmp_Expiry )?
SEP ( f_Cmp_Strike )?
NL ;
f_IdentID : NUMBER ;
f_FGamma_Tot : NUMBER ;
f_FutDeltaTot : NUMBER ;
f_FutGamma_Tot : NUMBER ;
f_Barrier2 : STRING ;
f_Barrier_Level : STRING ;
f_Cmp_CP : STRING ;
f_Cmp_Delivery : STRING ;
f_Cmp_Expiry : STRING ;
f_Cmp_Strike : STRING ;
DATETIME : DATE ( SP | 'T' ) TIME ;
DATE :
( ( ( ( '0' | '1' | '2' ) '0'..'9' ) | ( '3' ( '0' | '1' )
) )
( '-' | '/' )
( ( '01' | '02' | '03' | '04' | '05' | '06' | '07'
| '08' | '09' | '10' | '11' | '12' )
| ( 'JAN' | 'FEB' | 'MAR' | 'APR' | 'MAY' | 'JUN'
| 'JUL' | 'SEP' | 'OCT' | 'NOV' | 'DEC' )
| ( 'Jan' | 'Feb' | 'Mar' | 'Apr' | 'May' | 'Jun'
| 'Jul' | 'Sep' | 'Oct' | 'Nov' | 'Dec' )
)
( '-' | '/' )
( ( '0'..'9' '0'..'9' )? '0'..'9' '0'..'9' )
)
| ( ( '0'..'9' '0'..'9' '0'..'9' '0'..'9' )
( '-' | '/' )
( '01' | '02' | '03' | '04' | '05' | '06' | '07' | '08'
| '09' | '10' | '11' | '12' )
( '-' | '/' )
( ( ( '0' | '1' | '2' ) '0'..'9' ) | ( '3' ( '0' | '1' )
) )
)
;
MONTH_YEAR :
( ( '01' | '02' | '03' | '04' | '05' | '06' | '07' | '08'
| '09' | '10' | '11' | '12' )
| ( 'JAN' | 'FEB' | 'MAR' | 'APR' | 'MAY' | 'JUN' | 'JUL'
| 'SEP' | 'OCT' | 'NOV' | 'DEC' )
| ( 'Jan' | 'Feb' | 'Mar' | 'Apr' | 'May' | 'Jun' | 'Jul'
| 'Sep' | 'Oct' | 'Nov' | 'Dec' )
)
'-'
( ( '0'..'9' '0'..'9' )? '0'..'9' '0'..'9' )
;
MONTH_DAY :
( ( ( '0' | '1' | '2' ) '0'..'9' ) | ( '3' ( '0' | '1' ) ) )
'-'
( ( '01' | '02' | '03' | '04' | '05' | '06' | '07' | '08'
| '09' | '10' | '11' | '12' )
| ( 'JAN' | 'FEB' | 'MAR' | 'APR' | 'MAY' | 'JUN' | 'JUL'
| 'SEP' | 'OCT' | 'NOV' | 'DEC' )
| ( 'Jan' | 'Feb' | 'Mar' | 'Apr' | 'May' | 'Jun' | 'Jul'
| 'Sep' | 'Oct' | 'Nov' | 'Dec' )
)
;
TIME :
( ( '0'..'1' '0'..'9' ) | ('2' '0'..'4' ) ) // '00' to '24'
':'
( '0'..'5' '0'..'9' ) // '00' to '59'
':'
( '0'..'5' '0'..'9' ) // '00' to '59'
( ( 'Z' // UTC
| ( '+' | '-' ) '00' ( (':' | ' ' ) '00' )?
) ?
) ;
NUMBER
: ( '+' | '-' )? // It may be signed
( ( '0'..'9' )+ '.' ( '0'..'9' )* // Decimal point with leading and trailing digits
| '.' ( '0'..'9' )+ // or it may be just a mantissa
| '0'..'9'+ // or it may be an integer
)
;
STRING
: ('"') VALID_CHAR+ ('"') // Must have quotes at both ends
| VALID_CHAR+ // or no quote at either end
;
fragment VALID_CHAR :
( 'a'..'z' | 'A'..'Z' | '0'..'9' // the alphanumeric characters
| ' ' // x20 = SPACE
| '!' // x21 = EXCLAMATION MARK
| '#' // x23 = NUMBER SIGN
| '$' // x24 = DOLLAR SIGN
| '%' // x25 = PERCENT SIGN
| '&' // x26 = AMPERSAND
| '(' // x28 = LEFT PARENTHESIS
| ')' // x29 = RIGHT PARENTHESIS
| '*' // x2a = ASTERISK
| '+' // x2b = PLUS SIGN
// SEP char ',' // x2c = COMMA
| '-' // x2d = HYPHEN-MINUS
| '.' // x2e = FULL STOP
| '/' // x2f = SOLIDUS
| ':' // x3a = COLON
| ';' // x3b = SEMICOLON
| '<' // x3c = LESS-THAN SIGN
| '=' // x3d = EQUALS SIGN
| '>' // x3e = GREATER-THAN SIGN
| '?' // x3f = QUESTION MARK
| '@' // x40 = COMMERCIAL AT
| '[' // x5b = LEFT SQUARE BRACKET
| ']' // x5d = RIGHT SQUARE BRACKET
| '^' // x5e = CIRCUMFLEX ACCENT
| '_' // x5f = LOW LINE
| '`' // x60 = GRAVE ACCENT
| '{' // x7b = LEFT CURLY BRACKET
| '|' // x7c = VERTICAL LINE
| '}' // x7d = RIGHT CURLY BRACKET
| '~' // x7e = TILDE
) ;
-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Jim Idle
Sent: 01 April 2010 14:58
To: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Parsing Large Files
The other possibility is of course that you are trying to parse a
massive file in one lump. You probably just want to reinvoke the parser
for each deal record (break it up in the string stream).
Jim
> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Kumar, Amitesh
> Sent: Thursday, April 01, 2010 2:13 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] Parsing Large Files
>
> Hi guys, what we are looking for is just parsing the file and recording
> the errors; we don't need to keep track of any tokens or an AST.
> I'm getting:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:2760)
> at java.util.Arrays.copyOf(Arrays.java:2734)
> at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
> at java.util.ArrayList.add(ArrayList.java:351)
> at org.antlr.runtime.CommonTokenStream.fillBuffer(CommonTokenStream.java:116)
> at org.antlr.runtime.CommonTokenStream.LT(CommonTokenStream.java:238)
> at org.antlr.runtime.Parser.getCurrentInputSymbol(Parser.java:54)
> at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:104)
> at DealsAll2Parser.header(DealsAll2Parser.java:123)
> at DealsAll2Parser.file(DealsAll2Parser.java:67)
> at AntlrMain.main(AntlrMain.java:53)
> I see where the error is coming from: the CommonTokenStream is keeping
> track of all past tokens. How can I make it so it doesn't? Do I have to
> create my own token stream, or is there an easy way?
>
> Cheers
> Kumaap0
List: http://www.antlr.org/mailman/listinfo/antlr-interest
Unsubscribe:
http://www.antlr.org/mailman/options/antlr-interest/your-email-address