[antlr-interest] skip lines until pattern

Tue Apr 13 03:58:03 PDT 2004

Chris Black wrote:
> I would like to parse a file format that has a bunch of headers (that
> I don't care about for the moment) and then tab-separated values. I
> have parsed tab-separated value files before without problems, but
> skipping over the header is really driving me batty. I have tried
> numerous things; right now I have a separate lexer with filter=true
> only matching the last line in the header and then calling a change on
> the selector to move to another lexer. For some reason I am having
> VERY odd behavior not matching the first newline in the header (which
> is the first newline in the file). I do not know if this is the root
> of my problems, but my parser never matches or prints anything.
> 
> The file looks like:
> Some Space String Here
> Fieldname: value with spaces or other chars [TAB] Fieldname2: value
> [ a few more lines like this, not all with the same number of fields
> per line]
> 
> Magic Date: Mmm dd, yyyy
> start [tab]of  [tab]seperated  [tab]stuff
> I     [tab]can [tab]handle     [tab]ok
> 
> 
> Note that I don't care about anything but skipping to the line after 
> the "Magic Date:" line. In theory I may want to do more with the
> header data later, but at this point I've spent over four hours trying
> to just skip lines properly.
> 
> Here is how I am trying to do it:
> Header.g:
> class HeaderLexer extends Lexer;
> 
> options {
> 	k = 3;
> 	filter=true;
> }
> 
> protected
> CHAR: ':' | ' ' | ',' | '_'  | '.' | '\t' | 'A'..'Z' | 'a'..'z' |
> '0'..'9' ;
> protected
> EXPORTDATE: "Magic Date:" ;
> 
> ENDOFHEADERLINE: e:EXPORTDATE
> 	{ System.err.println("Found EXPORTDATE string"); }
> 	(.)+
> 	{ System.err.println("End of skip header at line " + e.getLine()); 
> 		Importer.selector.push("main");
> 	} ;
> NEWLINE: ( "\r\n" // DOS
>     | '\r'   // MAC
>     | '\n'   // Unix
>     )
>     { newline(); System.err.println("NEWLINE in Header.g: " +
> getLine());
>     }
>   ;
> 
> ---
> I've tried a few variations, including trying to make a (~ '\n')
> class, use the CHAR class, or just use '.'.
> 
> 
> 
> Data.g:
> class DataLexer extends Lexer;
> 
> options {
>   k=2;
> }
> 
> protected DOT: '.' ;
> protected COLON: ':' ;
> protected COMMA: ',' ;
> protected HASH: '#' ;
> protected SPACE: ' ' ;
> protected FIELDCHAR: ('a'..'z' | 'A'..'Z' | '-'  | '0'..'9' | DOT 
> 	| COLON | COMMA | SPACE) ;
> TAB: '\t' ;
> FIELD: (FIELDCHAR)+ 
> { System.err.println("FIELD: " + "found"); }  ;
> NEWLINE: ( "\r\n" // DOS
>     | '\r'   // MAC
>     | '\n'   // Unix
>     )
>     { newline(); System.err.println("data lexer NEWLINE: " +
> getLine());
>     }
>   ;
> 
> ---
> Within Data.g I had a parser as well, but I've just been trying
> anything to get it to print, right now it just looks like:
> class DataParser extends Parser;
> options {
>   k=4;
>   buildAST=true;
> }
> 
> //contents: f:FIELD { System.err.println("found field at " +
> f.getLine()); } ;
> contents: n:NEWLINE { System.err.println("found newline at " +
> n.getLine()); };
> ---
> My Main.java (called Importer) does:
> DataInputStream input = new DataInputStream(new FileInputStream(infile));
>       HeaderLexer header = new HeaderLexer(input);
>       DataLexer main = new DataLexer(header.getInputState());
>       
>       selector.addInputStream(header,"header");
>       selector.addInputStream(main,"main");
>       selector.select("header");
>       System.err.println("header lexer selected");
>       DataParser parser = new DataParser(selector);
>       System.err.println("dataparser created/attached to selector");
>       parser.contents();
>       System.err.println("parser.contents called");
>       System.err.flush();
> ---
> 
> When I run this I get output from both lexers as expected, but I can
> never get the parser to print anything. I also have an odd output at
> the beginning of the file:
> line 1:110: expecting NEWLINE, found '
> '
> NEWLINE in Header.g: 3
> 
> But line 1 does not HAVE 110 characters. I've also played around with
> this, ignoring newlines, and then I get an error like "line
> 1:27123..." but the line has less than 20000 characters. Am I
> misinterpreting these error messages? Also, I have tracked down the
> "expecting NEWLINE" error to the parser.
> 
> After looking at the multiLexer example, I've also tried to create a
> CommonTokenTypes.txt file and import it, but I wasn't able to get that
> to work either. (I could not figure out the proper way to create and
> use that file from the multiLexer example).
> 
> Does anyone have any tips or a better way to skip lines of a file
> until a certain pattern is matched?
> 
> Thanks in advance,
> Chris
> 
> 
A simple solution is to read the file with a BufferedReader and consume all
lines up to and including the one that begins with "Magic Date:". Then pass
the Reader to the lexer.
This is quick to implement and will be faster than using a lexer to skip the
header.
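
The approach can be sketched roughly as follows. This is only a minimal
illustration: the class name HeaderSkipper and the method skipPast are
made-up names, and it assumes the marker always appears at the very start
of its line, as in the sample file.

```java
import java.io.BufferedReader;
import java.io.IOException;

// Hypothetical helper (not part of ANTLR): skips header lines before lexing.
final class HeaderSkipper {
    private HeaderSkipper() {}

    /**
     * Consumes lines from the reader up to and including the first line
     * that starts with the given marker.
     *
     * @return true if the marker line was found and consumed,
     *         false if the end of input was reached first
     */
    static boolean skipPast(BufferedReader reader, String marker) throws IOException {
        String line;
        // readLine() treats "\r\n", "\r" and "\n" as line terminators,
        // so DOS, Mac and Unix files are all handled here.
        while ((line = reader.readLine()) != null) {
            if (line.startsWith(marker)) {
                return true; // reader is now positioned at the first data line
            }
        }
        return false;
    }
}
```

In Importer, this would presumably be called as
HeaderSkipper.skipPast(in, "Magic Date:") on a BufferedReader wrapping a
FileReader for the input file, and the same reader then handed to
new DataLexer(in), so neither the HeaderLexer nor the selector is needed.
Note that the lexer will then count lines from the first data line, not
from the top of the file.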
