[antlr-interest] skip lines until pattern

Chris Black cblack0 at yahoo.com
Mon Apr 12 21:34:31 PDT 2004


I would like to parse a file format that has a bunch of headers (that
I don't care about for the moment) and then tab seperated values. I
have parsed tab-seperated value files before without problems, but
skipping over the header is really driving me batty. I have tried
numerous things, right now I have a seperate lexer with filter=true
only matching the last line in the header and then calling a change on
the selector to move to another lexer. For some reason I am having
VERY odd behavior not matching the first newline in the header (which
is the first newline in the file). I do not know if this is the root
of my problems, but my parser never matches or prints anything.

The file looks like:
Some Space String Here
Fieldname: value with spaces or other chars [TAB] Fieldname2: value
[ a few more lines like this, not all with the same number of fields
per line]

Magic Date: Mmm dd, yyyy
start [tab]of  [tab]seperated  [tab]stuff
I     [tab]can [tab]handle     [tab]ok


Note that I don't care about anything but skipping to the line after 
the "Magic Date:" line. In theory I may want to do more with the
header data later, but at this point I've spent over four hours trying
to just skip lines properly.

Here is how I am trying to do it:
Header.g:
class HeaderLexer extends Lexer;

options {
	k = 3;
	filter=true;
}

protected
CHAR: ':' | ' ' | ',' | '_'  | '.' | '\t' | 'A'..'Z' | 'a'..'z' |
'0'..'9' ;
protected
EXPORTDATE: "Magic Date:" ;

ENDOFHEADERLINE: e:EXPORTDATE
	{ System.err.println("Found EXPORTDATE string"); }
	(.)+
	{ System.err.println("End of skip header at line " + e.getLine()); 
		Importer.selector.push("main");
	} ;
NEWLINE: ( "\r\n" // DOS
    | '\r'   // MAC
    | '\n'   // Unix
    )
    { newline(); System.err.println("NEWLINE in Header.g: " +
getLine());
    }
  ;

---
I've tried a few variations, including trying to make a (~ '\n')
class, use the CHAR class, or just use '.'.



Data.g:
class DataLexer extends Lexer;

options {
  k=2;
}

protected DOT: '.' ;
protected COLON: ':' ;
protected COMMA: ',' ;
protected HASH: '#' ;
protected SPACE: ' ' ;
protected FIELDCHAR: ('a'..'z' | 'A'..'Z' | '-'  | '0'..'9' | DOT 
	| COLON | COMMA | SPACE) ;
TAB: '\t' ;
FIELD: (FIELDCHAR)+ 
{ System.err.println("FIELD: " + "found"); }  ;
NEWLINE: ( "\r\n" // DOS
    | '\r'   // MAC
    | '\n'   // Unix
    )
    { newline(); System.err.println("data lexer NEWLINE: " +
getLine());
    }
  ;

---
Within Data.g I had a parser as well, but I've just been trying
anything to get it to print, right now it just looks like:
class DataParser extends Parser:
options {
  k=4;
  buildAST=true;
}

//contents: f:FIELD { System.err.println("found field at " +
f.getLine()); } ;
contents: n:NEWLINE { System.err.println("found newline at " +
n.getLine()); };
---
My Main.java (called Importer) does:
DataInputStream input = new DataInputStream(new FileInputStream(infile));
      HeaderLexer header = new HeaderLexer(input);
      DataLexer main = new DataLexer(header.getInputState());
      
      selector.addInputStream(header,"header");
      selector.addInputStream(main,"main");
      selector.select("header");
      System.err.println("header lexer selected");
      DataParser parser = new DataParser(selector);
      System.err.println("dataparser created/attached to selector");
      parser.contents();
      System.err.println("parser.contents called");
      System.err.flush();
---

When I run this I get output from both lexers as expected, but I can
never get the parser to print anything. I also have an odd output at
the beginning of the file:
line 1:110: expecting NEWLINE, found '
'
NEWLINE in Header.g: 3

But line 1 does not HAVE 110 characters. I've also played around with
this, ignoring newlines, and then I get an error like "line
1:27123..." but the line has less than 20000 characters. Am I
misinterpreting these error messages? Also, I have tracked down the
"expecting NEWLINE" error to the parser.

After looking at the multiLexer example, I've also tried to create a
CommonTokenTypes.txt file and import it, but I wasn't able to get that
to work either. (I could not figure out the proper way to create and
use that file from the multiLexer example).

Does anyone have any tips or a better way to skip lines of a file
until a certain pattern is matched?

Thanks in advance,
Chris




 
Yahoo! Groups Links

<*> To visit your group on the web, go to:
     http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
     antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
     http://docs.yahoo.com/info/terms/
 



More information about the antlr-interest mailing list