[antlr-interest] Help with grammar for IRC TEXT

Sun Mar 16 10:40:38 PST 2003

Hello,

This is my first try at creating an ANTLR grammar file.
I have read through the ANTLR FAQ and the excellent article "Lexical
Analysis with ANTLR", so I know enough to be dangerous, but I think
I'm missing something fundamental.

I am trying to parse text from an IRC channel into the following 
tokens.

URL
FLOAT
INT
WORD
WS

IRC_BOLD   /* CTRL-B toggles text bold */
IRC_PLAIN  /* CTRL-O turns off text decoration */
IRC_UNDERLINE /* CTRL-U toggles text underlined */
IRC_REVERSE  /* CTRL-R toggles text reversed */
IRC_COLOR  /* CTRL-K INT, e.g. CTRL-K12, makes following text
              colored (00 - 16) */

NONWORD (anything else)

Using my IRCLexer class, I am converting the IRC text tokens into the
equivalent HTML.
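
To make the intended conversion concrete, here is a minimal Java sketch of
the output expected for the bold and plain codes only. It is an
illustration under assumptions, not the actual IRCLexer code: the class
name IrcHtmlSketch and the method toHtml are hypothetical, and it scans raw
characters instead of consuming lexer tokens.

```java
/** Hypothetical helper for illustration; not part of the IRCLexer grammar. */
final class IrcHtmlSketch {
    static final char BOLD = '\u0002';   // CTRL-B toggles bold
    static final char PLAIN = '\u000f';  // CTRL-O turns decoration off

    /** Converts bold/plain control codes to HTML and escapes markup characters. */
    static String toHtml(String irc) {
        StringBuilder out = new StringBuilder();
        boolean bold = false;
        for (int i = 0; i < irc.length(); i++) {
            char c = irc.charAt(i);
            if (c == BOLD) {
                // Toggle: open the tag if bold is off, close it if bold is on.
                out.append(bold ? "</b>" : "<b>");
                bold = !bold;
            } else if (c == PLAIN) {
                // Plain closes any open decoration.
                if (bold) {
                    out.append("</b>");
                    bold = false;
                }
            } else if (c == '<') {
                out.append("&lt;");
            } else if (c == '>') {
                out.append("&gt;");
            } else if (c == '&') {
                out.append("&amp;");
            } else {
                out.append(c);
            }
        }
        // Close a tag left open at the end of the line.
        if (bold) {
            out.append("</b>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // CTRL-B test CTRL-O test
        System.out.println(toHtml("\u0002test\u000ftest")); // prints <b>test</b>test
    }
}
```

Underline, reverse, and color would follow the same toggle pattern, each
with its own flag.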

Well, the problem is I am not getting all the tokens I expect. When I
compile the file I get lots of non-determinism warnings, so I know 
something is wrong.

The URL, IRC_BOLD, IRC_PLAIN, IRC_UNDERLINE, IRC_REVERSE, IRC_COLOR,
and WS tokens are handled fine, but I'm not getting FLOAT, INT, WORD,
and NONWORD returned correctly.

Can someone tell me what I'm doing wrong here?

Also, certain text is causing an Exception. For example, if I type
"http://  " I get the error "line 1:8 unexpected char: 0x?F".

How can I get ANTLR to *never* throw an Exception, and just
pass through text which doesn't match anything? I tried doing
that with the NONWORD rule.

Here is the example text I am trying in the file:

<CTRL-B>test<CTRL-O>test
IRC_BOLD IRC_WORD IRC_PLAIN IRC_WORD

http://www.cnn.com 
URL

23.45 -2333.555
FLOAT WS INT

23.555http://www.cnn.com?num=23.445<CTRL-K>09-23.555
FLOAT URL IRC_COLOR(09) FLOAT

----

class IRCLexer extends Lexer;

options {
	k=8;
	filter=false;
	/* all 8 bit chars */
	charVocabulary = '\u0000'..'\u00FF';
}

URL     : HTTP ( LETTER | DIGIT | URL_SPECIAL_CHAR )+
	{ System.out.println("URL "+getText()); }
	;
IRC_BOLD    : '\002' /* CTRL-B*/
	{ System.out.println("IRC_BOLD"); }
	;
IRC_PLAIN   :  '\u000f' /*CTRL-O*/
	{ System.out.println("IRC_PLAIN"); }
	;
IRC_UNDERLINE : '\u0015' /*CTRL-U*/
	  { System.out.println("IRC_UNDERLINE"); }
	  ;
IRC_REVERSE : '\u0016' /*CTRL-R*/
	{ System.out.println("IRC_REVERSE"); }
	;
IRC_COLOR   : '\u0003' /*CTRL-K*/  i:INT_2SD
	{  
	  System.out.println("IRC_COLOR = "+ i.getText());
	  setText(i.getText());
	}
	;

IRC_WORD : ( LETTER | DIGIT | '_' )
        { System.out.println("IRC_WORD: "+getText()); }
        ;

FLOAT_OR_INT : ( INT '.' ) => FLOAT 
	     { 
	     $setType(FLOAT); 
	     System.out.println("FLOAT: "+getText()); 
	     }
	     | ( INT )
	     { 
	     $setType(INT); 
	     System.out.println("INT: "+getText()); 
	     }
	;

FLOAT : INT '.' UNSIGNED_INT
        { System.out.println("FLOAT : "+getText()); }
        ;

INT : ( '-' UNSIGNED_INT )
        { System.out.println("INT: "+getText()); }
        ;

WS  :   (   ' '
        |   '\t'
        |   '\r' '\n' { newline(); }
        |   '\n'      { newline(); }
        )
        {
	System.out.println("WS");	
/*      I want to return WS as tokens
	$setType(Token.SKIP);
*/
	} 
    ;

/* Catchall, pass through everything not matched above ?? */
NONWORD : . 
        { System.out.println("NONWORD: '"+getText()+"'"); }
	;

// protected means the token can only be called from another lexer rule;
// it will not ever directly return a token to the parser.
protected                                    

HTTP    : "http://"
	{ System.out.println("http:// "+getText()); }
	;

URL_SPECIAL_CHAR : ('$' | '-' | '_' | '@' | '.' | '&' | '+' |
		 '!' | '*' | '"' | '\'' | '(' | ')' | ',' |
		 '=' | ';' | '/' | '#' | '?' | '\\' | ':' | '%' )
	  { System.out.println("URL_SPECIAL_CHAR: "+getText()); }
	  ;

LETTER : ('a'..'z'|'A'..'Z')
/*
        { System.out.println("LETTER: "+getText()); }
*/
	;

UNSIGNED_INT : (DIGIT)+
        { System.out.println("UNSIGNED_INT: "+getText()); }
        ;

DIGIT : ('0'..'9')
/*
        { System.out.println("Found numeric: "+getText()); }
*/
        ;

/* special rule to match either 1 or 2 digit integers
   used by IRC_COLOR above */
INT_2SD : (DIGIT)(DIGIT)?
        { System.out.println("INT_2SD: "+getText()); }
        ;
