[antlr-interest] Help with grammar for IRC TEXT
afleance
afleance at yahoo.com
Sun Mar 16 10:40:38 PST 2003
Hello,
This is my first try at creating an ANTLR grammar file.
I have read through the ANTLR FAQ and the excellent article "Lexical
Analysis with ANTLR", so I know enough to be dangerous, but I think
I'm missing something fundamental.
I am trying to parse text from an IRC channel into the following
tokens.
URL
FLOAT
INT
WORD
WS
IRC_BOLD /* CTRL-B toggles text bold */
IRC_PLAIN /* CTRL-O turns off text decoration */
IRC_UNDERLINE /* CTRL-U toggles text underlined */
IRC_REVERSE /* CTRL-R toggles text reversed */
IRC_COLOR /* CTRL-K INT, e.g. CTRL-K12, makes following text colored (00 - 16) */
NONWORD (anything else)
Using my IRCLexer class, I am converting the IRC text tokens into the
equivalent HTML.
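To make the goal concrete, here is a minimal sketch of the kind of conversion I have in mind, written in plain Java without ANTLR. The class and method names are hypothetical, the control-code values are the ones used in the grammar below, and color and reverse are left out for brevity. Every change of decoration closes the current span and opens a new one, so the HTML stays well nested.

```java
// Minimal sketch only: IrcHtmlSketch and toHtml are hypothetical names,
// not part of the IRCLexer grammar. Color and reverse are not handled.
final class IrcHtmlSketch {
    static final char BOLD = '\u0002';      // CTRL-B, toggles bold
    static final char PLAIN = '\u000f';     // CTRL-O, turns decoration off
    static final char UNDERLINE = '\u0015'; // CTRL-U, value taken from the grammar

    static String toHtml(String irc) {
        StringBuilder out = new StringBuilder();
        boolean bold = false, underline = false, spanOpen = false;
        for (int i = 0; i < irc.length(); i++) {
            char c = irc.charAt(i);
            if (c == BOLD || c == UNDERLINE || c == PLAIN) {
                // Update the decoration state for this control code.
                if (c == BOLD) bold = !bold;
                else if (c == UNDERLINE) underline = !underline;
                else { bold = false; underline = false; }
                // Close the current span, then reopen one if anything is still on.
                if (spanOpen) { out.append("</span>"); spanOpen = false; }
                if (bold || underline) {
                    out.append("<span style=\"");
                    if (bold) out.append("font-weight:bold;");
                    if (underline) out.append("text-decoration:underline;");
                    out.append("\">");
                    spanOpen = true;
                }
            } else if (c == '<') out.append("&lt;");
            else if (c == '>') out.append("&gt;");
            else if (c == '&') out.append("&amp;");
            else out.append(c);
        }
        if (spanOpen) out.append("</span>");
        return out.toString();
    }
}
```

For the input <CTRL-B>test<CTRL-O>test this wraps the first "test" in a bold span and leaves the second one plain.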
Well, the problem is I am not getting all the tokens I expect. When I
compile the file I get lots of non-determinism warnings, so I know
something is wrong.
The URL, IRC_BOLD, IRC_PLAIN, IRC_UNDERLINE, IRC_REVERSE, IRC_COLOR,
and WS tokens are handled fine, but I'm not getting FLOAT, INT, WORD,
and NONWORD returned correctly.
Can someone tell me what I'm doing wrong here?
Also, certain text is causing an Exception. For example, if I type
"http:// " I get the error "line 1:8 unexpected char: 0x?F".
How can I get ANTLR to *never* throw an Exception, and just
pass through text which doesn't match anything? I tried doing
that with the NONWORD rule.
Here is the example text I am trying in the file, with the tokens I expect under each line:
<CTRL-B>test<CTRL-O>test
IRC_BOLD IRC_WORD IRC_PLAIN IRC_WORD
http://www.cnn.com
URL
23.45 -2333.555
FLOAT WS INT
23.555http://www.cnn.com?num=23.445<CTRL-K>09-23.555
FLOAT URL IRC_COLOR(09) FLOAT
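As a rough illustration of the token streams listed above, here is a small stand-in written with java.util.regex rather than ANTLR. It is only a sketch: the class name, the method name, and the exact patterns are my assumptions about what each token should look like (for example, an optional leading '-' on both FLOAT and INT, and one or more characters in a word). The alternatives are tried in order and the last one matches any single character, so unmatched text comes back as NONWORD instead of raising an error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the lexer: ordered regex alternatives, first match wins.
final class IrcTokenSketch {
    // One name per capturing group, in the same order as the groups in TOKENS.
    static final String[] NAMES = {
        "URL", "FLOAT", "INT", "IRC_WORD", "WS", "IRC_BOLD", "IRC_PLAIN",
        "IRC_UNDERLINE", "IRC_REVERSE", "IRC_COLOR", "NONWORD"
    };

    static final Pattern TOKENS = Pattern.compile(
          "(http://[A-Za-z0-9$_@.&+!*\"'(),=;/#?\\\\:%-]+)" // URL
        + "|(-?[0-9]+\\.[0-9]+)"   // FLOAT
        + "|(-?[0-9]+)"            // INT
        + "|([A-Za-z0-9_]+)"       // IRC_WORD
        + "|(\\r\\n|[ \\t\\n])"    // WS
        + "|(\\x02)"               // IRC_BOLD (CTRL-B)
        + "|(\\x0F)"               // IRC_PLAIN (CTRL-O)
        + "|(\\x15)"               // IRC_UNDERLINE (CTRL-U, value from the grammar)
        + "|(\\x16)"               // IRC_REVERSE (CTRL-R, value from the grammar)
        + "|(\\x03[0-9]{1,2})"     // IRC_COLOR (CTRL-K plus one or two digits)
        + "|(.)",                  // NONWORD catch-all, so every character matches
        Pattern.DOTALL);

    // Returns the token type names for the whole input, in order.
    static List<String> tokenNames(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = TOKENS.matcher(text);
        while (m.find()) {
            for (int g = 1; g <= NAMES.length; g++) {
                if (m.group(g) != null) { names.add(NAMES[g - 1]); break; }
            }
        }
        return names;
    }
}
```

With these patterns the first example gives IRC_BOLD IRC_WORD IRC_PLAIN IRC_WORD and the last one gives FLOAT URL IRC_COLOR FLOAT. The troublesome input "http:// " is split into IRC_WORD, three NONWORD tokens, and WS without any exception.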
----
class IRCLexer extends Lexer;
options {
k=8;
filter=false;
/* all 8 bit chars */
charVocabulary = '\u0000'..'\u00FF';
}
URL : HTTP ( LETTER | DIGIT | URL_SPECIAL_CHAR )+
{ System.out.println("URL "+getText()); }
;
IRC_BOLD : '\002' /* CTRL-B*/
{ System.out.println("IRC_BOLD"); }
;
IRC_PLAIN : '\u000f' /*CTRL-O*/
{ System.out.println("IRC_PLAIN"); }
;
IRC_UNDERLINE : '\u0015' /*CTRL-U*/
{ System.out.println("IRC_UNDERLINE"); }
;
IRC_REVERSE : '\u0016' /*CTRL-R*/
{ System.out.println("IRC_REVERSE"); }
;
IRC_COLOR : '\u0003' /*CTRL-K*/ i:INT_2SD
{
System.out.println("IRC_COLOR = "+ i.getText());
setText(i.getText());
}
;
IRC_WORD : ( LETTER | DIGIT | '_' )
{ System.out.println("IRC_WORD: "+getText()); }
;
FLOAT_OR_INT : ( INT '.' ) => FLOAT
{
$setType(FLOAT);
System.out.println("FLOAT: "+getText());
}
| ( INT )
{
$setType(INT);
System.out.println("INT: "+getText());
}
;
FLOAT : INT '.' UNSIGNED_INT
{ System.out.println("FLOAT : "+getText()); }
;
INT : ( '-' UNSIGNED_INT )
{ System.out.println("INT: "+getText()); }
;
WS : ( ' '
| '\t'
| '\r' '\n' { newline(); }
| '\n' { newline(); }
)
{
System.out.println("WS");
/* I want to return WS as tokens
$setType(Token.SKIP);
*/
}
;
/* Catchall, pass through everything not matched above ?? */
NONWORD : .
{ System.out.println("NONWORD: '"+getText()+"'"); }
;
// protected means the token can only be called from another lexer rule;
// it will not ever directly return a token to the parser.
protected
HTTP : "http://"
{ System.out.println("http:// "+getText()); }
;
URL_SPECIAL_CHAR : ('$' | '-' | '_' | '@' | '.' | '&' | '+' |
'!' | '*' | '"' | '\'' | '(' | ')' | ',' |
'=' | ';' | '/' | '#' | '?' | '\\'':' | '%' )
{ System.out.println("URL_SPECIAL_CHAR: "+getText()); }
;
LETTER : ('a'..'z'|'A'..'Z')
/*
{ System.out.println("LETTER: "+getText()); }
*/
;
UNSIGNED_INT : (DIGIT)+
{ System.out.println("UNSIGNED_INT: "+getText()); }
;
DIGIT : ('0'..'9')
/*
{ System.out.println("Found numeric: "+getText()); }
*/
;
/* special rule to match either 1 or 2 digit integers
used by IRC_COLOR above */
INT_2SD : (DIGIT)(DIGIT)?
{ System.out.println("INT_2SD: "+getText()); }
;