[antlr-interest] parsing syslog entries...

inder sabharwal inder.sabharwal at gmail.com
Fri Jun 15 17:41:11 PDT 2007


Hi-

I am trying to write a tool for parsing syslog entries and have run 
into issues that I cannot resolve, in spite of reading The Definitive 
ANTLR Reference and other resources on the web.

My problem:
1.) Syslog entries contain a priority followed by a timestamp and 
hostname (and some more). The actual log entry follows these tokens.
2.) I want to distinguish the timestamp and ip/hostname in header from 
the ones that may be contained in the log message.
3.) The rules I have put together below result in these warnings:
Decision can match input such as "TS" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Decision can match input such as "HOSTNAME" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input
Decision can match input such as "IPV4" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
Decision can match input such as "PRIO..WS" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
The following alternatives are unreachable: 2

How can I set the precedence of the elements of the header rule to tell 
the parser that a TS can appear both in the header and in the message?

4.) The impression I got from the documentation was that the 
(.* EOL) rule would consume all characters until EOL. I expected this 
rule to override the WS rule (which skips whitespace) and give me a 
single token containing one big string (I am using this as the 
'message' rule below).
Instead, everything matched by the dot has already been tokenized by 
the lexer, so I get back 'all tokens' (since . matches any token) 
instead of all characters. Did I misread something here?
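
A rough sketch of the behaviour I seem to be seeing in point 4, assuming 
the lexer always runs ahead of the parser. This is plain Java, not ANTLR 
code; the class and method names are made up for illustration. Whitespace 
is dropped while tokenizing, so anything that later walks the result sees 
a list of tokens, never the original characters:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a toy lexer that discards whitespace
// before any "parser" gets to look at the input.
public class TokenStreamSketch {

    // Split the input on runs of whitespace and throw the whitespace away,
    // roughly what a WS rule with skip() does in the lexer.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : input.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A parser-level wildcard would step through this list one token
        // at a time; the original spacing is already gone.
        System.out.println(tokenize("su: auth failure for  root"));
    }
}
```

If this picture is right, a wildcard in a parser rule can only collect 
tokens, and matching raw characters up to EOL would have to happen in a 
lexer rule instead.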

Thanks in advance.
My rules file is attached here: -->

// Log messages are split into header and message part.
logMessage
    :    header message
    {
    };

// Header is PRIORITY + TIMESTAMP + (IP address or hostname). I have made
// TIMESTAMP optional as I am dealing with non-conformant syslog entries.
header    :    p=PRIO t=TS? (ip=IPV4 | h=HOSTNAME)?
{
    // t, ip and h are all optional, so guard each one against null.
    System.out.println("p=" + $p.text
        + " t=" + ($t!=null?$t.text:"null")
        + " ip=" + ($ip!=null?$ip.text:"null")
        + " h=" + ($h!=null?$h.text:"null"));
};

//Message is all characters until EOL or EOF.
message    :    (m+=.*) EOL? EOF?
{
    for (int i=0; i < $m.size(); i++) {
        System.out.println("m[" + i + "]=" + 
((Token)$m.get(i)).getText() + " - " + ((Token)$m.get(i)).getType());
    }
}
;   

PRIO    :    '<' (i+=INT)+ {$i.size() <= 4}? '>';

DATE    :    MONTH TWOINTS TWOINTS ':' TWOINTS ':' TWOINTS;

TS    :    (BIGLTR SMALLLTR SMALLLTR) ' ' TWOINTS ' ' TIME;


//IMPORTANT: Make sure IPV4 is before HOSTNAME as it is a subset of
//HOSTNAME and we want it matched first.
IPV4
    :    THREEINTS '.' THREEINTS '.' THREEINTS '.' THREEINTS;

HOSTNAME:    DOMAINPART ('.' DOMAINPART)+;


fragment
DOMAINPART
    :    ALPHANUM ('-' ALPHANUM)*;
       
ALPHANUM:    (LETTER | INT )+;

fragment   
TIME    :    TWOINTS ':' TWOINTS ':' TWOINTS;

fragment
MONTH    :    LETTER LETTER LETTER;

fragment
TWOINTS:    INT INT;

fragment
THREEINTS
    :    INT INT INT;
   
fragment
LETTER    :    ('a'..'z'|'A'..'Z');

fragment
SMALLLTR:    ('a'..'z');

fragment
BIGLTR    :    ('A'..'Z');

fragment
INT    :    '0'..'9';

fragment
EOL    :    (('\r\n') | ('\n'));

WS     :     (' ' |'\t' |'\n' |'\r' )+ {skip();} ;
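
For comparison, here is a minimal sketch of the split I am after, written 
with java.util.regex rather than ANTLR. It is only an illustration of the 
intended result, not a replacement for the grammar. The class name, method 
name and group names are hypothetical, the priority allows up to four 
digits to mirror the PRIO rule, the timestamp follows the TS rule 
("Mmm dd hh:mm:ss"), and the input line is assumed to have no trailing 
line terminator:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: split one syslog line into header fields and message.
public class SyslogHeaderSketch {

    // <PRI>, optional timestamp, optional IPv4 address or dotted hostname,
    // then everything else as the raw message text.
    private static final Pattern LINE = Pattern.compile(
          "^(?<prio><\\d{1,4}>)"
        + "(?:\\s*(?<ts>[A-Z][a-z]{2} \\d{2} \\d{2}:\\d{2}:\\d{2}))?"
        + "(?:\\s*(?<host>\\d{1,3}(?:\\.\\d{1,3}){3}"
        + "|[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+))?"
        + "\\s*(?<msg>.*)$");

    // Returns {prio, ts, host, msg}; ts and host are null when absent.
    // Returns null if the line does not start with a priority.
    static String[] split(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group("prio"), m.group("ts"), m.group("host"), m.group("msg")
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
            split("<34>Oct 11 22:14:15 mymachine.example.com su: failed")));
    }
}
```

Only the first timestamp and host after the priority are treated as header 
fields; anything later stays inside the message text. The sketch shares one 
ambiguity with the grammar: when the header has no host, a message whose 
first word looks like a dotted hostname is still captured as the host.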


--------
Regards.
Inder Sabharwal

