[antlr-interest] Config file parsing grammar
James Cook
bonkabonka at gmail.com
Sat Nov 25 02:21:11 PST 2006
Howdy
I've been banging on a grammar to parse Unix-style config files
(notably /etc/hosts, /etc/ethers and dhcpd's leases file) but haven't
had much luck. I'm sure it's a simple fix but I've been at it for
almost three days now and have just about reached the throwing-stuff
stage. =P Anyway, here are the lexer bits:
lexer grammar CommonUnixConfig;
// ---------------------------------------------------------------------
// Base
// ---------------------------------------------------------------------
WHITESPACE
    : (' ' | '\t')+
    ;
NEWLINE
    : ('\r\n' | '\n' | '\r')
    ;
COLON
    : ':'
    ;
DOT
    : '.'
    ;
STAR
    : '*'
    ;
DASH
    : '-'
    ;
HASH
    : '#'
    ;
SLASH
    : '/'
    ;
DIGIT
    : '0'..'9'
    ;
HEXDIGIT
    : DIGIT | 'a'..'f' | 'A'..'F'
    ;
LETTER
    : 'a'..'z' | 'A'..'Z'
    ;
// ----------------------------------------------------------------------------
// Configuration Cruft
// ----------------------------------------------------------------------------
COMMENT
    : HASH ~NEWLINE*
    // { $channel=HIDDEN; System.out.println("comment"); }
    { System.out.println("comment"); skip(); }
    ;
BLANKLINE
    : WHITESPACE? NEWLINE
    { System.out.println("blankline"); skip(); }
    ;
// ----------------------------------------------------------------------------
// Ethernet
// ----------------------------------------------------------------------------
CLIENTID
    : HEXPAIR COLON MACADDRESS
    ;
MACADDRESS
    : HEXPAIR COLON HEXPAIR COLON HEXPAIR COLON HEXPAIR
      COLON HEXPAIR COLON HEXPAIR
    ;
fragment
HEXPAIR
    : HEXDIGIT HEXDIGIT
    ;
// ----------------------------------------------------------------------------
// Internet Address (DNS and Bare IP)
// ----------------------------------------------------------------------------
IPADDRESS
    : IPV4ADDRESS | IPV6ADDRESS
    ;
IPV4ADDRESS
    : BYTE DOT BYTE DOT BYTE DOT BYTE
    ;
// RFC 2373 Appendix B is evil
IPV6ADDRESS
    : HEXPART (COLON IPV4ADDRESS)?
    ;
HOSTNAME
    : DNSCHAR+ (DOT DNSCHAR+)* DOT?
    ;
// RFC 2373 Appendix B says the four parts of an IPv4address can have only one
// to three digits
fragment
BYTE
    : DIGIT (DIGIT DIGIT?)?
    ;
fragment
HEXPART
    : HEXSEQ | HEXSEQ COLON COLON HEXSEQ? | COLON COLON HEXSEQ?
    ;
fragment
HEXSEQ
    : HEX4 (COLON HEX4)*
    ;
fragment
HEX4
    : HEXDIGIT (HEXDIGIT (HEXDIGIT HEXDIGIT?)?)?
    ;
// As defined in RFC 1034
fragment
DNSCHAR
    : LETTER | DIGIT | DASH
    ;
======
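As a side check on what the address tokens above are meant to accept, a few of them can be mirrored as java.util.regex patterns and tried against sample strings without involving ANTLR at all. This is only a sketch: the class name TokenShapes and the constants are made up for illustration, IPv6 is left out, and a regex match says nothing about how the generated lexer chooses between competing rules.

```java
import java.util.regex.Pattern;

public class TokenShapes {
    // Mirrors fragment HEXPAIR: exactly two hex digits.
    static final String HEXPAIR = "[0-9a-fA-F]{2}";
    // Mirrors MACADDRESS: six colon-separated hex pairs.
    static final Pattern MACADDRESS =
        Pattern.compile(HEXPAIR + "(:" + HEXPAIR + "){5}");

    // Mirrors fragment BYTE: one to three digits (no 0-255 range check,
    // same as the grammar).
    static final String BYTE = "[0-9]{1,3}";
    // Mirrors IPV4ADDRESS: four dot-separated BYTEs.
    static final Pattern IPV4ADDRESS =
        Pattern.compile(BYTE + "(\\." + BYTE + "){3}");

    // Mirrors DNSCHAR+ : a run of letters, digits and dashes.
    static final String DNSCHARS = "[a-zA-Z0-9-]+";
    // Mirrors HOSTNAME: dot-separated runs with an optional trailing dot.
    static final Pattern HOSTNAME =
        Pattern.compile(DNSCHARS + "(\\." + DNSCHARS + ")*\\.?");

    public static void main(String[] args) {
        String[] samples = {
            "127.0.0.1", "localhost", "00:11:22:aa:bb:cc", "example.com."
        };
        for (String s : samples) {
            System.out.println(s
                + " mac=" + MACADDRESS.matcher(s).matches()
                + " ipv4=" + IPV4ADDRESS.matcher(s).matches()
                + " host=" + HOSTNAME.matcher(s).matches());
        }
    }
}
```

Under these patterns a dotted quad such as 127.0.0.1 satisfies both the IPV4ADDRESS and the HOSTNAME shape, because DNSCHAR includes digits.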
Next up is the particular parser I've been focusing on:
parser grammar Hosts;
options {
    tokenVocab = CommonUnixConfig;
}
go
    : hostline*
    ;
hostline
    : ip=IPADDRESS WHITESPACE hostname=HOSTNAME
      (WHITESPACE alias=HOSTNAME { System.out.println("alias: " + $alias); })*
      NEWLINE
      {
          System.out.println("ip addr : " + $ip);
          System.out.println("hostname : " + $hostname);
      }
    ;
======
And then, finally, the test harness:
import org.antlr.runtime.*;

public class hosts
{
    public static void main(String[] args)
        throws Throwable
    {
        ANTLRFileStream in = new ANTLRFileStream(args[0]);
        CommonUnixConfigLexer lexer = new CommonUnixConfigLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        HostsParser parser = new HostsParser(tokens);
        parser.go();
    }
}
======
In the event that there aren't any blank lines or comments, the file
parses properly. However, add in a blank line or a comment and
parsing seems to abort without throwing an exception. =( Also, the
print statements never execute but I suspect I'm using them wrong.
I didn't have any luck finding examples to pattern my efforts after -
most often newlines and whitespace are ignorable whereas they're
delimiters here. Any help would be appreciated. Thanks!
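For reference, the result the Hosts grammar is aiming for can be sketched in plain Java without ANTLR. This only illustrates the expected output, not a fix for the grammar. It assumes a comment runs from '#' to the end of the line and that fields are separated by spaces or tabs; the names HostsSketch and parseLine are hypothetical and do not appear in the grammar above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HostsSketch {
    // Returns [ip, hostname, alias...] for a host line, or an empty list
    // for blank lines and comment-only lines.
    static List<String> parseLine(String line) {
        int hash = line.indexOf('#');
        if (hash >= 0) {
            line = line.substring(0, hash); // drop the comment part
        }
        line = line.trim();
        if (line.isEmpty()) {
            return new ArrayList<>(); // blank or comment-only line
        }
        // Spaces and tabs are the field delimiters.
        return new ArrayList<>(Arrays.asList(line.split("[ \\t]+")));
    }

    public static void main(String[] args) {
        String[] sample = {
            "# local names",
            "",
            "127.0.0.1\tlocalhost loopback",
        };
        for (String line : sample) {
            List<String> fields = parseLine(line);
            if (fields.size() < 2) {
                continue; // nothing to report (or malformed, in this sketch)
            }
            System.out.println("ip addr : " + fields.get(0));
            System.out.println("hostname : " + fields.get(1));
            for (String alias : fields.subList(2, fields.size())) {
                System.out.println("alias: " + alias);
            }
        }
    }
}
```

Running the same input file through both this sketch and the generated parser gives a baseline for which lines should produce output and which should be skipped silently.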
--
James