[antlr-interest] Search free text form for special tags grammar
Ronald Haring
ronald.haring at gmail.com
Mon Sep 10 00:59:56 PDT 2007
Hello all,
I've been trying to create a grammar that searches arbitrary text for HTML
<a href> tags. I can't use existing parsers (like tagsoup) since the free
text can contain ASP, JSP, Velocity or FreeMarker code, but I still
have to find the <a href> tags in it. My grammar works so far in
that it finds one <a href> tag, but I can't get it to search a complete
document and return all of them.
Here is my grammar so far:
grammar AHrefFinder;

@header {
import java.util.HashMap;
import java.util.Map;
}

href returns [Map map]
@init {
    map = new HashMap();
}
    : lt (WS)* 'a' (attr[map])* '>' (.)* '</a>'
    ;

attr[Map map]
@init {
    String key = null;
    String value = null;
}
@after {
    map.put(key, value);
}
    : attrKey { key = $attrKey.attributeKey; }
      ('=' attrValue { value = $attrValue.attributeValue; })?
    ;

attrKey returns [String attributeKey]
    : WORD { attributeKey = $WORD.text; }
    ;

attrValue returns [String attributeValue]
    : ( WORD { attributeValue = $WORD.text; }
      | STRING {
            // strip the surrounding quotes
            attributeValue = $STRING.text;
            attributeValue = attributeValue.substring(1, attributeValue.length() - 1);
        }
      )
    ;

allwords : (anyword)*;
anyword  : (.)+;
lt       : '<';

/** Match until next whitespace; can be file, int, etc... */
WORD
    : ( 'a'..'z' | '0'..'9' | '/' | '.' | '#' | '_' )+
    ;

STRING
    : '"' (~'"')* '"'
    | '\'' (~'\'')* '\''
    ;

WS  : ( ' '
      | '\t'
      | '\f'
      | ( '\r\n'  // DOS
        | '\r'    // Macintosh
        | '\n'    // Unix (the right way)
        )
      )
      { skip(); }
    ;
Java file for testing:
import java.io.ByteArrayInputStream;
import java.util.Map;

import org.antlr.runtime.ANTLRInputStream;
import org.antlr.runtime.CommonTokenStream;

public class TestHRefFinder {
    /**
     * User: RHaring
     * Date: 4-sep-2007
     * Time: 17:07:05
     */
    public static void main(String[] args) throws Exception {
        String definition = "some free form text <a href=\"http://www.test.nl\">test</a>"
                + "And even more text <a href=\"http://www.test2.nl\">even more text</a>";
        ANTLRInputStream input = new ANTLRInputStream(
                new ByteArrayInputStream(definition.getBytes()));
        AHrefFinderLexer lexer = new AHrefFinderLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        AHrefFinderParser parser = new AHrefFinderParser(tokens);
        Map map = parser.href();
        // So far this is correct: the first href is found.
        // But how do I find all the others?
    }
}
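To make the goal concrete, the following is a minimal sketch of the desired result: one attribute map per anchor, collected in a list. It uses java.util.regex purely as a stand-in for the grammar, so it illustrates the expected output rather than an ANTLR solution. It assumes double-quoted href values and ignores all other attributes; the class name HrefListSketch and the method findHrefs are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefListSketch {
    // Assumption: href values are double-quoted; other attributes are skipped.
    private static final Pattern A_HREF = Pattern.compile(
            "<\\s*a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    /** Returns one map per opening anchor tag found, in document order. */
    public static List<Map<String, String>> findHrefs(String text) {
        List<Map<String, String>> result = new ArrayList<Map<String, String>>();
        Matcher m = A_HREF.matcher(text);
        while (m.find()) {
            Map<String, String> attrs = new HashMap<String, String>();
            attrs.put("href", m.group(1));
            result.add(attrs);
        }
        return result;
    }

    public static void main(String[] args) {
        String definition = "some free form text <a href=\"http://www.test.nl\">test</a>"
                + "And even more text <a href=\"http://www.test2.nl\">even more text</a>";
        // Prints http://www.test.nl and then http://www.test2.nl
        for (Map<String, String> attrs : findHrefs(definition)) {
            System.out.println(attrs.get("href"));
        }
    }
}
```

With the grammar, the equivalent would be a result of the same shape (a list of maps) instead of the single Map that href currently returns.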
Any tips, hints or remarks are greatly appreciated.
Regards
Ronald