[antlr-interest] Search free text form for special tags grammar

Ronald Haring ronald.haring at gmail.com
Mon Sep 10 00:59:56 PDT 2007


Hello all,

I've been trying to create a grammar that searches arbitrary text for HTML
<a href> tags. I can't use existing parsers (like TagSoup) because the free
text can contain ASP, JSP, Velocity or FreeMarker code, but I still have to
find the <a href> tags in it. My grammar works so far in that it finds one
<a href> tag, but I can't get it to scan a complete document and return all
of them.

Here is my grammar so far:

grammar AHrefFinder;
@header {
import java.util.HashMap;
import java.util.Map;
}
href  returns [Map map]
@init {
    map = new HashMap();
}
    :
     lt (WS)* 'a' (attr[map])* '>' (.)* '</a>';

attr[Map map]
@init {
    String key = null;
    String value = null;
}
@after {
    map.put(key, value);
}
    : attrKey {
        key = $attrKey.attributeKey;
    }
    ('=' attrValue {
        value = $attrValue.attributeValue;
    })?;

attrKey returns [String attributeKey]
        : WORD {
        attributeKey = $WORD.text;
};

attrValue returns [String attributeValue]
        : (
            WORD {
                attributeValue = $WORD.text;
            }
            |
            STRING {
                attributeValue = $STRING.text;
                // strip the surrounding quotes
                attributeValue = attributeValue.substring(1, attributeValue.length() - 1);
            }
          );

allwords    : (anyword)*;
anyword        : (.)+;
lt        : '<';
/** Match until next whitespace; can be file, int, etc... */
WORD:    (
        'a'..'z' | '0'..'9' | '/' | '.' | '#' | '_'
        )+
    ;

protected
STRING
    :    '"' (~'"')* '"'
    |    '\'' (~'\'')* '\''
    ;


protected
WS    :    (    ' '
        |    '\t'
        |    '\f'
        |    (    '\r\n'  // DOS
            |    '\r'    // Macintosh
            |    '\n'    // Unix (the right way)
            )
        )
        {skip(); }
    ;


Java file for testing:

import java.io.ByteArrayInputStream;
import java.util.Map;

import org.antlr.runtime.ANTLRInputStream;
import org.antlr.runtime.CommonTokenStream;

public class TestHRefFinder {

    /**
     * User: RHaring
     * Date: 4-sep-2007
     * Time: 17:07:05
     */
    public static void main(String[] args) throws Exception {
        String definition = "some free form text <a href=\"http://www.test.nl\">test</a>"
                + "And even more text <a href=\"http://www.test2.nl\">even more text</a>";
        ANTLRInputStream input = new ANTLRInputStream(
                new ByteArrayInputStream(definition.getBytes()));
        AHrefFinderLexer lexer = new AHrefFinderLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        AHrefFinderParser parser = new AHrefFinderParser(tokens);
        Map map = parser.href();
        // So far this is correct: I have found the first href, but how do
        // I find all the others?
    }
}
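
To make the goal concrete, the result I am after (one attribute map per
<a ...> tag, for every tag in the document) can be sketched in plain Java
with java.util.regex. This is not an ANTLR solution and only illustrates the
expected output; the class name AnchorScan and the method findAnchors are
hypothetical, and the sketch assumes that no attribute value contains a '>'
character.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not generated from the grammar: it only shows the desired result.
class AnchorScan {

    // An opening <a ...> tag; group 1 holds the raw attribute text.
    // Assumption: no '>' appears inside an attribute value.
    private static final Pattern ANCHOR =
            Pattern.compile("<\\s*a\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // key="value", key='value', key=value, or a bare key without a value.
    private static final Pattern ATTR = Pattern.compile(
            "([A-Za-z_][\\w.-]*)(?:\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+)))?");

    static List<Map<String, String>> findAnchors(String text) {
        List<Map<String, String>> anchors = new ArrayList<>();
        Matcher tag = ANCHOR.matcher(text);
        while (tag.find()) { // keep scanning after each match instead of stopping at the first
            Map<String, String> attrs = new LinkedHashMap<>();
            Matcher attr = ATTR.matcher(tag.group(1));
            while (attr.find()) {
                String value = attr.group(2) != null ? attr.group(2)
                             : attr.group(3) != null ? attr.group(3)
                             : attr.group(4); // stays null for a bare key
                attrs.put(attr.group(1), value);
            }
            anchors.add(attrs);
        }
        return anchors;
    }

    public static void main(String[] args) {
        String definition = "some free form text <a href=\"http://www.test.nl\">test</a>"
                + "And even more text <a href=\"http://www.test2.nl\">even more text</a>";
        for (Map<String, String> attrs : findAnchors(definition)) {
            System.out.println(attrs.get("href"));
        }
    }
}
```

For the test string above this yields two maps, one per tag, each holding a
single href entry. What I am looking for is the ANTLR equivalent of the outer
while loop.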





Any tips, hints or remarks are greatly appreciated.

Regards
Ronald