[antlr-interest] simple URL extractor

bace.spam at gmx.net bace.spam at gmx.net
Tue Apr 24 09:53:52 PDT 2007


> > Hi all,
> > 
> > I want to extract a URL from a text with antlr v3. To separate the URL
> > from the remaining text, I want to search for each occurrence of 'http://'.
> > 
> > So I defined the lexer rule:
> > HTTP_INDICATOR : 'http://';
> > 
> > and parser rule:
> > url : HTTP_INDICATOR host (port)? (SLASH path)*;
> > 
> > 
> > If I use this definition and input something like
> > 'text http://www.goolge.com/index.html further text'
> > then the parser doesn't work as I imagined. The error message says that
> > 't' was expected instead of 'm'. (The parser apparently tries to match the
> > 'html' against 'http://'.) But why?
> > 
> > Does anyone have an idea how I can tell the lexer to search for 'http://'?
> 
> I suppose that you need to set the option filter=true; to implicitly
> discard all text that is not of interest to you. Otherwise your first
> grammar looks fine.
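
As I understand the filter suggestion, it would mean a lexer grammar roughly
like the one below - the grammar name and the URL rule are only my guess at
what is meant, I have not tried it:

lexer grammar UrlFilter;
options { filter=true; }    // input that matches no lexer rule is silently skipped

// take everything from 'http://' up to the next whitespace as one token
URL : 'http://' (~(' ' | '\t' | '\r' | '\n'))+ ;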


No, unfortunately I cannot discard the remaining text. I know the grammar looks fine - but it does not work. I get the same incorrect behavior if I set backtrack=true; at the beginning of the grammar. Do you know how I can set the backtrack option for only a single rule in antlr v3, along these lines:

rule
(options {backtrack=true;})
    : alter1 | ... | alterN
    ;
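
From the examples I have seen so far, my guess is that the rule-level options
block goes directly after the rule name, without the parentheses - but I have
not been able to confirm that this is the intended syntax (alternatives elided):

rule
options { backtrack = true; }   // rule-level options block, no parentheses
    : alter1
    | alterN                    // further alternatives elided
    ;

Is that the right way, or is there another mechanism for this?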


> 
> > And when I tried to put the 'http://' into this parser rule (instead of
> > using HTTP_INDICATOR) I got an exception. Is it true that I cannot use
> > literals in parser rules (I get an exception every time)? But the
> > examples for antlr v3 do use literals in parser rules?!
> 
> Parser rules may contain literals BUT not exclusively! You have to call
> another rule in a parser rule, otherwise it is a lexer rule.
> 
> Best regards,
> Johannes Luber
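
If I understand the point about literals correctly, a rule like url below
would still count as a parser rule, because besides the literal it also
references the host rule - the grammar and rule names here are only made up
for illustration:

grammar UrlLiterals;   // hypothetical name, just for illustration

url  : 'http://' host ;                      // a literal plus a reference to another rule
host : NAME ('.' NAME)* ;
NAME : ('a'..'z' | 'A'..'Z' | '0'..'9')+ ;   // lexer rule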


Best and thanks,
Markus

