[antlr-interest] How to match a phrase i.e. multiple words?

Gavin Lambert antlr at mirality.co.nz
Sat Feb 14 23:11:37 PST 2009


At 04:18 15/02/2009, Swaroop C H wrote:
 >    PHRASE
 >        : '"' WORD '"' { $text = $WORD.text }
 >        ;
 >
 >    WORD
 >        : ( 'a'..'z' | 'A'..'Z' | '.' )+
 >        ;
 >
 >    WHITESPACE
 >        :   (' '|'\t'|'\n'|'\r')+ { self.skip() }
 >        ;
 >
 >The problem is that I'm unable to proceed from here. If I put
 >
 >    PHRASE
 >        : '"' w=WORD+ '"' { $text = $w.text }
 >        ;
 >
 >Then I get the following error:
 >
 >    ANTLR Parser Generator  Version 3.1.1
 >    line 1:22 mismatched character u' ' expecting u'"'
 >    line 1:28 required (...)+ loop did not match anything at
 >character '<EOF>'
 >    line 1:23 missing PHRASE at u'todo'
 >    description = <missing PHRASE>

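Since PHRASE is a lexer rule, it is at the same "level" as the
WHITESPACE rule: it matches raw characters, so the spaces between
the words are still in the character stream and WORD+ can't get
past them (hence the "mismatched character ' '" error).  If PHRASE
were a parser rule instead, the WHITESPACE tokens would already
have been skipped before it ever saw the input.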

If you want to retain the discrete WORD tokens, then you could 
change PHRASE into a parser rule.  I suspect that's not really 
what you want, though.
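
For what it's worth, a rough sketch of that parser-rule version
might look like the following.  This assumes a combined grammar
(so the quote literal gets an implicit token), that the PHRASE
lexer rule has been removed, and that WORD and WHITESPACE stay as
you have them; the rule name "phrase" is only illustrative:

phrase
   : '"' WORD+ '"'
   ;

Here the lexer has already skipped the WHITESPACE tokens, so the
parser only sees the quote and WORD tokens and the spaces between
the words no longer get in the way.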

The simplest thing to do is to define a PHRASE as anything at all
within quotes (in ANTLR 3 a .* loop is non-greedy by default, so
it stops at the first closing quote):

PHRASE
   : '"' .* '"'
   ;

If you want to restrict the accepted characters, though, then you 
could use something like this:

PHRASE
   : '"' (~('\r' | '\n' | '"'))* '"'
   ;

or this:

PHRASE
   : '"' (WORD | ' ' | '\t')* '"'
   ;

It's usually best, though, to let your lexer be fairly tolerant
and to raise errors about invalid content at parse or tree-walk
time instead, where you have more context to report them.


