[antlr-interest] Scanning Perl-style quoted strings q{foo{bar}quux}?!

Ralf S. Engelschall rse+antlr-interest at engelschall.com
Thu Jul 30 01:13:18 PDT 2009


On Wed, Jul 29, 2009, David-Sarah Hopwood wrote:

> [...]
> (I'm assuming, without knowing Perl very well, that only the delimiters
> that appear on the "outside" have to nest, e.g. q{foo[{bar}quux} is valid.)

Yes, exactly. Only the "X" in qX...X has to nest correctly. Anything
in between is taken as-is.
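
To make the nesting rule concrete, here is a minimal hand-written sketch in Java, independent of ANTLR. The class name QNesting and the method closingIndex are hypothetical, and the sketch deliberately ignores backslash escapes; it only shows that the outer delimiter pair is the one being counted, while other bracket characters pass through untouched.

```java
public final class QNesting {
    // The four bracketing pairs; any other delimiter closes with itself.
    private static final String OPENERS = "<{[(";
    private static final String CLOSERS = ">}])";

    /**
     * Returns the index of the delimiter that closes the one at
     * s.charAt(open), or -1 if the string is unterminated.
     * Assumption: backslash escapes are not handled in this sketch.
     */
    public static int closingIndex(String s, int open) {
        char opener = s.charAt(open);
        int pair = OPENERS.indexOf(opener);
        if (pair < 0) {
            // Non-bracket delimiter (q/.../, q!...!): next occurrence closes.
            return s.indexOf(opener, open + 1);
        }
        char closer = CLOSERS.charAt(pair);
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == opener) {
                depth++;                      // only the outer pair is counted
            } else if (c == closer && --depth == 0) {
                return i;                     // matching closer found
            }
        }
        return -1;
    }
}
```

With this sketch, closingIndex("q{foo[{bar}quux}", 1) finds the final "}" even though the "[" is never closed, because only "{" and "}" take part in the count.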

> > Remains the question: what is the best way to implement this in ANTLR 3?
>
> Remember that lexer rules can be recursive, so you don't have to explicitly
> keep track of nesting depth. The following approach (untested) is more
> declarative, and incidentally avoids the problem you encountered:
>
> QSTRING
>   : 'q' ( AngleQS | BraceQS | BrackQS | ParenQS | SlashQS | BangQS ) ;
>
> fragment AngleQS
>   : '<' ( AngleQS | ~('<' | '>') )* '>' ;
>
> fragment BraceQS
>   : '{' ( BraceQS | ~('{' | '}') )* '}' ;
>
> fragment BrackQS
>   : '[' ( BrackQS | ~('[' | ']') )* ']' ;
>
> fragment ParenQS
>   : '(' ( ParenQS | ~('(' | ')') )* ')' ;
>
> fragment SlashQS
>   : '/' ( SlashQS | ~'/' )* '/' ;
>
> fragment BangQS
>   : '!' ( BangQS | ~'!' )* '!' ;

Hmmmm.... interesting approach. Many thanks for the hint about the
recursion possibility in lexer rules.

The one remaining problem is that, although the bracketing delimiters
are a fixed set of 4 pairs, the "/" and "!" were just examples. Any
other punctuation character can be used as well, for instance q%...%,
q=...=, etc. But here semantic predicates can help again, I think. My
current solution is now:

                    /* Perl-style quoted string */
QSTRING             : 'q' (QS_ANGLE | QS_BRACE | QS_BRACK | QS_PAREN | QS_OTHER);
fragment QS_ANGLE   : '<' (('\\' '<') => '\\' '<' | QS_ANGLE | ~('<' | '>'))* '>';
fragment QS_BRACE   : '{' (('\\' '{') => '\\' '{' | QS_BRACE | ~('{' | '}'))* '}';
fragment QS_BRACK   : '[' (('\\' '[') => '\\' '[' | QS_BRACK | ~('[' | ']'))* ']';
fragment QS_PAREN   : '(' (('\\' '(') => '\\' '(' | QS_PAREN | ~('(' | ')'))* ')';
fragment QS_OTHER_CH: ~('<'|'>'|'{'|'}'|'['|']'|'('|')'|'a'..'z'|'A'..'Z'|'0'..'9');
fragment QS_OTHER   : delimiter=QS_OTHER_CH
                      ( '\\' { input.LT(1) == $delimiter.text.charAt(0) }? => .
                      |      { input.LT(1) != $delimiter.text.charAt(0) }? => .
                      )*
                      { input.LT(1) == $delimiter.text.charAt(0) }? => .;

This already correctly recognizes all qX...X constructs. I now just have
to filter out the escape sequences and remove the leading qX and the
trailing X...
                                       Ralf S. Engelschall
                                       rse at engelschall.com
                                       www.engelschall.com
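
The remaining post-processing step (dropping the leading qX and the trailing X and filtering the escapes) could be sketched as a small Java helper, callable from a lexer action or from later tree processing. The class name QStrings and the method content are hypothetical. The sketch assumes that a backslash is removed only when it directly precedes the opening or closing delimiter; every other backslash is kept as-is, which may or may not match the escape rules you finally settle on.

```java
public final class QStrings {
    private static final String OPENERS = "<{[(";
    private static final String CLOSERS = ">}])";

    /**
     * Strips the leading "qX" and the trailing delimiter from a complete
     * qX...X token and removes backslashes that escape a delimiter.
     * Assumption: only backslash + delimiter counts as an escape sequence.
     */
    public static String content(String token) {
        if (token.length() < 3 || token.charAt(0) != 'q') {
            throw new IllegalArgumentException("not a qX...X token: " + token);
        }
        char open = token.charAt(1);
        int pair = OPENERS.indexOf(open);
        char close = pair >= 0 ? CLOSERS.charAt(pair) : open;
        int end = token.length() - 1;          // index of the trailing delimiter
        if (token.charAt(end) != close) {
            throw new IllegalArgumentException("unterminated token: " + token);
        }
        StringBuilder out = new StringBuilder();
        for (int i = 2; i < end; i++) {
            char c = token.charAt(i);
            boolean escapesDelimiter = c == '\\' && i + 1 < end
                    && (token.charAt(i + 1) == open || token.charAt(i + 1) == close);
            if (escapesDelimiter) {
                continue;                      // drop the backslash, keep the delimiter
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

For example, content("q{foo{bar}quux}") yields foo{bar}quux, and the token q%50\%% yields 50%.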


