[antlr-interest] Scanning Perl-style quoted strings q{foo{bar}quux}?!

Ralf S. Engelschall rse+antlr-interest at engelschall.com
Thu Jul 30 10:26:43 PDT 2009

On Thu, Jul 30, 2009, Ralf S. Engelschall wrote:

> [...]
> My current solution is now:
>                     /* Perl-style quoted string */
> fragment QS_ANGLE   : '<' (('\\' '<') => '\\' '<' | QS_ANGLE | ~('<' | '>'))* '>';
> fragment QS_BRACE   : '{' (('\\' '{') => '\\' '{' | QS_BRACE | ~('{' | '}'))* '}';
> fragment QS_BRACK   : '[' (('\\' '[') => '\\' '[' | QS_BRACK | ~('[' | ']'))* ']';
> fragment QS_PAREN   : '(' (('\\' '(') => '\\' '(' | QS_PAREN | ~('(' | ')'))* ')';
> fragment QS_OTHER_CH: ~('<'|'>'|'{'|'}'|'['|']'|'('|')'|'a'..'z'|'A'..'Z'|'0'..'9');
> fragment QS_OTHER   : delimiter=QS_OTHER_CH
>                       ( '\\' { input.LT(1) == $delimiter.text.charAt(0) }? => .
>                       |      { input.LT(1) != $delimiter.text.charAt(0) }? => .
>                       )*
>                       { input.LT(1) == $delimiter.text.charAt(0) }? => .;
> This already correctly recognizes all qX...X constructs. I now just have
> to filter out the escape sequences and remove the leading qX and the
> trailing X...

After many attempts to add the escape character filtering I now also
have a solution for this:

                  /* Perl-style quoted string */
QSTRING           @init { StringBuilder sb = new StringBuilder(); }
                  : 'q'
                    ( '<' QS_ANGLE[sb] '>'
                    | '{' QS_BRACE[sb] '}'
                    | '[' QS_BRACK[sb] ']'
                    | '(' QS_PAREN[sb] ')'
                    | QS_OTHER[sb]
                    )                                        { setText(sb.toString()); }
                  ;
fragment QS_ANGLE [StringBuilder sb]
                  : ( ('\\' '<') => '\\' c='<'               { sb.appendCodePoint($c); }
                    | ('\\' '>') => '\\' c='>'               { sb.appendCodePoint($c); }
                    | c='<'                                  { sb.appendCodePoint($c); }
                      QS_ANGLE[sb]
                      c='>'                                  { sb.appendCodePoint($c); }
                    | c=~('<' | '>')                         { sb.appendCodePoint($c); }
                    )*
                  ;
fragment QS_BRACE [StringBuilder sb]
                  : ( ('\\' '{') => '\\' c='{'               { sb.appendCodePoint($c); }
                    | ('\\' '}') => '\\' c='}'               { sb.appendCodePoint($c); }
                    | c='{'                                  { sb.appendCodePoint($c); }
                      QS_BRACE[sb]
                      c='}'                                  { sb.appendCodePoint($c); }
                    | c=~('{' | '}')                         { sb.appendCodePoint($c); }
                    )*
                  ;
fragment QS_BRACK [StringBuilder sb]
                  : ( ('\\' '[') => '\\' c='['               { sb.appendCodePoint($c); }
                    | ('\\' ']') => '\\' c=']'               { sb.appendCodePoint($c); }
                    | c='['                                  { sb.appendCodePoint($c); }
                      QS_BRACK[sb]
                      c=']'                                  { sb.appendCodePoint($c); }
                    | c=~('[' | ']')                         { sb.appendCodePoint($c); }
                    )*
                  ;
fragment QS_PAREN [StringBuilder sb]
                  : ( ('\\' '(') => '\\' c='('               { sb.appendCodePoint($c); }
                    | ('\\' ')') => '\\' c=')'               { sb.appendCodePoint($c); }
                    | c='('                                  { sb.appendCodePoint($c); }
                      QS_PAREN[sb]
                      c=')'                                  { sb.appendCodePoint($c); }
                    | c=~('(' | ')')                         { sb.appendCodePoint($c); }
                    )*
                  ;
fragment QS_OTHER [StringBuilder sb]
                  : d=('!'|'"'|'#'|'$'|'%'|'&'
                    ( { input.LT(1) == '\\' &&
                        input.LT(2) == $d     }? => '\\' c=. { sb.appendCodePoint($c); }
                    | { input.LT(1) != $d     }? =>      c=. { sb.appendCodePoint($c); }
                    )*
                    { input.LT(1) == $d }? => .
                  ;

This is now the most complete set of ANTLR lexer rules I have for
scanning Perl-style q/.../ constructs. Unfortunately, it is no longer
short and concise, but it finally seems to work as expected.
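
For anyone who wants to cross-check the lexer output, the intended
behaviour of the rules above (strip the leading qX and trailing X, keep
nested bracket pairs, and drop the backslash in front of an escaped
delimiter) can be sketched in plain Java without ANTLR. This is only an
illustration under those assumptions: the class QStringSketch and the
method unquote are hypothetical names and are not part of the grammar.

```java
class QStringSketch {
    /** Returns the unescaped body of a q-string such as q{foo{bar}quux}. */
    static String unquote(String s) {
        if (s.length() < 3 || s.charAt(0) != 'q')
            throw new IllegalArgumentException("not a q-string: " + s);
        char open = s.charAt(1);
        int idx = "<{[(".indexOf(open);
        boolean paired = idx >= 0;                  // bracket-style delimiters may nest
        char close = paired ? ">}])".charAt(idx) : open;
        StringBuilder sb = new StringBuilder();
        int depth = 0;
        int i = 2;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()
                    && (s.charAt(i + 1) == open || s.charAt(i + 1) == close)) {
                sb.append(s.charAt(i + 1));         // escaped delimiter: drop the backslash
                i += 2;
            } else if (paired && c == open) {
                depth++;                            // nested opening bracket stays in the text
                sb.append(c);
                i++;
            } else if (c == close) {
                if (depth == 0) {                   // this is the terminating delimiter
                    if (i != s.length() - 1)
                        throw new IllegalArgumentException("trailing input: " + s);
                    return sb.toString();
                }
                depth--;                            // nested closing bracket stays in the text
                sb.append(c);
                i++;
            } else {
                sb.append(c);                       // ordinary character, including a lone backslash
                i++;
            }
        }
        throw new IllegalArgumentException("unterminated q-string: " + s);
    }

    public static void main(String[] args) {
        System.out.println(unquote("q{foo{bar}quux}"));   // prints foo{bar}quux
        System.out.println(unquote("q!a\\!b!"));          // prints a!b
    }
}
```

Feeding the same inputs through QSTRING and comparing getText() against
unquote() should be a quick way to spot differences in escape handling.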

Thanks for all the help.
                                       Ralf S. Engelschall
                                       rse at engelschall.com

