[antlr-interest] Scanning Perl-style quoted strings q{foo{bar}quux}?!

Ralf S. Engelschall rse+antlr-interest at engelschall.com
Thu Jul 30 10:26:43 PDT 2009


On Thu, Jul 30, 2009, Ralf S. Engelschall wrote:

> [...]
> My current solution is now:
>
>                     /* Perl-style quoted string */
> QSTRING             : 'q' (QS_ANGLE | QS_BRACE | QS_BRACK | QS_PAREN | QS_OTHER);
> fragment QS_ANGLE   : '<' (('\\' '<') => '\\' '<' | QS_ANGLE | ~('<' | '>'))* '>';
> fragment QS_BRACE   : '{' (('\\' '{') => '\\' '{' | QS_BRACE | ~('{' | '}'))* '}';
> fragment QS_BRACK   : '[' (('\\' '[') => '\\' '[' | QS_BRACK | ~('[' | ']'))* ']';
> fragment QS_PAREN   : '(' (('\\' '(') => '\\' '(' | QS_PAREN | ~('(' | ')'))* ')';
> fragment QS_OTHER_CH: ~('<'|'>'|'{'|'}'|'['|']'|'('|')'|'a'..'z'|'A'..'Z'|'0'..'9');
> fragment QS_OTHER   : delimiter=QS_OTHER_CH
>                       ( '\\' { input.LT(1) == $delimiter.text.charAt(0) }? => .
>                       |      { input.LT(1) != $delimiter.text.charAt(0) }? => .
>                       )*
>                       { input.LT(1) == $delimiter.text.charAt(0) }? => .;
>
> This already correctly recognizes all qX...X constructs. I now just have
> to filter out the escape sequences and remove the leading qX and the
> trailing X...
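
The post-processing step mentioned in the quoted message (drop the
leading qX and the trailing X, then filter the escapes) could also be
done outside the grammar on the matched token text. The following is
only a minimal sketch in plain Java under some assumptions: the class
name QStringUnquote and the method unquote() are hypothetical and not
part of ANTLR, and the token text is assumed to be a complete, already
validated qX...X match. A backslash is dropped only in front of the
opening or closing delimiter; everything else is copied verbatim.

```java
// Hypothetical helper, not part of ANTLR: post-processes the full text
// of an already matched qX...X token.
final class QStringUnquote {
    static String unquote(String token) {
        if (token.length() < 3 || token.charAt(0) != 'q') {
            throw new IllegalArgumentException("not a q-string token: " + token);
        }
        char open = token.charAt(1);
        char close;
        switch (open) {
            case '<': close = '>'; break;
            case '{': close = '}'; break;
            case '[': close = ']'; break;
            case '(': close = ')'; break;
            default:  close = open; break;   // q/.../, q!...! and so on
        }
        StringBuilder sb = new StringBuilder();
        int end = token.length() - 1;        // index of the trailing delimiter
        for (int i = 2; i < end; i++) {
            char ch = token.charAt(i);
            // Drop a backslash only if it escapes one of the delimiters.
            if (ch == '\\' && i + 1 < end
                    && (token.charAt(i + 1) == open || token.charAt(i + 1) == close)) {
                ch = token.charAt(++i);
            }
            sb.append(ch);
        }
        return sb.toString();
    }
}
```

Such a helper could be called from a lexer action, for example as
setText(QStringUnquote.unquote(getText())), but that wiring is an
assumption and untested here.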

After many attempts to add the escape character filtering, I now have
a solution for this as well:

                  /* Perl-style quoted string */
QSTRING           @init { StringBuilder sb = new StringBuilder(); }
                  : 'q'
                    ( '<' QS_ANGLE[sb] '>'
                    | '{' QS_BRACE[sb] '}'
                    | '[' QS_BRACK[sb] ']'
                    | '(' QS_PAREN[sb] ')'
                    | QS_OTHER[sb]
                    )                                        { setText(sb.toString()); }
                  ;
fragment QS_ANGLE [StringBuilder sb]
                  : ( ('\\' '<') => '\\' c='<'               { sb.appendCodePoint($c); }
                    | ('\\' '>') => '\\' c='>'               { sb.appendCodePoint($c); }
                    | c='<'                                  { sb.appendCodePoint($c); }
                      QS_ANGLE[sb]
                      c='>'                                  { sb.appendCodePoint($c); }
                    | c=~('<' | '>')                         { sb.appendCodePoint($c); }
                    )*
                  ;
fragment QS_BRACE [StringBuilder sb]
                  : ( ('\\' '{') => '\\' c='{'               { sb.appendCodePoint($c); }
                    | ('\\' '}') => '\\' c='}'               { sb.appendCodePoint($c); }
                    | c='{'                                  { sb.appendCodePoint($c); }
                      QS_BRACE[sb]
                      c='}'                                  { sb.appendCodePoint($c); }
                    | c=~('{' | '}')                         { sb.appendCodePoint($c); }
                    )*
                  ;
fragment QS_BRACK [StringBuilder sb]
                  : ( ('\\' '[') => '\\' c='['               { sb.appendCodePoint($c); }
                    | ('\\' ']') => '\\' c=']'               { sb.appendCodePoint($c); }
                    | c='['                                  { sb.appendCodePoint($c); }
                      QS_BRACK[sb]
                      c=']'                                  { sb.appendCodePoint($c); }
                    | c=~('[' | ']')                         { sb.appendCodePoint($c); }
                    )*
                  ;
fragment QS_PAREN [StringBuilder sb]
                  : ( ('\\' '(') => '\\' c='('               { sb.appendCodePoint($c); }
                    | ('\\' ')') => '\\' c=')'               { sb.appendCodePoint($c); }
                    | c='('                                  { sb.appendCodePoint($c); }
                      QS_PAREN[sb]
                      c=')'                                  { sb.appendCodePoint($c); }
                    | c=~('(' | ')')                         { sb.appendCodePoint($c); }
                    )*
                  ;
fragment QS_OTHER [StringBuilder sb]
                  : d=('!'|'"'|'#'|'$'|'%'|'&'
                      |'\''|'*'|'+'|','|'-'|'.'|'/'
                      |':'|';'|'='|'?'|'@'|'\\'|'^'
                      |'_'|'`'|'|'|'~'
                      )
                    ( { input.LT(1) == '\\' &&
                        input.LT(2) == $d     }? => '\\' c=. { sb.appendCodePoint($c); }
                    | { input.LT(1) != $d     }? =>      c=. { sb.appendCodePoint($c); }
                    )*
                    { input.LT(1) == $d }? => .
                  ;

This is now the most complete set of ANTLR lexer rules I have for
scanning Perl-style q/.../ constructs: it handles nested bracket pairs,
arbitrary punctuation delimiters and backslash-escaped delimiters.
Unfortunately, it is no longer short and concise, but it finally seems
to work as expected.
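
For cross-checking the lexer output, the intended behaviour can be
written down as a small hand-written scanner. This is only a sketch
under assumptions, not a replacement for the grammar: the names
QStringScanner, Result and scan() are hypothetical, and the delimiter
check is simplified (any character that is not a letter, digit,
whitespace or closing bracket is accepted, which is broader than the
explicit list in QS_OTHER).

```java
// Hypothetical reference scanner mirroring the intent of the QSTRING rules.
final class QStringScanner {
    /** Unescaped content plus the index just past the closing delimiter. */
    static final class Result {
        final String text;
        final int end;
        Result(String text, int end) { this.text = text; this.end = end; }
    }

    static char closerFor(char open) {
        switch (open) {
            case '<': return '>';
            case '{': return '}';
            case '[': return ']';
            case '(': return ')';
            default:  return open;           // same character opens and closes
        }
    }

    /** Scans a q-string starting at 'start'; returns null if none matches. */
    static Result scan(String in, int start) {
        if (start + 1 >= in.length() || in.charAt(start) != 'q') return null;
        char open = in.charAt(start + 1);
        if (Character.isLetterOrDigit(open) || Character.isWhitespace(open)
                || ">}])".indexOf(open) >= 0) {
            return null;                     // simplified delimiter check (assumption)
        }
        char close = closerFor(open);
        boolean nests = open != close;       // only bracket pairs can nest
        StringBuilder sb = new StringBuilder();
        int depth = 0;
        int i = start + 2;
        while (i < in.length()) {
            char ch = in.charAt(i);
            if (ch == '\\' && i + 1 < in.length()
                    && (in.charAt(i + 1) == open || in.charAt(i + 1) == close)) {
                sb.append(in.charAt(i + 1)); // escaped delimiter: drop the backslash
                i += 2;
            } else if (ch == close && depth == 0) {
                return new Result(sb.toString(), i + 1);
            } else {
                if (nests && ch == open) depth++;
                else if (nests && ch == close) depth--;
                sb.append(ch);               // nested brackets are kept verbatim
                i++;
            }
        }
        return null;                         // unterminated string
    }
}
```

With this sketch, scanning the subject-line example q{foo{bar}quux}
yields the text foo{bar}quux, which is what the lexer rules above are
meant to produce via setText().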

Thanks for all the help.
                                       Ralf S. Engelschall
                                       rse at engelschall.com
                                       www.engelschall.com


