[antlr-interest] Appending lexer rule disables rule further above. Bug?

Thu May 7 04:21:28 PDT 2009

I'd like to write a parser for some legacy data set definition files.
Here is a contrived example what the definitions look like:

        // A sample data type definition
        Type foo.bar.Strange name with 1000 weird chars: $\\d[] 
        {
            required searchable String Simple Title String field
            required searchable String Configured String field {
                maxValues   : 10
                validValues : The 1st required searchable value,
                              The 2nd value with no optional \, at end
                              The 3rd weird valid value ^%#${
                             "The 4th value, with optional quotes"
            }

            required searchable String 2nd Configured String field
            {
                maxValues     : 0   // i.e. infinite number of values
                validation {
                    maxLength : 50
                    pattern   : \d\d-[A-Z]+-.*
                }
                defaultValues : 01-ABC-An arbitrary char sequence,
                                02-DEF-Even \, is allowed,
                                03-GHI-And 1000 more variations
                displayWidth  : 30
            }
        }

I wanted to start with the lexer and as NEWLINE is significant at some parts I started
withChris Lambrou's JavaScript grammar http://antlr.org/grammar/1206736738015/JavaScript.g
which uses explicit line end tokens. I removed all the parser rules and looked what the
lexer would produce for my data type definition. Of course it produced at lot of
"no viable alternative at character ..." output. So I thought by adding the "catch all"
rule

Word
    : ((~NonWordChar)|'\\' .)+
    ;

fragment NonWordChar
    : LineTermChar
    | WhiteSpaceChar
    | '\\'
    | ','
    | '{'
    | '}'
    ;

at the end of the grammar file would fix some of these errors. But to my surprise it
didn't change very much. This rule for some reason causes an ambibuity between these
rules

Comment
    : '/*' (options {greedy=false;} : .)* '*/' //{ $channel=HIDDEN; }
    ;

LineComment
    : '//' (~LineTermChar)* //{ $channel=HIDDEN; }
    ;

so that line comments are not recognized any more. ANTLR writes the following warning

     warning(209): JavaScript.g:609:1: Multiple token rules can match input such as "'/'": Comment, LineComment, Word
     As a result, token(s) LineComment,Word were disabled for that input

Also characters escaped by a \ do not work as intended. It seems that the lexer insists on
recognizing a UNICODE escape \uXXXX when it encounters a \ and complains if something else
than an u follows the \.

Is this an intended behavior? My understandig was that LL(*) would mean that these kind
of ambuigities can not arise. Am I wrong? Does somebody understand this behavior and
can explain this to me and maybe even give some guidance how to overcome this problem.
Is ANTLR the right tool to create a parser for these weird data type definitions?

Thanks,
Nukiti

%%%%%%%%%%%%%%%%%%%%%%%%%%%% THE GRAMMAR %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

grammar Type;

options {
    language = Java;
}

@header {
package test.antlr.test;
}

@members {
    private static char nl = '\n';

    private void prtok(int type, String value) {
        System.out.println("Parsing: " + tokenNames[type] + ": " + value);
    }

    public static void main(String[] args) throws Exception {
        String inputStr =
        "// A sample data type definition"                                + nl +
        "Type foo.bar.Strange name with 1000 weird chars: $\\d[]"         + nl +
        "{"                                                               + nl +
        "    required searchable String Simple Title String field"        + nl +
        "    required searchable String Configured String field {"        + nl +
        "        maxValues   : 10"                                        + nl +
        "        validValues : The 1st required searchable value,"        + nl +
        "                      The 2nd value with no optional \\, at end" + nl +
        "                      The 3rd weird valid value ^%#${"           + nl +
        "                    \"The 4th value, with optional quotes\""     + nl +
        "    }"                                                           + nl +
        ""                                                                + nl +
        "    required searchable String 2nd Configured String field"      + nl +
        "    {"                                                           + nl +
        "        maxValues     : 0   // i.e. infinite number of values"   + nl +
        "        validation {"                                            + nl +
        "            maxLength : 50"                                      + nl +
        "            pattern   : \\d\\d-[A-Z]+-.*"                        + nl +
        "        }\n"                                                     + nl +
        "        defaultValues : 01-ABC-An arbitrary char sequence,"      + nl +
        "                        02-DEF-Even \\, is allowed,"             + nl +
        "                        03-GHI-And 1000 more variations"         + nl +
        "        displayWidth  : 30"                                      + nl +
        "    }\n"                                                         + nl +
        "}"                                                               + nl ;

        ANTLRStringStream input  = new ANTLRStringStream(inputStr); 
        TypeLexer         lexer  = new TypeLexer(input); 
        CommonTokenStream tokens = new CommonTokenStream(lexer); 
        TypeParser        parser = new TypeParser(tokens); 
        parser.stuff();
    } 
}

@lexer::header {
package test.antlr.test;
}

stuff
    : (c=. {prtok($c.type,$c.text);})+ EOF
    ;

// lexer rules.
StringLiteral
    : '"' DoubleStringCharacter* '"'
    | '\'' SingleStringCharacter* '\''
    ;

fragment DoubleStringCharacter
    : ~('"' | '\\' | LT)
    | '\\' EscapeSequence
    ;

fragment SingleStringCharacter
    : ~('\'' | '\\' | LT)   
    | '\\' EscapeSequence
    ;

fragment EscapeSequence
    : CharacterEscapeSequence
    | '0'
    | HexEscapeSequence
    | UnicodeEscapeSequence
    ;

fragment CharacterEscapeSequence
    : SingleEscapeCharacter
    | NonEscapeCharacter
    ;

fragment NonEscapeCharacter
    : ~(EscapeCharacter | LT)
    ;

fragment SingleEscapeCharacter
    : '\'' | '"' | '\\' | 'b' | 'f' | 'n' | 'r' | 't' | 'v'
    ;

fragment EscapeCharacter
    : SingleEscapeCharacter
    | DecimalDigit
    | 'x'
    | 'u'
    ;

fragment HexEscapeSequence
    : 'x' HexDigit HexDigit
    ;

fragment UnicodeEscapeSequence
    : 'u' HexDigit HexDigit HexDigit HexDigit
    ;

NumericLiteral
    : DecimalLiteral
    | HexIntegerLiteral
    ;

fragment HexIntegerLiteral
    : '0' ('x' | 'X') HexDigit+
    ;

fragment HexDigit
    : DecimalDigit | ('a'..'f') | ('A'..'F')
    ;

fragment DecimalLiteral
    : DecimalDigit+ '.' DecimalDigit* ExponentPart?
    | '.'? DecimalDigit+ ExponentPart?
    ;

fragment DecimalDigit
    : ('0'..'9')
    ;

fragment ExponentPart
    : ('e' | 'E') ('+' | '-') ? DecimalDigit+
    ;

Identifier
    : IdentifierStart IdentifierPart*
    ;

fragment IdentifierStart
    : UnicodeLetter
    | '$'
    | '_'
    | '\\' UnicodeEscapeSequence
    ;

fragment IdentifierPart
    : (IdentifierStart) => IdentifierStart // Avoids ambiguity, as some IdentifierStart chars also match following alternatives.
    | UnicodeDigit
    | UnicodeConnectorPunctuation
    ;

fragment UnicodeLetter      // Any character in the Unicode categories "Uppercase letter (Lu)", 
    : '\u0041'..'\u005A'    // "Lowercase letter (Ll)", "Titlecase letter (Lt)",
    | '\u0061'..'\u007A'    // "Modifier letter (Lm)", "Other letter (Lo)", or "Letter number (Nl)".
    | '\u00AA'
    | '\u00B5'
    | '\u00BA'
    | '\u00C0'..'\u00D6'
    | '\u00D8'..'\u00F6'
    | '\u00F8'..'\u021F'
    | '\u0222'..'\u0233'
    | '\u0250'..'\u02AD'
    | '\u02B0'..'\u02B8'
    | '\u02BB'..'\u02C1'
    | '\u02D0'..'\u02D1'
    | '\u02E0'..'\u02E4'
    | '\u02EE'
    | '\u037A'
    | '\u0386'
    | '\u0388'..'\u038A'
    | '\u038C'
    | '\u038E'..'\u03A1'
    | '\u03A3'..'\u03CE'
    | '\u03D0'..'\u03D7'
    | '\u03DA'..'\u03F3'
    | '\u0400'..'\u0481'
    | '\u048C'..'\u04C4'
    | '\u04C7'..'\u04C8'
    | '\u04CB'..'\u04CC'
    | '\u04D0'..'\u04F5'
    | '\u04F8'..'\u04F9'
    | '\u0531'..'\u0556'
    | '\u0559'
    | '\u0561'..'\u0587'
    | '\u05D0'..'\u05EA'
    | '\u05F0'..'\u05F2'
    | '\u0621'..'\u063A'
    | '\u0640'..'\u064A'
    | '\u0671'..'\u06D3'
    | '\u06D5'
    | '\u06E5'..'\u06E6'
    | '\u06FA'..'\u06FC'
    | '\u0710'
    | '\u0712'..'\u072C'
    | '\u0780'..'\u07A5'
    | '\u0905'..'\u0939'
    | '\u093D'
    | '\u0950'
    | '\u0958'..'\u0961'
    | '\u0985'..'\u098C'
    | '\u098F'..'\u0990'
    | '\u0993'..'\u09A8'
    | '\u09AA'..'\u09B0'
    | '\u09B2'
    | '\u09B6'..'\u09B9'
    | '\u09DC'..'\u09DD'
    | '\u09DF'..'\u09E1'
    | '\u09F0'..'\u09F1'
    | '\u0A05'..'\u0A0A'
    | '\u0A0F'..'\u0A10'
    | '\u0A13'..'\u0A28'
    | '\u0A2A'..'\u0A30'
    | '\u0A32'..'\u0A33'
    | '\u0A35'..'\u0A36'
    | '\u0A38'..'\u0A39'
    | '\u0A59'..'\u0A5C'
    | '\u0A5E'
    | '\u0A72'..'\u0A74'
    | '\u0A85'..'\u0A8B'
    | '\u0A8D'
    | '\u0A8F'..'\u0A91'
    | '\u0A93'..'\u0AA8'
    | '\u0AAA'..'\u0AB0'
    | '\u0AB2'..'\u0AB3'
    | '\u0AB5'..'\u0AB9'
    | '\u0ABD'
    | '\u0AD0'
    | '\u0AE0'
    | '\u0B05'..'\u0B0C'
    | '\u0B0F'..'\u0B10'
    | '\u0B13'..'\u0B28'
    | '\u0B2A'..'\u0B30'
    | '\u0B32'..'\u0B33'
    | '\u0B36'..'\u0B39'
    | '\u0B3D'
    | '\u0B5C'..'\u0B5D'
    | '\u0B5F'..'\u0B61'
    | '\u0B85'..'\u0B8A'
    | '\u0B8E'..'\u0B90'
    | '\u0B92'..'\u0B95'
    | '\u0B99'..'\u0B9A'
    | '\u0B9C'
    | '\u0B9E'..'\u0B9F'
    | '\u0BA3'..'\u0BA4'
    | '\u0BA8'..'\u0BAA'
    | '\u0BAE'..'\u0BB5'
    | '\u0BB7'..'\u0BB9'
    | '\u0C05'..'\u0C0C'
    | '\u0C0E'..'\u0C10'
    | '\u0C12'..'\u0C28'
    | '\u0C2A'..'\u0C33'
    | '\u0C35'..'\u0C39'
    | '\u0C60'..'\u0C61'
    | '\u0C85'..'\u0C8C'
    | '\u0C8E'..'\u0C90'
    | '\u0C92'..'\u0CA8'
    | '\u0CAA'..'\u0CB3'
    | '\u0CB5'..'\u0CB9'
    | '\u0CDE'
    | '\u0CE0'..'\u0CE1'
    | '\u0D05'..'\u0D0C'
    | '\u0D0E'..'\u0D10'
    | '\u0D12'..'\u0D28'
    | '\u0D2A'..'\u0D39'
    | '\u0D60'..'\u0D61'
    | '\u0D85'..'\u0D96'
    | '\u0D9A'..'\u0DB1'
    | '\u0DB3'..'\u0DBB'
    | '\u0DBD'
    | '\u0DC0'..'\u0DC6'
    | '\u0E01'..'\u0E30'
    | '\u0E32'..'\u0E33'
    | '\u0E40'..'\u0E46'
    | '\u0E81'..'\u0E82'
    | '\u0E84'
    | '\u0E87'..'\u0E88'
    | '\u0E8A'
    | '\u0E8D'
    | '\u0E94'..'\u0E97'
    | '\u0E99'..'\u0E9F'
    | '\u0EA1'..'\u0EA3'
    | '\u0EA5'
    | '\u0EA7'
    | '\u0EAA'..'\u0EAB'
    | '\u0EAD'..'\u0EB0'
    | '\u0EB2'..'\u0EB3'
    | '\u0EBD'..'\u0EC4'
    | '\u0EC6'
    | '\u0EDC'..'\u0EDD'
    | '\u0F00'
    | '\u0F40'..'\u0F6A'
    | '\u0F88'..'\u0F8B'
    | '\u1000'..'\u1021'
    | '\u1023'..'\u1027'
    | '\u1029'..'\u102A'
    | '\u1050'..'\u1055'
    | '\u10A0'..'\u10C5'
    | '\u10D0'..'\u10F6'
    | '\u1100'..'\u1159'
    | '\u115F'..'\u11A2'
    | '\u11A8'..'\u11F9'
    | '\u1200'..'\u1206'
    | '\u1208'..'\u1246'
    | '\u1248'
    | '\u124A'..'\u124D'
    | '\u1250'..'\u1256'
    | '\u1258'
    | '\u125A'..'\u125D'
    | '\u1260'..'\u1286'
    | '\u1288'
    | '\u128A'..'\u128D'
    | '\u1290'..'\u12AE'
    | '\u12B0'
    | '\u12B2'..'\u12B5'
    | '\u12B8'..'\u12BE'
    | '\u12C0'
    | '\u12C2'..'\u12C5'
    | '\u12C8'..'\u12CE'
    | '\u12D0'..'\u12D6'
    | '\u12D8'..'\u12EE'
    | '\u12F0'..'\u130E'
    | '\u1310'
    | '\u1312'..'\u1315'
    | '\u1318'..'\u131E'
    | '\u1320'..'\u1346'
    | '\u1348'..'\u135A'
    | '\u13A0'..'\u13B0'
    | '\u13B1'..'\u13F4'
    | '\u1401'..'\u1676'
    | '\u1681'..'\u169A'
    | '\u16A0'..'\u16EA'
    | '\u1780'..'\u17B3'
    | '\u1820'..'\u1877'
    | '\u1880'..'\u18A8'
    | '\u1E00'..'\u1E9B'
    | '\u1EA0'..'\u1EE0'
    | '\u1EE1'..'\u1EF9'
    | '\u1F00'..'\u1F15'
    | '\u1F18'..'\u1F1D'
    | '\u1F20'..'\u1F39'
    | '\u1F3A'..'\u1F45'
    | '\u1F48'..'\u1F4D'
    | '\u1F50'..'\u1F57'
    | '\u1F59'
    | '\u1F5B'
    | '\u1F5D'
    | '\u1F5F'..'\u1F7D'
    | '\u1F80'..'\u1FB4'
    | '\u1FB6'..'\u1FBC'
    | '\u1FBE'
    | '\u1FC2'..'\u1FC4'
    | '\u1FC6'..'\u1FCC'
    | '\u1FD0'..'\u1FD3'
    | '\u1FD6'..'\u1FDB'
    | '\u1FE0'..'\u1FEC'
    | '\u1FF2'..'\u1FF4'
    | '\u1FF6'..'\u1FFC'
    | '\u207F'
    | '\u2102'
    | '\u2107'
    | '\u210A'..'\u2113'
    | '\u2115'
    | '\u2119'..'\u211D'
    | '\u2124'
    | '\u2126'
    | '\u2128'
    | '\u212A'..'\u212D'
    | '\u212F'..'\u2131'
    | '\u2133'..'\u2139'
    | '\u2160'..'\u2183'
    | '\u3005'..'\u3007'
    | '\u3021'..'\u3029'
    | '\u3031'..'\u3035'
    | '\u3038'..'\u303A'
    | '\u3041'..'\u3094'
    | '\u309D'..'\u309E'
    | '\u30A1'..'\u30FA'
    | '\u30FC'..'\u30FE'
    | '\u3105'..'\u312C'
    | '\u3131'..'\u318E'
    | '\u31A0'..'\u31B7'
    | '\u3400'
    | '\u4DB5'
    | '\u4E00'
    | '\u9FA5'
    | '\uA000'..'\uA48C'
    | '\uAC00'
    | '\uD7A3'
    | '\uF900'..'\uFA2D'
    | '\uFB00'..'\uFB06'
    | '\uFB13'..'\uFB17'
    | '\uFB1D'
    | '\uFB1F'..'\uFB28'
    | '\uFB2A'..'\uFB36'
    | '\uFB38'..'\uFB3C'
    | '\uFB3E'
    | '\uFB40'..'\uFB41'
    | '\uFB43'..'\uFB44'
    | '\uFB46'..'\uFBB1'
    | '\uFBD3'..'\uFD3D'
    | '\uFD50'..'\uFD8F'
    | '\uFD92'..'\uFDC7'
    | '\uFDF0'..'\uFDFB'
    | '\uFE70'..'\uFE72'
    | '\uFE74'
    | '\uFE76'..'\uFEFC'
    | '\uFF21'..'\uFF3A'
    | '\uFF41'..'\uFF5A'
    | '\uFF66'..'\uFFBE'
    | '\uFFC2'..'\uFFC7'
    | '\uFFCA'..'\uFFCF'
    | '\uFFD2'..'\uFFD7'
    | '\uFFDA'..'\uFFDC'
    ;

fragment UnicodeCombiningMark   // Any character in the Unicode categories "Non-spacing mark (Mn)"
    : '\u0300'..'\u034E'    // or "Combining spacing mark (Mc)".
    | '\u0360'..'\u0362'
    | '\u0483'..'\u0486'
    | '\u0591'..'\u05A1'
    | '\u05A3'..'\u05B9'
    | '\u05BB'..'\u05BD'
    | '\u05BF' 
    | '\u05C1'..'\u05C2'
    | '\u05C4'
    | '\u064B'..'\u0655'
    | '\u0670'
    | '\u06D6'..'\u06DC'
    | '\u06DF'..'\u06E4'
    | '\u06E7'..'\u06E8'
    | '\u06EA'..'\u06ED'
    | '\u0711'
    | '\u0730'..'\u074A'
    | '\u07A6'..'\u07B0'
    | '\u0901'..'\u0903'
    | '\u093C'
    | '\u093E'..'\u094D'
    | '\u0951'..'\u0954'
    | '\u0962'..'\u0963'
    | '\u0981'..'\u0983'
    | '\u09BC'..'\u09C4'
    | '\u09C7'..'\u09C8'
    | '\u09CB'..'\u09CD'
    | '\u09D7'
    | '\u09E2'..'\u09E3'
    | '\u0A02'
    | '\u0A3C'
    | '\u0A3E'..'\u0A42'
    | '\u0A47'..'\u0A48'
    | '\u0A4B'..'\u0A4D'
    | '\u0A70'..'\u0A71'
    | '\u0A81'..'\u0A83'
    | '\u0ABC'
    | '\u0ABE'..'\u0AC5'
    | '\u0AC7'..'\u0AC9'
    | '\u0ACB'..'\u0ACD'
    | '\u0B01'..'\u0B03'
    | '\u0B3C'
    | '\u0B3E'..'\u0B43'
    | '\u0B47'..'\u0B48'
    | '\u0B4B'..'\u0B4D'
    | '\u0B56'..'\u0B57'
    | '\u0B82'..'\u0B83'
    | '\u0BBE'..'\u0BC2'
    | '\u0BC6'..'\u0BC8'
    | '\u0BCA'..'\u0BCD'
    | '\u0BD7'
    | '\u0C01'..'\u0C03'
    | '\u0C3E'..'\u0C44'
    | '\u0C46'..'\u0C48'
    | '\u0C4A'..'\u0C4D'
    | '\u0C55'..'\u0C56'
    | '\u0C82'..'\u0C83'
    | '\u0CBE'..'\u0CC4'
    | '\u0CC6'..'\u0CC8'
    | '\u0CCA'..'\u0CCD'
    | '\u0CD5'..'\u0CD6'
    | '\u0D02'..'\u0D03'
    | '\u0D3E'..'\u0D43'
    | '\u0D46'..'\u0D48'
    | '\u0D4A'..'\u0D4D'
    | '\u0D57'
    | '\u0D82'..'\u0D83'
    | '\u0DCA'
    | '\u0DCF'..'\u0DD4'
    | '\u0DD6'
    | '\u0DD8'..'\u0DDF'
    | '\u0DF2'..'\u0DF3'
    | '\u0E31'
    | '\u0E34'..'\u0E3A'
    | '\u0E47'..'\u0E4E'
    | '\u0EB1'
    | '\u0EB4'..'\u0EB9'
    | '\u0EBB'..'\u0EBC'
    | '\u0EC8'..'\u0ECD'
    | '\u0F18'..'\u0F19'
    | '\u0F35'
    | '\u0F37'
    | '\u0F39'
    | '\u0F3E'..'\u0F3F'
    | '\u0F71'..'\u0F84'
    | '\u0F86'..'\u0F87'
    | '\u0F90'..'\u0F97'
    | '\u0F99'..'\u0FBC'
    | '\u0FC6'
    | '\u102C'..'\u1032'
    | '\u1036'..'\u1039'
    | '\u1056'..'\u1059'
    | '\u17B4'..'\u17D3'
    | '\u18A9'
    | '\u20D0'..'\u20DC'
    | '\u20E1'
    | '\u302A'..'\u302F'
    | '\u3099'..'\u309A'
    | '\uFB1E'
    | '\uFE20'..'\uFE23'
    ;

fragment UnicodeDigit       // Any character in the Unicode category "Decimal number (Nd)".
    : '\u0030'..'\u0039'
    | '\u0660'..'\u0669'
    | '\u06F0'..'\u06F9'
    | '\u0966'..'\u096F'
    | '\u09E6'..'\u09EF'
    | '\u0A66'..'\u0A6F'
    | '\u0AE6'..'\u0AEF'
    | '\u0B66'..'\u0B6F'
    | '\u0BE7'..'\u0BEF'
    | '\u0C66'..'\u0C6F'
    | '\u0CE6'..'\u0CEF'
    | '\u0D66'..'\u0D6F'
    | '\u0E50'..'\u0E59'
    | '\u0ED0'..'\u0ED9'
    | '\u0F20'..'\u0F29'
    | '\u1040'..'\u1049'
    | '\u1369'..'\u1371'
    | '\u17E0'..'\u17E9'
    | '\u1810'..'\u1819'
    | '\uFF10'..'\uFF19'
    ;

fragment UnicodeConnectorPunctuation    // Any character in the Unicode category "Connector punctuation (Pc)".
    : '\u005F'
    | '\u203F'..'\u2040'
    | '\u30FB'
    | '\uFE33'..'\uFE34'
    | '\uFE4D'..'\uFE4F'
    | '\uFF3F'
    | '\uFF65'
    ;

Comment
    : '/*' (options {greedy=false;} : .)* '*/' //{ $channel=HIDDEN; }
    ;

LineComment
    : '//' (~LineTermChar)* //{ $channel=HIDDEN; }
    ;

LT
    : LineTermChar
    ;

fragment LineTermChar
    : '\n'      // Line feed.
    | '\r'      // Carriage return.
    | '\u2028'  // Line separator.
    | '\u2029'  // Paragraph separator.
    ;

WhiteSpace // Tab, vertical tab, form feed, space, non-breaking space and any other unicode "space separator".
    : WhiteSpaceChar+ //{ $channel=HIDDEN; }
    ;

fragment WhiteSpaceChar
    : '\t' 
    | '\u000B' 
    | '\f' 
    | ' ' 
    | '\u00A0'
    ;

Word
    : ((~NonWordChar)|'\\' .)+
    ;

fragment NonWordChar
    : LineTermChar
    | WhiteSpaceChar
    | '\\'
    | ','
    | '{'
    | '}'
    ;