[antlr-interest] [newbie] Quoted identifiers vs. string literals

Eric researcher0x00 at gmail.com
Sun Mar 18 13:48:49 PDT 2012


Hi Chuck,

The grammar below worked for me on the input "test" Integer "01" and on a
few other basic tests.

Be careful with this grammar; it can easily cause a newcomer a lot of
problems. The main reason is that you have
1. Keywords
2. Identifiers
3. String Literals
4. Whitespace
whose character sets overlap: each one is either a subset of another or
shares part of another. I spent 90% of my time setting up the rules to keep
them corralled and in the right order.
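
To make the overlap concrete, here is a minimal Java sketch that approximates
three of the lexer rules with java.util.regex patterns and reports which of
them accept the same text. The class name TokenOverlap and the regexes are
assumptions made for illustration, not part of the grammar, and the
UnquotedString pattern leaves out escape sequences.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: regex approximations of three lexer rules.
class TokenOverlap {
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("TYPE", Pattern.compile("[A-Z][a-z]*"));
        RULES.put("ID", Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*"));
        // Approximation: anything except backslash, quote, space, tab.
        RULES.put("UnquotedString", Pattern.compile("[^\\\\\" \\t]*"));
    }

    // Names of every rule whose pattern accepts the whole text.
    static List<String> matchingRules(String text) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(text).matches()) {
                names.add(rule.getKey());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        for (String text : new String[] {"Integer", "test", "01"}) {
            System.out.println(text + " -> " + matchingRules(text));
        }
    }
}
```

In this approximation "Integer" is accepted by all three patterns, "test" by
ID and UnquotedString, and "01" only by UnquotedString, which is why the
order of the rules matters so much.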

The main changes I made were:
1. Pulled all of the string literals out of the parser rules.
2. Used Ter's example for keywords. See:
http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741
3. Created a TYPE lexer rule so that the types wouldn't become ID.
4. Changed the WS rule, mostly by adding +.
5. Pulled the quotes out as a separate token.
6. Moved UnquotedString to be the last rule, since it tries to consume
nearly everything.
7. Added space and tab to the negation set for UnquotedString. I avoid
negation in lexer rules like the plague; it always leads to a problem. The
UnquotedString rule can become the death of you if you don't respect it.
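
The effect of change 7 can be sketched in plain Java. The two regexes below
only approximate the negated character set, without and with space and tab
excluded; the class name NegationDemo and the helper consumed() are
hypothetical and escape sequences are ignored.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of change 7 using regex approximations.
class NegationDemo {
    // Negated set without whitespace: anything except backslash and quote.
    static final Pattern LOOSE = Pattern.compile("[^\\\\\"]*");
    // Negated set with space and tab excluded as well.
    static final Pattern TIGHT = Pattern.compile("[^\\\\\" \\t]*");

    // Prefix of the input that the pattern consumes starting at offset 0.
    static String consumed(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.lookingAt() ? m.group() : "";
    }

    public static void main(String[] args) {
        String input = "Integer \"01\"";
        System.out.println("[" + consumed(LOOSE, input) + "]");
        System.out.println("[" + consumed(TIGHT, input) + "]");
    }
}
```

With the looser set the match runs through the trailing space up to the
quote, so the space never gets a chance to become a WS token; with space and
tab excluded the match stops at the end of the word.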

Also, using "Show Input Tokens" under the Run menu in ANTLRWorks revealed
that the space after the type and before the quote was not being pulled
out as a WS token, and that was causing a big problem.
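
For readers without ANTLRWorks at hand, the idea of looking at the token
stream can be imitated with a toy tokenizer. The sketch below is a
simplification for illustration: it tries regex approximations of the rules
in order and takes the first one that matches at the current position, which
is not how the ANTLR-generated lexer makes its decisions. The class name
TokenDump and all of the regexes are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy tokenizer: first rule (in insertion order) that matches a non-empty
// prefix at the current position wins. Escape sequences are not handled.
class TokenDump {
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("Quote", Pattern.compile("\""));
        RULES.put("TYPE", Pattern.compile("[A-Z][a-z]*"));
        RULES.put("ID", Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*"));
        RULES.put("WS", Pattern.compile("[ \\t\\r\\n\\f]+"));
        RULES.put("UnquotedString", Pattern.compile("[^\\\\\" \\t]*"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String found = null;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Skip empty matches so the loop always makes progress.
                if (m.lookingAt() && m.end() > pos) {
                    found = rule.getKey() + "(" + m.group() + ")";
                    pos = m.end();
                    break;
                }
            }
            if (found == null) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            tokens.add(found);
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("\"test\" Integer \"01\"")) {
            System.out.println(token);
        }
    }
}
```

The point to look for in the printed list is that the space between the type
and the opening quote shows up as its own WS entry rather than being glued
onto a neighbouring token.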

Enjoy, Eric



grammar Chuck001;
// Parser Rules
triplet : Quote ID Quote type Quote UnquotedString Quote ;

type :  keyINTEGER
 | keyBOOLEAN
 | keySTRING
 ;

keyBOOLEAN : {input.LT(1).getText().equals("Boolean")}? TYPE;
keyINTEGER  : {input.LT(1).getText().equals("Integer")}? TYPE;
keySTRING : {input.LT(1).getText().equals("String")}? TYPE;


// Lexer Rules
Quote : '"';
TYPE : ('A'..'Z')('a'..'z')*
 ;
ID  : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
 ;

COMMENT :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
 |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
 ;
WS :
 ( '\t'
 | ' '
 | '\r'
 | '\n'
 | '\u000C'
 ) + { $channel = HIDDEN; }
 ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
    |   UNICODE_ESC
    |   OCTAL_ESC
    ;
fragment
OCTAL_ESC
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;
fragment
UNICODE_ESC
    :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;

UnquotedString
    :   ( ESC_SEQ | ~('\\'|'"'|' '|'\t') )*
    ;
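
The keyBOOLEAN, keyINTEGER and keySTRING rules above let the lexer emit one
generic TYPE token and leave it to the parser to compare the token text with
the expected keyword. As a rough sketch of that check outside of ANTLR (the
class KeywordCheck and the enum ConfigType are hypothetical names, not
something the generated parser contains):

```java
// Hypothetical stand-in for the text comparison done by the keyXXX predicates.
class KeywordCheck {
    enum ConfigType { BOOLEAN, INTEGER, STRING }

    // Maps the text of a TYPE token to a keyword; the comparison is
    // case-sensitive, like String.equals in the predicates.
    static ConfigType classify(String typeTokenText) {
        switch (typeTokenText) {
            case "Boolean": return ConfigType.BOOLEAN;
            case "Integer": return ConfigType.INTEGER;
            case "String":  return ConfigType.STRING;
            default:
                throw new IllegalArgumentException("not a type keyword: " + typeTokenText);
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("Integer"));
    }
}
```

A TYPE token whose text is not one of the three keywords, for example
"Foo", matches the lexer rule but is rejected by this check, which mirrors
all three predicates failing in the type rule.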

On Sun, Mar 18, 2012 at 3:01 PM, Charles Daniels <cjdaniels4 at gmail.com> wrote:

> Hi Eric,
>
> Thanks for the quick response. I have downloaded ANTLRWorks 1.4.2 and
> created a fresh grammar using some of the defaults that the tool generates.
> Below is my grammar.
>
> This grammar successfully parses the following input:
>
> name String "value"
>
>
> However, I want to modify this grammar so that it will successfully parse
> the following input instead:
>
> "name" String "value"
>
>
> In attempting to do this, I modified the grammar below by adding double
> quotes around ID, like so:
>
> ID  : '"' ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* '"'
>     ;
>
>
> However, parsing fails (MissingTokenException) for the desired input. I'm
> guessing that's because "value" is matched to ID rather than to STRING,
> when I add the quotes around ID.
>
> Is there any way to get "value" to match STRING instead of matching ID
> when I add quotes to ID? Will backtracking help? If so, how would I specify
> it?
>
> Thanks,
> Chuck
>
> --- BEGIN GRAMMAR ---
> grammar Config;
>
> triplet : ID type STRING
> ;
>  type : 'Boolean' | 'Integer' | 'String'
>  ;
>
> ID  : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
>     ;
>
> COMMENT
>     :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>     |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
>     ;
>
> WS  :   ( ' '
>         | '\t'
>         | '\r'
>         | '\n'
>         ) {$channel=HIDDEN;}
>     ;
>
> STRING
>     :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
>     ;
>
> fragment
> HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
>
> fragment
> ESC_SEQ
>     :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
>     |   UNICODE_ESC
>     |   OCTAL_ESC
>     ;
>
> fragment
> OCTAL_ESC
>     :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
>     |   '\\' ('0'..'7') ('0'..'7')
>     |   '\\' ('0'..'7')
>     ;
>
> fragment
> UNICODE_ESC
>     :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
>     ;
> --- END GRAMMAR ---
>
>
> On Sun, Mar 18, 2012 at 12:27 PM, Eric <researcher0x00 at gmail.com> wrote:
>
>> Hi Chuck,
>>
>> Off the top of my head, I would guess that STRINGLITERAL is trumping
>> IDENTIFIER. In other words, the lexer generates tokens, and the tokens are
>> created based on the rules in the lexer. Since STRINGLITERAL comes before
>> IDENTIFIER, anything that matches STRINGLITERAL will make a
>> STRINGLITERAL token even if STRINGLITERAL defines the same character
>> string patterns as IDENTIFIER, i.e. '"' ( ~('\\'|'"') )* '"' trumps '"'
>> IdentifierStart IdentifierPart* '"'
>>
>> Can you post your full grammar? I am having to guess at the parts marked
>> (copied from Java.g) and believe I have something different.
>>
>> Also, I strongly recommend ANTLRWorks 1.4.2 for a new user. Note that
>> this is not the latest version of ANTLRWorks, which is 1.4.3. I am not
>> recommending ANTLRWorks 1.4.3 because it cannot draw the NFA and DFA
>> diagrams due to a bug, and when a new user tries to do this and it doesn't
>> work they think they did something wrong, when in fact it is a bug from
>> ANTLR 3.4, which is used by ANTLRWorks 1.4.3.
>>
>> Also, you can search previous posts to the list by using
>> http://antlr.markmail.org/
>>
>> Hope that helps, Eric
>>
>>
>>
>>
>> On Sun, Mar 18, 2012 at 11:22 AM, Charles Daniels <cjdaniels4 at gmail.com> wrote:
>>
>>> I am very new to ANTLR and I am having trouble properly defining part of a
>>> grammar that I am constructing to recognize a specialized configuration
>>> file syntax (already defined, so I cannot change it).
>>>
>>> The trouble I am having is recognizing the following type of entry in the
>>> config file:
>>>
>>> "name" type "value"
>>>
>>>
>>> where the following rules apply:
>>>
>>>   1. The double quotes are a required part of the syntax, both for the
>>>   name and the value.
>>>   2. A "name" is essentially a Java identifier
>>>   3. A "value" is a string literal
>>>
>>>
>>> I have the following grammar so far:
>>>
>>> triplet : IDENTIFIER type STRINGLITERAL ;
>>>
>>> type : 'Boolean' | 'Integer' | 'String' ;
>>>
>>> STRINGLITERAL : (copied from Java.g)
>>>
>>> IDENTIFIER : '"' IdentifierStart IdentifierPart* '"' ;
>>>
>>> IdentifierStart : (copied from Java.g)
>>>
>>> IdentifierPart : (copied from Java.g)
>>>
>>> When I compile this grammar, ANTLR hangs. When I remove the double quotes
>>> from IDENTIFIER, it compiles successfully. I am guessing that when I
>>> include the double quotes in IDENTIFIER they are somehow causing the
>>> compilation to hang due to the double quotes that are also in
>>> STRINGLITERAL.
>>>
>>> Does anybody have any suggestions on how to define this so that I can use
>>> double quotes around names and values and the compiler doesn't hang?
>>>
>>> Thanks,
>>> Chuck
>>>
>>
>>
>

