[antlr-interest] [newbie] Quoted identifiers vs. string literals

Sun Mar 18 14:19:46 PDT 2012

Hi Chuck,

Oops. I just realized that you probably want to have spaces in your
StringLiteral and the grammar I just gave you doesn't allow that.

I'll look at it some more.
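
One possible direction, sketched in plain Java rather than as a tested
grammar: treat the name and the value as the same kind of quoted string (so
spaces are allowed in the value), and check afterwards that the name is an
identifier. The class name TripletSketch and its methods are made up for
illustration, and escapes are accepted loosely as a backslash followed by
any character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not generated by ANTLR: one quoted-string shape for
// both name and value, with the identifier check done after matching.
class TripletSketch {
    // A quoted string: an escape, or any character except backslash and quote.
    private static final String QUOTED = "\"((?:\\\\.|[^\\\\\"])*)\"";
    private static final Pattern TRIPLET = Pattern.compile(
            "\\s*" + QUOTED + "\\s+(Boolean|Integer|String)\\s+" + QUOTED + "\\s*");

    // Returns {name, type, value}, or null if the line is not a valid triplet.
    static String[] parse(String line) {
        Matcher m = TRIPLET.matcher(line);
        if (!m.matches() || !isJavaIdentifier(m.group(1))) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    // True if s is non-empty and made only of Java identifier characters.
    static boolean isJavaIdentifier(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

The same idea in a grammar would mean a single string token for both
positions, with the identifier check moved out of the lexer.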

Eric

On Sun, Mar 18, 2012 at 4:48 PM, Eric <researcher0x00 at gmail.com> wrote:

> Hi Chuck,
>
> The grammar below worked for me for "test" Integer "01" and some other
> basic tests.
>
> Be careful with the grammar; it can easily cause a new person lots of
> problems. The main reason is that you have
> 1. Keywords
> 2. Identifiers
> 3. String Literals
> 4. Whitespace
> each of which is a subset of another or overlaps with one. I spent 90% of
> my time setting up the rules to keep them corralled and in the right order.
>
> The main changes I made were
> 1. pulled all of the string literals out of the parser rules
> 2. Used Ter's example for keywords. See:
> http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741
> 3. Created a TYPE lexer rule so that the types wouldn't become ID.
> 4. Changed the WS rule, mostly added +
> 5. Pulled the quotes out as a separate token
> 6. Moved UnquotedString to be the last rule since it tries to consume
> nearly everything.
> 7. Added space and tab to the negation rule for UnquotedString. I avoid
> negation in lexer rules like the plague; it always leads to a problem. The
> UnquotedString rule can become the death of you if you don't respect it.
>
> Also using ANTLRWorks "Show Input Tokens" under the run menu revealed that
> the space at the end of the type and before the quote was not being pulled
> out as a WS token and that was causing a big problem.
>
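To see why the order matters, here is a rough model of the token stream in
plain Java. It is a sketch built on assumptions, not how ANTLR 3 actually
predicts tokens: it takes the longest match and breaks ties in favour of the
rule listed first, and it leaves out COMMENT and the escape sequences. The
class TokenOrderSketch and its regexes are hypothetical stand-ins for the
lexer rules in the grammar below.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: a simplified model of rule ordering.
class TokenOrderSketch {
    // Rules in the same order as the lexer rules in Chuck001 (escapes omitted).
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("Quote", Pattern.compile("\""));
        RULES.put("TYPE", Pattern.compile("[A-Z][a-z]*"));
        RULES.put("ID", Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*"));
        RULES.put("WS", Pattern.compile("[ \\t\\r\\n\\f]+"));
        RULES.put("UnquotedString", Pattern.compile("[^\\\\\" \\t]+"));
    }

    // Longest match wins; on a tie the rule listed first wins. WS is hidden.
    static List<String> tokenTypes(String input) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestType = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly longer only, so an earlier rule keeps a tie.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestType = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestType == null) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            if (!bestType.equals("WS")) {
                types.add(bestType);
            }
            pos = bestEnd;
        }
        return types;
    }

    public static void main(String[] args) {
        System.out.println(tokenTypes("\"test\" Integer \"01\""));
        System.out.println(tokenTypes("\"test\" String \"two words\""));
    }
}
```

In this model the first input gives Quote, ID, Quote, TYPE, Quote,
UnquotedString, Quote, which is what triplet expects. The second gives two ID
tokens between the last pair of quotes, so a value with a space in it no
longer fits the rule.
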
> Enjoy, Eric
>
>
>
> grammar Chuck001;
> // Parser Rules
> triplet : Quote ID Quote type Quote UnquotedString Quote ;
>
> type :  keyINTEGER
>  | keyBOOLEAN
>  | keySTRING
>  ;
>
> keyBOOLEAN : {input.LT(1).getText().equals("Boolean")}? TYPE;
> keyINTEGER  : {input.LT(1).getText().equals("Integer")}? TYPE;
> keySTRING : {input.LT(1).getText().equals("String")}? TYPE;
>
>
> // Lexer Rules
> Quote : '"';
> TYPE : ('A'..'Z')('a'..'z')*
>  ;
> ID  : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
>  ;
>
> COMMENT :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>  |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
>  ;
> WS :
>  ( '\t'
>  | ' '
>  | '\r'
>  | '\n'
>  | '\u000C'
>  ) + { $channel = HIDDEN; }
>  ;
> fragment
> HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
> fragment
> ESC_SEQ
>     :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
>     |   UNICODE_ESC
>     |   OCTAL_ESC
>     ;
> fragment
> OCTAL_ESC
>     :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
>     |   '\\' ('0'..'7') ('0'..'7')
>     |   '\\' ('0'..'7')
>     ;
> fragment
> UNICODE_ESC
>     :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
>     ;
>
> UnquotedString
>     :   ( ESC_SEQ | ~('\\'|'"'|' '|'\t') )*
>     ;
>
> On Sun, Mar 18, 2012 at 3:01 PM, Charles Daniels <cjdaniels4 at gmail.com> wrote:
>
>> Hi Eric,
>>
>> Thanks for the quick response. I have downloaded ANTLRWorks 1.4.2 and
>> created a fresh grammar using some of the defaults that the tool generates.
>> Below is my grammar.
>>
>> This grammar successfully parses the following input:
>>
>> name String "value"
>>
>>
>> However, I want to modify this grammar so that it will successfully parse
>> the following input instead:
>>
>> "name" String "value"
>>
>>
>> In attempting to do this, I modified the grammar below by adding double
>> quotes around ID, like so:
>>
>> ID  : '"' ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* '"'
>>     ;
>>
>>
>> However, parsing fails (MissingTokenException) for the desired input. I'm
>> guessing that's because "value" is matched to ID rather than to STRING,
>> when I add the quotes around ID.
>>
>> Is there any way to get "value" to match STRING instead of matching ID
>> when I add quotes to ID? Will backtracking help? If so, how would I specify
>> it?
>>
>> Thanks,
>> Chuck
>>
>> --- BEGIN GRAMMAR ---
>> grammar Config;
>>
>> triplet : ID type STRING
>> ;
>>  type : 'Boolean' | 'Integer' | 'String'
>>  ;
>>
>> ID  : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
>>     ;
>>
>> COMMENT
>>     :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>>     |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
>>     ;
>>
>> WS  :   ( ' '
>>         | '\t'
>>         | '\r'
>>         | '\n'
>>         ) {$channel=HIDDEN;}
>>     ;
>>
>> STRING
>>     :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
>>     ;
>>
>> fragment
>> HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
>>
>> fragment
>> ESC_SEQ
>>     :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
>>     |   UNICODE_ESC
>>     |   OCTAL_ESC
>>     ;
>>
>> fragment
>> OCTAL_ESC
>>     :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
>>     |   '\\' ('0'..'7') ('0'..'7')
>>     |   '\\' ('0'..'7')
>>     ;
>>
>> fragment
>> UNICODE_ESC
>>     :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
>>     ;
>> --- END GRAMMAR ---
>>
>>
>> On Sun, Mar 18, 2012 at 12:27 PM, Eric <researcher0x00 at gmail.com> wrote:
>>
>>> Hi Chuck,
>>>
>>> Off the top of my head I would guess that STRINGLITERAL  is trumping
>>> IDENTIFIER. In other words, the lexer generates tokens. The tokens are
>>> created based on the rules in the lexer. Since STRINGLITERAL  comes before
>>> IDENTIFIER, anything that matches STRINGLITERAL will make a
>>> STRINGLITERAL  token even if STRINGLITERAL  defines the same character
>>> string patterns as IDENTIFIER, i.e.  '"' ( ~('\\'|'"') )* '"' trumps '"'
>>> IdentifierStart IdentifierPart* '"'
>>>
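The overlap can be checked outside ANTLR. The sketch below uses
java.util.regex patterns that only approximate the two rules (IdentifierStart
and IdentifierPart are cut down to ASCII, and OverlapSketch is a made-up
name). It shows that a quoted name satisfies both patterns, so the lexer has
nothing but rule order to choose by.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; the patterns approximate the two lexer rules.
class OverlapSketch {
    // '"' ( ~('\\'|'"') )* '"'
    static final Pattern STRINGLITERAL = Pattern.compile("\"[^\\\\\"]*\"");
    // '"' IdentifierStart IdentifierPart* '"', restricted to ASCII for brevity
    static final Pattern IDENTIFIER = Pattern.compile("\"[A-Za-z_$][A-Za-z0-9_$]*\"");

    // True when the text could be either token.
    static boolean matchesBoth(String text) {
        return STRINGLITERAL.matcher(text).matches()
                && IDENTIFIER.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesBoth("\"name\""));      // true
        System.out.println(matchesBoth("\"two words\"")); // false
    }
}
```
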
>>> Can you post your full grammar? I am having to guess at the parts marked
>>> (copied from Java.g) and believe I have something different.
>>>
>>> Also I strongly recommend using ANTLRWorks 1.4.2 for a new user. Note
>>> this is not the latest version of ANTLRWorks, which is 1.4.3. I am not
>>> recommending ANTLRWorks 1.4.3 because it cannot draw the NFA and DFA
>>> diagrams due to a bug, and when a new user tries to do this and it doesn't
>>> work they think they did something wrong when in fact it is a bug from
>>> ANTLR 3.4, which is used by ANTLRWorks 1.4.3.
>>>
>>> Also, you can search previous posts to the list by using
>>> http://antlr.markmail.org/
>>>
>>> Hope that helps, Eric
>>>
>>>
>>>
>>>
>>> On Sun, Mar 18, 2012 at 11:22 AM, Charles Daniels <cjdaniels4 at gmail.com> wrote:
>>>
>>>> I am very new to ANTLR and I am having trouble properly defining part of a
>>>> grammar that I am constructing to recognize a specialized configuration
>>>> file syntax (already defined, so I cannot change it).
>>>>
>>>> The trouble I am having is recognizing the following type of entry in
>>>> the config file:
>>>>
>>>> "name" type "value"
>>>>
>>>>
>>>> where the following rules apply:
>>>>
>>>>   1. The double quotes are a required part of the syntax, both for the
>>>>   name and the value.
>>>>   2. A "name" is essentially a Java identifier
>>>>   3. A "value" is a string literal
>>>>
>>>>
>>>> I have the following grammar so far:
>>>>
>>>> triplet : IDENTIFIER type STRINGLITERAL ;
>>>>
>>>> type : 'Boolean' | 'Integer' | 'String' ;
>>>>
>>>> STRINGLITERAL : (copied from Java.g)
>>>>
>>>> IDENTIFIER : '"' IdentifierStart IdentifierPart* '"' ;
>>>>
>>>> IdentifierStart : (copied from Java.g)
>>>>
>>>> IdentifierPart : (copied from Java.g)
>>>>
>>>> When I compile this grammar, ANTLR hangs. When I remove the double
>>>> quotes from IDENTIFIER, it compiles successfully. I am guessing that when
>>>> I include the double quotes in IDENTIFIER they are somehow causing the
>>>> compilation to hang due to the double quotes that are also in
>>>> STRINGLITERAL.
>>>>
>>>> Does anybody have any suggestions on how to define this so that I can
>>>> use double quotes around names and values and the compiler doesn't hang?
>>>>
>>>> Thanks,
>>>> Chuck