[antlr-interest] [newbie] Quoted identifiers vs. string literals

Charles Daniels cjdaniels4 at gmail.com
Sun Mar 18 14:31:16 PDT 2012


Hi Eric,

I really appreciate the amount of time you're putting into helping me.

And yes, regarding the string literals, I do want to allow whitespace
characters (blank, tab, line feed, and carriage return).

Regarding the rules for keyBOOLEAN, etc., would you mind explaining just a
bit about what they are doing, particularly with the trailing TYPE?

Finally, regarding the triplet rule, won't the way you've written it permit
whitespace between the Quote tokens and the other tokens? I was putting the
quote characters within the lexer rules so that this wouldn't happen for
the ID. For the string literal, if the whitespace is captured separately
from the string literal, then the whitespace won't be a part of the string
literal, which wouldn't be right. Am I understanding things correctly here?
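
To make the concern concrete, here is a minimal hand-written sketch in plain
Java. It is not ANTLR-generated code, and the class and method names
(QuoteTokenDemo, visibleTokens, quotedTokenText) are hypothetical. It assumes
a lexer that emits each double quote as its own token and hides whitespace,
and contrasts that with keeping the quotes inside a single lexer rule. Escape
sequences are ignored to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration only; this is NOT code generated by ANTLR.
class QuoteTokenDemo {

    // Mimics a lexer with a separate Quote rule and a hidden WS rule:
    // whitespace is dropped, each '"' becomes its own token, and any other
    // run of characters becomes one word token.
    static List<String> visibleTokens(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // hidden channel: the parser never sees this character
            } else if (c == '"') {
                tokens.add("\"");
                i++;
            } else {
                int start = i;
                while (i < input.length()
                        && !Character.isWhitespace(input.charAt(i))
                        && input.charAt(i) != '"') {
                    i++;
                }
                tokens.add(input.substring(start, i));
            }
        }
        return tokens;
    }

    // Contrasting approach: the quotes live inside one lexer rule, so the
    // token text keeps every character between them, whitespace included.
    // Assumes the input contains an opening and a closing double quote.
    static String quotedTokenText(String input) {
        int open = input.indexOf('"');
        int close = input.indexOf('"', open + 1);
        return input.substring(open + 1, close);
    }

    public static void main(String[] args) {
        System.out.println(visibleTokens("\"name\""));
        System.out.println(visibleTokens("\" name \""));
        System.out.println(visibleTokens("\"hello world\""));
        System.out.println(quotedTokenText("\"hello world\""));
    }
}
```

Under these assumptions, "name" and " name " produce the same visible token
sequence, so a parser rule written over separate Quote tokens cannot tell them
apart, and the space in "hello world" is no longer part of any token. With the
quotes inside one rule, the text between them survives intact.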

Thanks a lot!
Chuck

On Sun, Mar 18, 2012 at 5:19 PM, Eric <researcher0x00 at gmail.com> wrote:

> Hi Chuck,
>
> Oops. I just realized that you probably want to have spaces in your
> StringLiteral and the grammar I just gave you doesn't allow that.
>
> I'll look at it some more.
>
> Eric
>
>
>
> On Sun, Mar 18, 2012 at 4:48 PM, Eric <researcher0x00 at gmail.com> wrote:
>
>> Hi Chuck,
>>
>> The grammar below worked for me for "test" Integer "01" and some other
>> basic tests.
>>
>> Be careful with the grammar; it can easily cause a new person lots of
>> problems. The main reason is that you have
>> 1. Keywords
>> 2. Identifiers
>> 3. String Literals
>> 4. Whitespace
>> which all overlap: each is either a subset of another or shares part of
>> another's character set. I spent 90% of my time setting up the rules to
>> keep them corralled and in the right order.
>>
>> The main changes I made were
>> 1. pulled all of the string literals out of the parser rules
>> 2. Used Ter's example for keywords. See:
>> http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741
>> 3. Created a TYPE lexer rule so that the types wouldn't become ID.
>> 4. Changed the WS rule, mostly added +
>> 5. Pulled the quotes out as a separate token
>> 6. Moved UnquotedString to be the last rule since it tries to consume
>> nearly everything.
>> 7. Added space and tab to the negation rule for UnquotedString. I avoid
>> negation in lexer rules like the plague; it always leads to a problem. The
>> UnquotedString rule can become the death of you if you don't respect it.
>>
>> Also using ANTLRWorks "Show Input Tokens" under the run menu revealed
>> that the space at the end of the type and before the quote was not being
>> pulled out as a WS token and that was causing a big problem.
>>
>> Enjoy, Eric
>>
>>
>>
>> grammar Chuck001;
>> // Parser Rules
>> triplet : Quote ID Quote type Quote UnquotedString Quote ;
>>
>> type :  keyINTEGER
>>  | keyBOOLEAN
>>  | keySTRING
>>  ;
>>
>> keyBOOLEAN : {input.LT(1).getText().equals("Boolean")}? TYPE;
>> keyINTEGER  : {input.LT(1).getText().equals("Integer")}? TYPE;
>> keySTRING : {input.LT(1).getText().equals("String")}? TYPE;
>>
>>
>> // Lexer Rules
>> Quote : '"';
>> TYPE : ('A'..'Z')('a'..'z')*
>>  ;
>> ID  : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
>>  ;
>>
>> COMMENT :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>>  |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
>>  ;
>> WS :
>>  ( '\t'
>>  | ' '
>>  | '\r'
>>  | '\n'
>>  | '\u000C'
>>  ) + { $channel = HIDDEN; }
>>  ;
>> fragment
>> HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
>> fragment
>> ESC_SEQ
>>     :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
>>     |   UNICODE_ESC
>>     |   OCTAL_ESC
>>     ;
>> fragment
>> OCTAL_ESC
>>     :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
>>     |   '\\' ('0'..'7') ('0'..'7')
>>     |   '\\' ('0'..'7')
>>     ;
>> fragment
>> UNICODE_ESC
>>     :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
>>     ;
>>
>> UnquotedString
>>     :   ( ESC_SEQ | ~('\\'|'"'|' '|'\t') )*
>>     ;
>>
>> On Sun, Mar 18, 2012 at 3:01 PM, Charles Daniels <cjdaniels4 at gmail.com> wrote:
>>
>>> Hi Eric,
>>>
>>> Thanks for the quick response. I have downloaded ANTLRWorks 1.4.2 and
>>> created a fresh grammar using some of the defaults that the tool generates.
>>> Below is my grammar.
>>>
>>> This grammar successfully parses the following input:
>>>
>>> name String "value"
>>>
>>>
>>> However, I want to modify this grammar so that it will successfully
>>> parse the following input instead:
>>>
>>> "name" String "value"
>>>
>>>
>>> In attempting to do this, I modified the grammar below by adding double
>>> quotes around ID, like so:
>>>
>>> ID  : '"' ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* '"'
>>>     ;
>>>
>>>
>>> However, parsing fails (MissingTokenException) for the desired input.
>>> I'm guessing that's because "value" is matched to ID rather than to STRING,
>>> when I add the quotes around ID.
>>>
>>> Is there any way to get "value" to match STRING instead of matching ID
>>> when I add quotes to ID? Will backtracking help? If so, how would I specify
>>> it?
>>>
>>> Thanks,
>>> Chuck
>>>
>>> --- BEGIN GRAMMAR ---
>>> grammar Config;
>>>
>>> triplet : ID type STRING
>>> ;
>>>  type : 'Boolean' | 'Integer' | 'String'
>>>  ;
>>>
>>> ID  : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
>>>     ;
>>>
>>> COMMENT
>>>     :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>>>     |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
>>>     ;
>>>
>>> WS  :   ( ' '
>>>         | '\t'
>>>         | '\r'
>>>         | '\n'
>>>         ) {$channel=HIDDEN;}
>>>     ;
>>>
>>> STRING
>>>     :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
>>>     ;
>>>
>>> fragment
>>> HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
>>>
>>> fragment
>>> ESC_SEQ
>>>     :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
>>>     |   UNICODE_ESC
>>>     |   OCTAL_ESC
>>>     ;
>>>
>>> fragment
>>> OCTAL_ESC
>>>     :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
>>>     |   '\\' ('0'..'7') ('0'..'7')
>>>     |   '\\' ('0'..'7')
>>>     ;
>>>
>>> fragment
>>> UNICODE_ESC
>>>     :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
>>>     ;
>>> --- END GRAMMAR ---
>>>
>>>
>>> On Sun, Mar 18, 2012 at 12:27 PM, Eric <researcher0x00 at gmail.com> wrote:
>>>
>>>> Hi Chuck,
>>>>
>>>> Off the top of my head I would guess that STRINGLITERAL is trumping
>>>> IDENTIFIER. In other words, the lexer generates tokens. The tokens are
>>>> created based on the rules in the lexer. Since STRINGLITERAL comes before
>>>> IDENTIFIER, anything that matches STRINGLITERAL will make a
>>>> STRINGLITERAL token even if STRINGLITERAL defines the same character
>>>> string patterns as IDENTIFIER, i.e. '"' ( ~('\\'|'"') )* '"' trumps '"'
>>>> IdentifierStart IdentifierPart* '"'
>>>>
>>>> Can you post your full grammar? I am having to guess at the parts marked
>>>> (copied from Java.g) and believe I have something different.
>>>>
>>>> Also, I strongly recommend ANTLRWorks 1.4.2 for a new user. Note that
>>>> this is not the latest version of ANTLRWorks, which is 1.4.3. I am not
>>>> recommending ANTLRWorks 1.4.3 because it cannot draw the NFA and DFA
>>>> diagrams due to a bug, and when a new user tries to do this and it doesn't
>>>> work, they think they did something wrong when in fact it is a bug from
>>>> ANTLR 3.4, which is used by ANTLRWorks 1.4.3.
>>>>
>>>> Also, you can search previous posts to the list by using
>>>> http://antlr.markmail.org/
>>>>
>>>> Hope that helps, Eric
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, Mar 18, 2012 at 11:22 AM, Charles Daniels <cjdaniels4 at gmail.com> wrote:
>>>>
>>>>> I am very new to ANTLR and I am having trouble properly defining part of a
>>>>> grammar that I am constructing to recognize a specialized configuration
>>>>> file syntax (already defined, so I cannot change it).
>>>>>
>>>>> The trouble I am having is recognizing the following type of entry in the
>>>>> config file:
>>>>>
>>>>> "name" type "value"
>>>>>
>>>>>
>>>>> where the following rules apply:
>>>>>
>>>>>   1. The double quotes are a required part of the syntax, both for the
>>>>>   name and the value.
>>>>>   2. A "name" is essentially a Java identifier
>>>>>   3. A "value" is a string literal
>>>>>
>>>>>
>>>>> I have the following grammar so far:
>>>>>
>>>>> triplet : IDENTIFIER type STRINGLITERAL ;
>>>>>
>>>>> type : 'Boolean' | 'Integer' | 'String' ;
>>>>>
>>>>> STRINGLITERAL : (copied from Java.g)
>>>>>
>>>>> IDENTIFIER : '"' IdentifierStart IdentifierPart* '"' ;
>>>>>
>>>>> IdentifierStart : (copied from Java.g)
>>>>>
>>>>> IdentifierPart : (copied from Java.g)
>>>>>
>>>>> When I compile this grammar, ANTLR hangs. When I remove the double quotes
>>>>> from IDENTIFIER, it compiles successfully. I am guessing that when I
>>>>> include the double quotes in IDENTIFIER they are somehow causing the
>>>>> compilation to hang due to the double quotes that are also in
>>>>> STRINGLITERAL.
>>>>>
>>>>> Does anybody have any suggestions on how to define this so that I can use
>>>>> double quotes around names and values and the compiler doesn't hang?
>>>>>
>>>>> Thanks,
>>>>> Chuck