[antlr-interest] [newbie] Quoted identifiers vs. string literals

Sun Mar 18 14:57:48 PDT 2012

On Sun, Mar 18, 2012 at 5:31 PM, Charles Daniels <cjdaniels4 at gmail.com> wrote:

> Hi Eric,
>
> I really appreciate the amount of time you're putting into helping me.
>
>
Thanks. Don't worry about it though, I wouldn't do it if it wasn't fun.

> And yes, regarding the string literals, I do want to allow whitespace
> characters (blank, tab, line feed, and carriage return).
>
>
Yes, it's obvious; otherwise why have the quotes? The more I think about
this, the more I think it should be done with regular expressions. I am
trying that right now in C#. I know you're using Java, but regular
expressions tend to be portable if you avoid engine-specific features. Just
because a grammar looks simple doesn't mean it is simple to parse. Almost
like a trick question.
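
As a rough sketch of the regular-expression idea in Java: the pattern, the
class name TripletRegex, and the choice to require whitespace between the
three parts are all my own assumptions, not anything taken from the config
format's definition. It treats the name as a quoted Java-style identifier
(ASCII only) and lets the value contain blanks, tabs, line feeds, and
carriage returns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TripletRegex {
    // Assumed shape: "name" Type "value", with whitespace separating the parts.
    private static final Pattern TRIPLET = Pattern.compile(
        "\\s*\"([A-Za-z_][A-Za-z0-9_]*)\""     // quoted identifier, no whitespace inside
        + "\\s+(Boolean|Integer|String)"        // one of the three type keywords
        + "\\s+\"((?:\\\\.|[^\"\\\\])*)\"\\s*", // quoted value: an escape, or any char except " and \
        Pattern.DOTALL);

    /** Returns {name, type, value}, or null when the input does not match. */
    static String[] parse(String line) {
        Matcher m = TRIPLET.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] t = parse("\"name\" String \"hello world\"");
        System.out.println(t[0] + " | " + t[1] + " | " + t[2]); // name | String | hello world
    }
}
```

Escape sequences in the value are only recognized as "backslash plus one
character" here; they are not decoded or validated the way ESC_SEQ does.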

> Regarding the rules for keyBOOLEAN, etc., would you mind explaining just a
> bit about what they are doing, particularly with the trailing TYPE?
>
>
Basically the { input.LT(1).getText().equals("Boolean") }? TYPE

breaks down as follows

input - an object representing the token stream fed to the parser
LT(1) - a method of input that returns the next token (one token of
lookahead) without consuming it.
getText() - for the token, get the text. A token has several properties
including line, char position, text, and so on.
equals - a String method, since a String is what we have from
input.LT(1).getText()
"Boolean" - what we want to test the text in the token against
{ xyz }? - an ANTLR semantic predicate. In this case think of an if
statement: if { xyz }? holds, then match what follows. So here, if the next
token's text is "Boolean", apply the TYPE rule; if not, skip it. Everything
between { } is copied as is into the generated parser. The ?, if I remember
right, means it expects a boolean result from the code in { }.

For ANTLR predicates, see the glossary
http://www.antlr.org/doc/glossary.html#Predicate,_syntactic or "The
Definitive ANTLR Reference"
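
To make the breakdown concrete, here is a small Java sketch of what the
keyBOOLEAN rule is checking. It does not use the ANTLR runtime: the Tok
class, the PredicateSketch class, and the boolean return value are
hypothetical stand-ins of mine. As far as I know the generated parser
reports a failed predicate through an exception rather than returning
false, so treat this only as a model of the logic.

```java
import java.util.Arrays;
import java.util.List;

class PredicateSketch {
    // Hypothetical stand-in for an ANTLR token: only type name and text are modeled.
    static final class Tok {
        final String type;
        final String text;
        Tok(String type, String text) { this.type = type; this.text = text; }
        String getText() { return text; }
    }

    private final List<Tok> tokens;
    private int pos = 0;

    PredicateSketch(List<Tok> tokens) { this.tokens = tokens; }

    // Mimics input.LT(k): look ahead k tokens without consuming anything.
    Tok LT(int k) {
        int i = pos + k - 1;
        return i < tokens.size() ? tokens.get(i) : new Tok("EOF", "<EOF>");
    }

    // Mimics:  keyBOOLEAN : {input.LT(1).getText().equals("Boolean")}? TYPE ;
    boolean keyBOOLEAN() {
        if (!LT(1).getText().equals("Boolean")) {
            return false;              // the predicate failed, nothing is consumed
        }
        if (!LT(1).type.equals("TYPE")) {
            return false;              // the text matched but the token is not a TYPE
        }
        pos++;                         // consume the TYPE token
        return true;
    }

    public static void main(String[] args) {
        PredicateSketch yes = new PredicateSketch(Arrays.asList(new Tok("TYPE", "Boolean")));
        PredicateSketch no = new PredicateSketch(Arrays.asList(new Tok("TYPE", "Integer")));
        System.out.println(yes.keyBOOLEAN()); // true
        System.out.println(no.keyBOOLEAN());  // false
    }
}
```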

> Finally, regarding the triplet rule, won't the way you've written it
> permit whitespace between the Quote tokens and the other tokens?
>

That's what bothers me: you would think it would, but something is changing
that. I suspect the lexer rules, because there is not much else it could
be. Part of the reason I am working on this is that it is a great, simple
case for learning. I have never had to apply such a combination of rules in
such a simple case.
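
For the parser side of that question, here is a Java sketch of why a rule
like Quote ID Quote would not notice whitespace once WS has been sent to
the hidden channel. It models only the filtering step; the Tok class, the
hidden flag, and the helper names are hypothetical, and it says nothing
about what the lexer rules in the grammar actually produce for such input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HiddenChannelSketch {
    // Hypothetical token: a type name, its text, and whether it is on the hidden channel.
    static final class Tok {
        final String type;
        final String text;
        final boolean hidden;
        Tok(String type, String text, boolean hidden) {
            this.type = type;
            this.text = text;
            this.hidden = hidden;
        }
    }

    // Keeps only the token types a parser would see: everything not hidden.
    static List<String> visibleTypes(List<Tok> all) {
        List<String> out = new ArrayList<>();
        for (Tok t : all) {
            if (!t.hidden) {
                out.add(t.type);
            }
        }
        return out;
    }

    // Mimics the start of the triplet rule: Quote ID Quote
    static boolean startsWithQuotedId(List<Tok> all) {
        List<String> v = visibleTypes(all);
        return v.size() >= 3
            && v.subList(0, 3).equals(Arrays.asList("Quote", "ID", "Quote"));
    }

    public static void main(String[] args) {
        // Assumed token sequence for the input:  " name "
        List<Tok> spaced = Arrays.asList(
            new Tok("Quote", "\"", false),
            new Tok("WS", " ", true),
            new Tok("ID", "name", false),
            new Tok("WS", " ", true),
            new Tok("Quote", "\"", false));
        System.out.println(startsWithQuotedId(spaced)); // true
    }
}
```

If the lexer really does emit those tokens, the parser rule alone cannot
reject the padded form, which is one argument for keeping the quotes
inside a single lexer rule.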

> I was putting the quote characters within the lexer rules so that this
> wouldn't happen for the ID. For the string literal, if the whitespace is
> captured separately from the string literal, then the whitespace won't be a
> part of the string literal, which wouldn't be right. Am I understanding
> things correctly here?
>
>
Sounds right to me.

>
>
> Thanks a lot!
> Chuck
>
> On Sun, Mar 18, 2012 at 5:19 PM, Eric <researcher0x00 at gmail.com> wrote:
>
>> Hi Chuck,
>>
>> Oops. I just realized that you probably want to have spaces in your
>> StringLiteral, and the grammar I just gave you doesn't allow that.
>>
>> I'll look at it some more.
>>
>> Eric
>>
>>
>>
>> On Sun, Mar 18, 2012 at 4:48 PM, Eric <researcher0x00 at gmail.com> wrote:
>>
>>> Hi Chuck,
>>>
>>> The grammar below worked for me for "test" Integer "01" and some other
>>> basic tests.
>>>
>>> Be careful with the grammar; it can easily cause a new person lots of
>>> problems. The main reason is that you have
>>> 1. Keywords
>>> 2. Identifiers
>>> 3. String Literals
>>> 4. Whitespace
>>> each of which is a subset of, or overlaps with, another. I spent 90% of
>>> my time setting up the rules to keep them corralled and in the right order.
>>>
>>> The main changes I made were
>>> 1. Pulled all of the string literals out of the parser rules
>>> 2. Used Ter's example for keywords. See:
>>> http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741
>>> 3. Created a TYPE lexer rule so that the types wouldn't become ID.
>>> 4. Changed the WS rule, mostly added +
>>> 5. Pulled the quotes out as a separate token
>>> 6. Moved UnquotedString to be the last rule since it tries to consume
>>> nearly everything.
>>> 7. Added space and tab to the negation rule for UnquotedString. I avoid
>>> negation in lexer rules like the plague; it always leads to a problem. The
>>> UnquotedString rule can become the death of you if you don't respect it.
>>>
>>> Also using ANTLRWorks "Show Input Tokens" under the run menu revealed
>>> that the space at the end of the type and before the quote was not being
>>> pulled out as a WS token and that was causing a big problem.
>>>
>>> Enjoy, Eric
>>>
>>>
>>>
>>> grammar Chuck001;
>>> // Parser Rules
>>> triplet : Quote ID Quote type Quote UnquotedString Quote ;
>>>
>>> type :  keyINTEGER
>>>  | keyBOOLEAN
>>>  | keySTRING
>>>  ;
>>>
>>> keyBOOLEAN : {input.LT(1).getText().equals("Boolean")}? TYPE;
>>> keyINTEGER  : {input.LT(1).getText().equals("Integer")}? TYPE;
>>> keySTRING : {input.LT(1).getText().equals("String")}? TYPE;
>>>
>>>
>>> // Lexer Rules
>>> Quote : '"';
>>> TYPE : ('A'..'Z')('a'..'z')*
>>>  ;
>>> ID  : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
>>>  ;
>>>
>>> COMMENT :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>>>  |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
>>>  ;
>>> WS :
>>>  ( '\t'
>>>  | ' '
>>>  | '\r'
>>>  | '\n'
>>>  | '\u000C'
>>>  ) + { $channel = HIDDEN; }
>>>  ;
>>> fragment
>>> HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
>>> fragment
>>> ESC_SEQ
>>>     :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
>>>     |   UNICODE_ESC
>>>     |   OCTAL_ESC
>>>     ;
>>> fragment
>>> OCTAL_ESC
>>>     :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
>>>     |   '\\' ('0'..'7') ('0'..'7')
>>>     |   '\\' ('0'..'7')
>>>     ;
>>> fragment
>>> UNICODE_ESC
>>>     :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
>>>     ;
>>>
>>> UnquotedString
>>>     :   ( ESC_SEQ | ~('\\'|'"'|' '|'\t') )*
>>>     ;
>>>
>>> On Sun, Mar 18, 2012 at 3:01 PM, Charles Daniels <cjdaniels4 at gmail.com> wrote:
>>>
>>>> Hi Eric,
>>>>
>>>> Thanks for the quick response. I have downloaded ANTLRWorks 1.4.2 and
>>>> created a fresh grammar using some of the defaults that the tool generates.
>>>> Below is my grammar.
>>>>
>>>> This grammar successfully parses the following input:
>>>>
>>>> name String "value"
>>>>
>>>>
>>>> However, I want to modify this grammar so that it will successfully
>>>> parse the following input instead:
>>>>
>>>> "name" String "value"
>>>>
>>>>
>>>> In attempting to do this, I modified the grammar below by adding double
>>>> quotes around ID, like so:
>>>>
>>>> ID  : '"' ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* '"'
>>>>     ;
>>>>
>>>>
>>>> However, parsing fails (MissingTokenException) for the desired input.
>>>> I'm guessing that's because "value" is matched to ID rather than to STRING,
>>>> when I add the quotes around ID.
>>>>
>>>> Is there any way to get "value" to match STRING instead of matching ID
>>>> when I add quotes to ID? Will backtracking help? If so, how would I specify
>>>> it?
>>>>
>>>> Thanks,
>>>> Chuck
>>>>
>>>> --- BEGIN GRAMMAR ---
>>>> grammar Config;
>>>>
>>>> triplet : ID type STRING
>>>> ;
>>>>  type : 'Boolean' | 'Integer' | 'String'
>>>>  ;
>>>>
>>>> ID  : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
>>>>     ;
>>>>
>>>> COMMENT
>>>>     :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>>>>     |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
>>>>     ;
>>>>
>>>> WS  :   ( ' '
>>>>         | '\t'
>>>>         | '\r'
>>>>         | '\n'
>>>>         ) {$channel=HIDDEN;}
>>>>     ;
>>>>
>>>> STRING
>>>>     :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
>>>>     ;
>>>>
>>>> fragment
>>>> HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
>>>>
>>>> fragment
>>>> ESC_SEQ
>>>>     :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
>>>>     |   UNICODE_ESC
>>>>     |   OCTAL_ESC
>>>>     ;
>>>>
>>>> fragment
>>>> OCTAL_ESC
>>>>     :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
>>>>     |   '\\' ('0'..'7') ('0'..'7')
>>>>     |   '\\' ('0'..'7')
>>>>     ;
>>>>
>>>> fragment
>>>> UNICODE_ESC
>>>>     :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
>>>>     ;
>>>> --- END GRAMMAR ---
>>>>
>>>>
>>>> On Sun, Mar 18, 2012 at 12:27 PM, Eric <researcher0x00 at gmail.com> wrote:
>>>>
>>>>> Hi Chuck,
>>>>>
>>>>> Off the top of my head I would guess that STRINGLITERAL  is trumping
>>>>> IDENTIFIER. In other words, the lexer generates tokens. The tokens are
>>>>> created based on the rules in the lexer. Since STRINGLITERAL  comes before
>>>>> IDENTIFIER, anything that matches STRINGLITERAL will make a
>>>>> STRINGLITERAL  token even if STRINGLITERAL  defines the same character
>>>>> string patterns as IDENTIFIER, i.e.  '"' ( ~('\\'|'"') )* '"' trumps '"'
>>>>> IdentifierStart IdentifierPart* '"'
>>>>>
>>>>> Can you post your full grammar? I am having to guess at the parts marked
>>>>> (copied from Java.g) and believe I may have something different.
>>>>>
>>>>> Also, I strongly recommend ANTLRWorks 1.4.2 for a new user. Note that
>>>>> this is not the latest version of ANTLRWorks, which is 1.4.3. I am not
>>>>> recommending ANTLRWorks 1.4.3 because it cannot draw the NFA and DFA
>>>>> diagrams due to a bug, and when a new user tries to do this and it doesn't
>>>>> work, they think they did something wrong when in fact it is a bug from
>>>>> ANTLR 3.4, which is used by ANTLRWorks 1.4.3.
>>>>>
>>>>> Also, you can search previous posts to the list by using
>>>>> http://antlr.markmail.org/
>>>>>
>>>>> Hope that helps, Eric
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Mar 18, 2012 at 11:22 AM, Charles Daniels <
>>>>> cjdaniels4 at gmail.com> wrote:
>>>>>
>>>>>> I am very new to ANTLR and I am having trouble properly defining part of
>>>>>> a grammar that I am constructing to recognize a specialized configuration
>>>>>> file syntax (already defined, so I cannot change it).
>>>>>>
>>>>>> The trouble I am having is recognizing the following type of entry in the
>>>>>> config file:
>>>>>>
>>>>>> "name" type "value"
>>>>>>
>>>>>>
>>>>>> where the following rules apply:
>>>>>>
>>>>>>   1. The double quotes are a required part of the syntax, both for the
>>>>>>   name and the value.
>>>>>>   2. A "name" is essentially a Java identifier
>>>>>>   3. A "value" is a string literal
>>>>>>
>>>>>>
>>>>>> I have the following grammar so far:
>>>>>>
>>>>>> triplet : IDENTIFIER type STRINGLITERAL ;
>>>>>>
>>>>>> type : 'Boolean' | 'Integer' | 'String' ;
>>>>>>
>>>>>> STRINGLITERAL : (copied from Java.g)
>>>>>>
>>>>>> IDENTIFIER : '"' IdentifierStart IdentifierPart* '"' ;
>>>>>>
>>>>>> IdentifierStart : (copied from Java.g)
>>>>>>
>>>>>> IdentifierPart : (copied from Java.g)
>>>>>>
>>>>>> When I compile this grammar, ANTLR hangs. When I remove the double quotes
>>>>>> from IDENTIFIER, it compiles successfully. I am guessing that when I
>>>>>> include the double quotes in IDENTIFIER they are somehow causing the
>>>>>> compilation to hang due to the double quotes that are also in
>>>>>> STRINGLITERAL.
>>>>>>
>>>>>> Does anybody have any suggestions on how to define this so that I can use
>>>>>> double quotes around names and values and the compiler doesn't hang?
>>>>>>
>>>>>> Thanks,
>>>>>> Chuck
>>>>>>