[antlr-interest] [newbie] Quoted identifiers vs. string literals

Sun Mar 18 19:21:07 PDT 2012

Hi Chuck,

This should have it all.

I took some extra time to set up the tokens for a more usable AST.
If you use the ANTLRWorks debugger, you will see that the parse tree is
nicer, and I stripped the surrounding double quotes from the ID text.

Enjoy, Eric

grammar Chuck001;
// Parser Rules
triplet  : id  type  string;
id : ID ;

type : keyINTEGER
 | keyBOOLEAN
 | keySTRING
 ;

keyBOOLEAN : {input.LT(1).getText().equals("Boolean")}? Boolean;
keyINTEGER  : {input.LT(1).getText().equals("Integer")}? Integer;
keySTRING : {input.LT(1).getText().equals("String")}? String;
string :  STRING;

// Lexer Rules
Boolean : 'Boolean';
Integer : 'Integer';
String : 'String';

ID  :  '"' (('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*) '"'
 { setText(getText().substring(1,getText().length()-1));};

STRING  : '"' ( ESC_SEQ | ~('\\'|'"') )* '"' ;
WS : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ {$channel=HIDDEN;}
 ;

COMMENT :   '//' ~('\n'|'\r')* '\r'? '\n'   {$channel=HIDDEN;}
 |   '/*' ( options {greedy=false;} : . )* '*/'  {$channel=HIDDEN;}
 ;
fragment
HEX_DIGIT  : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
    |   UNICODE_ESC
    |   OCTAL_ESC
    ;
fragment
OCTAL_ESC
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;
fragment
UNICODE_ESC
    :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;
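
The action on the ID rule above drops the first and last character of the
matched text, so the parser sees name rather than "name". A minimal sketch of
the same substring arithmetic in plain Java, outside ANTLR (QuoteStripper and
stripQuotes are hypothetical names used only for illustration):

```java
// Hypothetical helper, not part of the grammar or of the ANTLR runtime.
final class QuoteStripper {
    // Same arithmetic as the ID action: substring(1, length() - 1).
    static String stripQuotes(String tokenText) {
        if (tokenText.length() < 2) {
            throw new IllegalArgumentException("expected a quoted token: " + tokenText);
        }
        return tokenText.substring(1, tokenText.length() - 1);
    }

    public static void main(String[] args) {
        // The token text as the lexer matched it, quotes included.
        System.out.println(stripQuotes("\"name\""));
    }
}
```

The length check is an addition for the sketch; the lexer rule itself always
matches at least the two quote characters before the action runs.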

On Sun, Mar 18, 2012 at 5:57 PM, Eric <researcher0x00 at gmail.com> wrote:

>
>
> On Sun, Mar 18, 2012 at 5:31 PM, Charles Daniels <cjdaniels4 at gmail.com> wrote:
>
>> Hi Eric,
>>
>> I really appreciate the amount of time you're putting into helping me.
>>
>>
> Thanks. Don't worry about it though, I wouldn't do it if it wasn't fun.
>
>
>
>> And yes, regarding the string literals, I do want to allow whitespace
>> characters (blank, tab, line feed, and carriage return).
>>
>>
> Yeah, it's obvious; otherwise why have the quotes? The more I think about
> this, the more I think it should be done with regular expressions. I am
> trying that right now in C#. I know you're using Java, but regular
> expressions tend to be universal if you don't use custom features. Just
> because it looks like a simple grammar doesn't mean it can be parsed.
> Almost like a trick question.
>
>> Regarding the rules for keyBOOLEAN, etc., would you mind explaining just
>> a bit about what they are doing, particularly with the trailing TYPE?
>>
>>
> Basically, { input.LT(1).getText().equals("Boolean") }? TYPE
>
> breaks down as follows:
>
> input - the parser's input, i.e. the stream of tokens produced by the lexer
> LT(1) - a method of input that returns the next token (one token of
> lookahead) without consuming it.
> getText() - gets the text of that token. A token has several properties,
> including line, char position, text, and so on.
> equals - the Java String method, called on the string returned by
> input.LT(1).getText()
> "Boolean" - the text we want to test the token's text against
> { xyz }? - an ANTLR semantic predicate. Think of it as an if statement:
> if { xyz } is true, then match what follows. So here, if the next token's
> text is "Boolean", the rule goes on to match TYPE; if not, the rule does
> not match. Everything between { } is copied as is into the generated
> parser. The ?, if I remember right, means the code in { } is expected to
> produce a boolean result.
>
> For ANTLR predicates, see the glossary
> http://www.antlr.org/doc/glossary.html#Predicate,_syntactic or "The
> Definitive ANTLR Reference".
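
The breakdown above can be sketched in plain Java without the ANTLR runtime.
PredicateSketch, lookaheadText and matchKeyBoolean are hypothetical names, and
returning false on a failed predicate is a simplification for the sketch; the
parser ANTLR generates handles a failed predicate in its own way.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a parser's token stream; this is NOT the ANTLR runtime API.
final class PredicateSketch {
    private final List<String> tokenTexts;
    private int pos = 0;

    PredicateSketch(String... tokenTexts) {
        this.tokenTexts = Arrays.asList(tokenTexts);
    }

    // Plays the role of input.LT(k).getText(): peek at the k-th upcoming token's text.
    String lookaheadText(int k) {
        int index = pos + k - 1;
        return index < tokenTexts.size() ? tokenTexts.get(index) : "<EOF>";
    }

    // Plays the role of: {input.LT(1).getText().equals("Boolean")}? TYPE
    // Returns false, consuming nothing, when the predicate fails.
    boolean matchKeyBoolean() {
        if (!lookaheadText(1).equals("Boolean")) {
            return false;
        }
        pos++; // consume the TYPE token
        return true;
    }

    public static void main(String[] args) {
        System.out.println(new PredicateSketch("Boolean", "\"true\"").matchKeyBoolean());
        System.out.println(new PredicateSketch("Integer", "\"01\"").matchKeyBoolean());
    }
}
```

The point of the sketch is only that the predicate inspects the next token's
text before anything is consumed, which is what lets one TYPE token stand in
for several keywords.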
>
>
>
>> Finally, regarding the triplet rule, won't the way you've written it
>> permit whitespace between the Quote tokens and the other tokens?
>>
>
> That's what bothers me: you would think it would, but something is
> changing that. I suspect the lexer rules, because there is not much else it
> could be. Part of the reason I am working on this is that it is a great,
> simple case for learning. I have never had to apply such a combination in
> such a simple case.
>
>
>> I was putting the quote characters within the lexer rules so that this
>> wouldn't happen for the ID. For the string literal, if the whitespace is
>> captured separately from the string literal, then the whitespace won't be a
>> part of the string literal, which wouldn't be right. Am I understanding
>> things correctly here?
>>
>>
> Sounds right to me.
>
>>
>>
>> Thanks a lot!
>> Chuck
>>
>> On Sun, Mar 18, 2012 at 5:19 PM, Eric <researcher0x00 at gmail.com> wrote:
>>
>>> Hi Chuck,
>>>
>>> Oops. I just realized that you probably want to have spaces in your
>>> StringLiteral and the grammar I just gave you doesn't allow that.
>>>
>>> I'll look at it some more.
>>>
>>> Eric
>>>
>>>
>>>
>>> On Sun, Mar 18, 2012 at 4:48 PM, Eric <researcher0x00 at gmail.com> wrote:
>>>
>>>> Hi Chuck,
>>>>
>>>> The grammar below worked for me for   "test" Integer "01"   and some
>>>> other basic tests.
>>>>
>>>> Be careful with the grammar; it can easily cause a new person lots of
>>>> problems. The main reason is that you have
>>>> 1. Keywords
>>>> 2. Identifiers
>>>> 3. String Literals
>>>> 4. Whitespace
>>>> each of which is a subset of, or overlaps with, another. I spent 90%
>>>> of my time setting up the rules to keep them corralled and in the right
>>>> order.
>>>>
>>>> The main changes I made were
>>>> 1. pulled all of the string literals out of the parser rules
>>>> 2. Used Ter's example for keywords. See:
>>>> http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741
>>>> 3. Created a TYPE lexer rule so that the types wouldn't become ID.
>>>> 4. Changed the WS rule, mostly added +
>>>> 5. Pulled the quotes out as a separate token
>>>> 6. Moved UnquotedString to be the last rule since it tries to consume
>>>> nearly everything.
>>>> 7. Added space and tab to the negation rule for UnquotedString. I avoid
>>>> negation in lexer rules like the plague; it always leads to a problem. The
>>>> UnquotedString rule can become the death of you if you don't respect it.
>>>>
>>>> Also using ANTLRWorks "Show Input Tokens" under the run menu revealed
>>>> that the space at the end of the type and before the quote was not being
>>>> pulled out as a WS token and that was causing a big problem.
>>>>
>>>> Enjoy, Eric
>>>>
>>>>
>>>>
>>>> grammar Chuck001;
>>>> // Parser Rules
>>>> triplet : Quote ID Quote type Quote UnquotedString Quote ;
>>>>
>>>> type :  keyINTEGER
>>>>  | keyBOOLEAN
>>>>  | keySTRING
>>>>  ;
>>>>
>>>> keyBOOLEAN : {input.LT(1).getText().equals("Boolean")}? TYPE;
>>>> keyINTEGER  : {input.LT(1).getText().equals("Integer")}? TYPE;
>>>> keySTRING : {input.LT(1).getText().equals("String")}? TYPE;
>>>>
>>>>
>>>> // Lexer Rules
>>>> Quote : '"';
>>>> TYPE : ('A'..'Z')('a'..'z')*
>>>>  ;
>>>> ID  : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
>>>>  ;
>>>>
>>>> COMMENT :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>>>>  |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
>>>>  ;
>>>> WS :
>>>>  ( '\t'
>>>>  | ' '
>>>>  | '\r'
>>>>  | '\n'
>>>>  | '\u000C'
>>>>  ) + { $channel = HIDDEN; }
>>>>  ;
>>>> fragment
>>>> HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
>>>> fragment
>>>> ESC_SEQ
>>>>     :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
>>>>     |   UNICODE_ESC
>>>>     |   OCTAL_ESC
>>>>     ;
>>>> fragment
>>>> OCTAL_ESC
>>>>     :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
>>>>     |   '\\' ('0'..'7') ('0'..'7')
>>>>     |   '\\' ('0'..'7')
>>>>     ;
>>>> fragment
>>>> UNICODE_ESC
>>>>     :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
>>>>     ;
>>>>
>>>> UnquotedString
>>>>     :   ( ESC_SEQ | ~('\\'|'"'|' '|'\t') )*
>>>>     ;
>>>>
>>>> On Sun, Mar 18, 2012 at 3:01 PM, Charles Daniels <cjdaniels4 at gmail.com> wrote:
>>>>
>>>>> Hi Eric,
>>>>>
>>>>> Thanks for the quick response. I have downloaded ANTLRWorks 1.4.2 and
>>>>> created a fresh grammar using some of the defaults that the tool generates.
>>>>> Below is my grammar.
>>>>>
>>>>> This grammar successfully parses the following input:
>>>>>
>>>>> name String "value"
>>>>>
>>>>>
>>>>> However, I want to modify this grammar so that it will successfully
>>>>> parse the following input instead:
>>>>>
>>>>> "name" String "value"
>>>>>
>>>>>
>>>>> In attempting to do this, I modified the grammar below by adding
>>>>> double quotes around ID, like so:
>>>>>
>>>>> ID  : '"' ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* '"'
>>>>>     ;
>>>>>
>>>>>
>>>>> However, parsing fails (MissingTokenException) for the desired input.
>>>>> I'm guessing that's because "value" is matched to ID rather than to STRING,
>>>>> when I add the quotes around ID.
>>>>>
>>>>> Is there any way to get "value" to match STRING instead of matching ID
>>>>> when I add quotes to ID? Will backtracking help? If so, how would I specify
>>>>> it?
>>>>>
>>>>> Thanks,
>>>>> Chuck
>>>>>
>>>>> --- BEGIN GRAMMAR ---
>>>>> grammar Config;
>>>>>
>>>>> triplet : ID type STRING
>>>>> ;
>>>>>  type : 'Boolean' | 'Integer' | 'String'
>>>>>  ;
>>>>>
>>>>> ID  : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
>>>>>     ;
>>>>>
>>>>> COMMENT
>>>>>     :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>>>>>     |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
>>>>>     ;
>>>>>
>>>>> WS  :   ( ' '
>>>>>         | '\t'
>>>>>         | '\r'
>>>>>         | '\n'
>>>>>         ) {$channel=HIDDEN;}
>>>>>     ;
>>>>>
>>>>> STRING
>>>>>     :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
>>>>>     ;
>>>>>
>>>>> fragment
>>>>> HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
>>>>>
>>>>> fragment
>>>>> ESC_SEQ
>>>>>     :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
>>>>>     |   UNICODE_ESC
>>>>>     |   OCTAL_ESC
>>>>>     ;
>>>>>
>>>>> fragment
>>>>> OCTAL_ESC
>>>>>     :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
>>>>>     |   '\\' ('0'..'7') ('0'..'7')
>>>>>     |   '\\' ('0'..'7')
>>>>>     ;
>>>>>
>>>>> fragment
>>>>> UNICODE_ESC
>>>>>     :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
>>>>>     ;
>>>>> --- END GRAMMAR ---
>>>>>
>>>>>
>>>>> On Sun, Mar 18, 2012 at 12:27 PM, Eric <researcher0x00 at gmail.com> wrote:
>>>>>
>>>>>> Hi Chuck,
>>>>>>
>>>>>> Off the top of my head I would guess that STRINGLITERAL is trumping
>>>>>> IDENTIFIER. In other words, the lexer generates tokens, and the tokens
>>>>>> are created based on the rules in the lexer. Since STRINGLITERAL comes
>>>>>> before IDENTIFIER, anything that matches STRINGLITERAL will become a
>>>>>> STRINGLITERAL token even if STRINGLITERAL defines the same character
>>>>>> string patterns as IDENTIFIER, i.e. '"' ( ~('\\'|'"') )* '"' trumps '"'
>>>>>> IdentifierStart IdentifierPart* '"'.
>>>>>>
>>>>>> Can you post your full grammar? I am having to guess at the parts marked
>>>>>> (copied from Java.g) and believe I have something different.
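
The rule-order effect described above can be illustrated with a deliberately
simplified model in plain Java: two regular expressions that overlap, with the
first one listed winning. This is only a sketch and not ANTLR's actual
prediction algorithm; RuleOrderSketch, rules and classify are hypothetical
names, and the IDENTIFIER pattern is an ASCII-only assumption rather than the
Java.g definition.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Simplified model of "the first matching lexer rule wins"; not how ANTLR is implemented.
final class RuleOrderSketch {
    // '"' ( ~('\\'|'"') )* '"' written as a regular expression.
    static final Pattern STRINGLITERAL = Pattern.compile("\"[^\\\\\"]*\"");
    // Quoted identifier, ASCII letters, digits and underscore only (an assumption).
    static final Pattern IDENTIFIER = Pattern.compile("\"[A-Za-z_][A-Za-z0-9_]*\"");

    // Build the rule table in one of the two possible declaration orders.
    static Map<String, Pattern> rules(boolean stringLiteralFirst) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        if (stringLiteralFirst) {
            rules.put("STRINGLITERAL", STRINGLITERAL);
            rules.put("IDENTIFIER", IDENTIFIER);
        } else {
            rules.put("IDENTIFIER", IDENTIFIER);
            rules.put("STRINGLITERAL", STRINGLITERAL);
        }
        return rules;
    }

    // Name of the first rule, in declaration order, whose pattern matches the whole text.
    static String classify(Map<String, Pattern> rulesInOrder, String text) {
        for (Map.Entry<String, Pattern> rule : rulesInOrder.entrySet()) {
            if (rule.getValue().matcher(text).matches()) {
                return rule.getKey();
            }
        }
        return "NO_MATCH";
    }

    public static void main(String[] args) {
        System.out.println(classify(rules(true), "\"name\""));
        System.out.println(classify(rules(false), "\"name\""));
        System.out.println(classify(rules(false), "\"hello world\""));
    }
}
```

In this model every quoted identifier is also a valid string literal, so
whichever rule is listed first claims "name"; only text that is not
identifier-shaped, such as "hello world" or "01", reaches STRINGLITERAL when
IDENTIFIER comes first.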
>>>>>>
>>>>>> Also, I strongly recommend ANTLRWorks 1.4.2 for a new user. Note that
>>>>>> this is not the latest version of ANTLRWorks, which is 1.4.3. I am not
>>>>>> recommending ANTLRWorks 1.4.3 because it cannot draw the NFA and DFA
>>>>>> diagrams due to a bug. When a new user tries to do this and it doesn't
>>>>>> work, they think they did something wrong, when in fact it is a bug in
>>>>>> ANTLR 3.4, which is used by ANTLRWorks 1.4.3.
>>>>>>
>>>>>> Also, you can search previous posts to the list by using
>>>>>> http://antlr.markmail.org/
>>>>>>
>>>>>> Hope that helps, Eric
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Mar 18, 2012 at 11:22 AM, Charles Daniels <
>>>>>> cjdaniels4 at gmail.com> wrote:
>>>>>>
>>>>>>> I am very new to ANTLR and I am having trouble properly defining part
>>>>>>> of a grammar that I am constructing to recognize a specialized
>>>>>>> configuration file syntax (already defined, so I cannot change it).
>>>>>>>
>>>>>>> The trouble I am having is recognizing the following type of entry in
>>>>>>> the config file:
>>>>>>>
>>>>>>> "name" type "value"
>>>>>>>
>>>>>>>
>>>>>>> where the following rules apply:
>>>>>>>
>>>>>>>   1. The double quotes are a required part of the syntax, both for the
>>>>>>>   name and the value.
>>>>>>>   2. A "name" is essentially a Java identifier
>>>>>>>   3. A "value" is a string literal
>>>>>>>
>>>>>>>
>>>>>>> I have the following grammar so far:
>>>>>>>
>>>>>>> triplet : IDENTIFIER type STRINGLITERAL ;
>>>>>>>
>>>>>>> type : 'Boolean' | 'Integer' | 'String' ;
>>>>>>>
>>>>>>> STRINGLITERAL : (copied from Java.g)
>>>>>>>
>>>>>>> IDENTIFIER : '"' IdentifierStart IdentifierPart* '"' ;
>>>>>>>
>>>>>>> IdentifierStart : (copied from Java.g)
>>>>>>>
>>>>>>> IdentifierPart : (copied from Java.g)
>>>>>>>
>>>>>>> When I compile this grammar, ANTLR hangs. When I remove the double
>>>>>>> quotes from IDENTIFIER, it compiles successfully. I am guessing that
>>>>>>> when I include the double quotes in IDENTIFIER they are somehow causing
>>>>>>> the compilation to hang due to the double quotes that are also in
>>>>>>> STRINGLITERAL.
>>>>>>>
>>>>>>> Does anybody have any suggestions on how to define this so that I can
>>>>>>> use double quotes around names and values and the compiler doesn't hang?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Chuck
>>>>>>>
>>>>>>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>>>>>>> Unsubscribe:
>>>>>>> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>>>>>>>