[antlr-interest] [newbie] Quoted identifiers vs. string literals
Eric
researcher0x00 at gmail.com
Sun Mar 18 14:57:48 PDT 2012
On Sun, Mar 18, 2012 at 5:31 PM, Charles Daniels <cjdaniels4 at gmail.com> wrote:
> Hi Eric,
>
> I really appreciate the amount of time you're putting into helping me.
>
>
Thanks. Don't worry about it though, I wouldn't do it if it wasn't fun.
> And yes, regarding the string literals, I do want to allow whitespace
> characters (blank, tab, line feed, and carriage return).
>
>
Yeah, it's obvious; otherwise why have the quotes? The more I think about
this, the more I think it should be done with regular expressions. I am
trying that right now in C#. I know you're using Java, but regular
expressions tend to be portable if you don't use engine-specific features.
Just because it looks like a simple grammar doesn't mean it is easy to
parse. Almost like a trick question.
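
For what it's worth, here is a minimal sketch of the regular-expression
route in Java. It is an illustration only, not the C# code mentioned above:
the class name TripletRegex, the method parse, and the exact pattern are
assumptions. It requires the quotes to touch the name, accepts only the
three type names, and lets the value contain whitespace and backslash
escapes.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TripletRegex {
    // Hypothetical pattern for:  "name" Type "value"
    private static final Pattern TRIPLET = Pattern.compile(
        "\\s*\"([A-Za-z_][A-Za-z0-9_]*)\""      // quoted identifier, no spaces inside the quotes
      + "\\s+(Boolean|Integer|String)"          // one of the three type keywords
      + "\\s+\"((?:\\\\.|[^\"\\\\])*)\"\\s*",   // quoted value: escapes or any char except " and \
        Pattern.DOTALL);                        // let an escaped character be a line break too

    /** Returns {name, type, value} if the whole line is a triplet, else empty. */
    static Optional<String[]> parse(String line) {
        Matcher m = TRIPLET.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group(1), m.group(2), m.group(3) });
    }

    public static void main(String[] args) {
        // The value keeps its inner whitespace because the quotes are part of the pattern.
        parse("\"name\" String \"hello  world\"")
            .ifPresent(t -> System.out.println(t[0] + " | " + t[1] + " | " + t[2]));
        // A space between the quote and the name is rejected.
        System.out.println(parse("\" name \" String \"v\"").isPresent());
    }
}
```

The value group is returned with its escape sequences still raw; decoding
them would be a separate step.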
> Regarding the rules for keyBOOLEAN, etc., would you mind explaining just a
> bit about what they are doing, particularly with the trailing TYPE?
>
>
Basically, { input.LT(1).getText().equals("Boolean") }? TYPE
breaks down as follows:
input - an object representing the parser's input, i.e. the token stream
LT(1) - a method of input that returns the next token (one token of
lookahead) without consuming it.
getText() - for the token, get the text. A token has several properties,
including line, char pos, text, and so on.
equals - a String method, since a String is what we get back from
input.LT(1).getText()
"Boolean" - what we want to test the text in the token against
{ xyz }? - an ANTLR semantic predicate. In this case, think "if statement":
if { xyz }? is true, then something. So here, if the next token's text is
"Boolean", then match a TYPE token; if not, the rule does not apply.
Everything between { } is copied as is into the generated parser. The ?, if
I remember right, means the code in { } is expected to produce a boolean
result.
For ANTLR predicates, see the glossary
http://www.antlr.org/doc/glossary.html#Predicate,_syntactic or "The
Definitive ANTLR Reference".
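
To make the "think if statement" reading concrete, here is a rough
plain-Java sketch of what such a predicate amounts to. It does not use the
ANTLR runtime and is not the code ANTLR generates; Token, TokenStream, and
matchKeyword are hypothetical stand-ins, with LT(1) modeled as a peek at
the next token that does not consume it.

```java
import java.util.Arrays;
import java.util.List;

final class PredicateSketch {
    /** Hypothetical stand-in for a lexer token: a type name plus the matched text. */
    static final class Token {
        private final String type;
        private final String text;
        Token(String type, String text) { this.type = type; this.text = text; }
        String getType() { return type; }
        String getText() { return text; }
    }

    /** Hypothetical stand-in for the parser's token stream. */
    static final class TokenStream {
        private final List<Token> tokens;
        private int pos = 0;
        TokenStream(List<Token> tokens) { this.tokens = tokens; }

        /** Like LT(k): look k tokens ahead without consuming; null past the end. */
        Token LT(int k) {
            int i = pos + k - 1;
            return i < tokens.size() ? tokens.get(i) : null;
        }

        void consume() { pos++; }
    }

    /** Mirrors  {input.LT(1).getText().equals(keyword)}? TYPE  as an if statement. */
    static boolean matchKeyword(TokenStream input, String keyword) {
        Token next = input.LT(1);
        if (next != null && next.getText().equals(keyword) && next.getType().equals("TYPE")) {
            input.consume();  // predicate held and the token is a TYPE: take it
            return true;
        }
        return false;         // predicate failed: leave the token for another alternative
    }

    public static void main(String[] args) {
        TokenStream input = new TokenStream(Arrays.asList(new Token("TYPE", "Integer")));
        System.out.println(matchKeyword(input, "Boolean"));
        System.out.println(matchKeyword(input, "Integer"));
    }
}
```

Running main prints false and then true: the first call leaves the token
in place, so the second alternative still gets to look at it.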
> Finally, regarding the triplet rule, won't the way you've written it
> permit whitespace between the Quote tokens and the other tokens?
>
That's what bothers me: you would think it would, but something is
preventing that. I suspect the lexer rules, because there is not much else
it could be. Part of the reason I am working on this is that it is a great,
simple case for learning. I have never had to apply such a combination of
rules in such a simple case.
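
One way to look at the concern in isolation is the toy Java sketch below.
It is not ANTLR and not the grammar from this thread: the scanner, its
token kinds, and the names HiddenWsSketch, visibleTokens, and
acceptsQuotedId are all made up for illustration. Whitespace is "hidden" by
simply not handing it to the parser-level check, which then looks for
Quote ID Quote.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HiddenWsSketch {
    private static boolean isIdStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    private static boolean isIdPart(char c) {
        return isIdStart(c) || (c >= '0' && c <= '9');
    }

    /** Toy scanner: returns only the token kinds a parser would see. */
    static List<String> visibleTokens(String text) {
        List<String> kinds = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '"') {
                kinds.add("Quote");
                i++;
            } else if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                // Whitespace run: treated as hidden, so nothing is added.
                while (i < text.length() && " \t\r\n".indexOf(text.charAt(i)) >= 0) {
                    i++;
                }
            } else if (isIdStart(c)) {
                while (i < text.length() && isIdPart(text.charAt(i))) {
                    i++;
                }
                kinds.add("ID");
            } else {
                kinds.add("OTHER");
                i++;
            }
        }
        return kinds;
    }

    /** Parser-level check: Quote ID Quote, looking only at visible tokens. */
    static boolean acceptsQuotedId(String text) {
        return visibleTokens(text).equals(Arrays.asList("Quote", "ID", "Quote"));
    }
}
```

In this toy model acceptsQuotedId("\"name\"") and
acceptsQuotedId("\" name \"") both return true, because the parser-level
check never sees the spaces. Whether a real ANTLR grammar behaves the same
way depends on its lexer rules.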
> I was putting the quote characters within the lexer rules so that this
> wouldn't happen for the ID. For the string literal, if the whitespace is
> captured separately from the string literal, then the whitespace won't be a
> part of the string literal, which wouldn't be right. Am I understanding
> things correctly here?
>
>
Sounds right to me.
>
>
> Thanks a lot!
> Chuck
>
> On Sun, Mar 18, 2012 at 5:19 PM, Eric <researcher0x00 at gmail.com> wrote:
>
>> Hi Chuck,
>>
>> Oops. I just realized that you probably want to have spaces in your
>> StringLiteral and the grammar I just gave you doesn't allow that.
>>
>> I'll look at it some more.
>>
>> Eric
>>
>>
>>
>> On Sun, Mar 18, 2012 at 4:48 PM, Eric <researcher0x00 at gmail.com> wrote:
>>
>>> Hi Chuck,
>>>
>>> The grammar below worked for me for "test" Integer "01" and some other
>>> basic tests.
>>>
>>> Be careful with the grammar; it can easily cause a new person lots of
>>> problems. The main reason is that you have
>>> 1. Keywords
>>> 2. Identifiers
>>> 3. String Literals
>>> 4. Whitespace
>>> which all either are subsets of, or partially overlap, one another. I
>>> spent 90% of my time setting up the rules to keep them corralled and in
>>> the right order.
>>>
>>> The main changes I made were
>>> 1. pulled all of the string literals out of the parser rules
>>> 2. Used Ter's example for keywords. See:
>>> http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741
>>> 3. Created a TYPE lexer rule so that the types wouldn't become ID.
>>> 4. Changed the WS rule, mostly added +
>>> 5. Pulled the quotes out as a separate token
>>> 6. Moved UnquotedString to be the last rule since it tries to consume
>>> nearly everything.
>>> 7. Added space and tab to the negation rule for UnquotedString. I avoid
>>> negation in lexer rules like the plague; it always leads to a problem. The
>>> UnquotedString rule can become the death of you if you don't respect it.
>>>
>>> Also using ANTLRWorks "Show Input Tokens" under the run menu revealed
>>> that the space at the end of the type and before the quote was not being
>>> pulled out as a WS token and that was causing a big problem.
>>>
>>> Enjoy, Eric
>>>
>>>
>>>
>>> grammar Chuck001;
>>> // Parser Rules
>>> triplet : Quote ID Quote type Quote UnquotedString Quote ;
>>>
>>> type : keyINTEGER
>>> | keyBOOLEAN
>>> | keySTRING
>>> ;
>>>
>>> keyBOOLEAN : {input.LT(1).getText().equals("Boolean")}? TYPE;
>>> keyINTEGER : {input.LT(1).getText().equals("Integer")}? TYPE;
>>> keySTRING : {input.LT(1).getText().equals("String")}? TYPE;
>>>
>>>
>>> // Lexer Rules
>>> Quote : '"';
>>> TYPE : ('A'..'Z')('a'..'z')*
>>> ;
>>> ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
>>> ;
>>>
>>> COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>>> | '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
>>> ;
>>> WS :
>>> ( '\t'
>>> | ' '
>>> | '\r'
>>> | '\n'
>>> | '\u000C'
>>> ) + { $channel = HIDDEN; }
>>> ;
>>> fragment
>>> HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
>>> fragment
>>> ESC_SEQ
>>> : '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
>>> | UNICODE_ESC
>>> | OCTAL_ESC
>>> ;
>>> fragment
>>> OCTAL_ESC
>>> : '\\' ('0'..'3') ('0'..'7') ('0'..'7')
>>> | '\\' ('0'..'7') ('0'..'7')
>>> | '\\' ('0'..'7')
>>> ;
>>> fragment
>>> UNICODE_ESC
>>> : '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
>>> ;
>>>
>>> UnquotedString
>>> : ( ESC_SEQ | ~('\\'|'"'|' '|'\t') )*
>>> ;
>>>
>>> On Sun, Mar 18, 2012 at 3:01 PM, Charles Daniels <cjdaniels4 at gmail.com> wrote:
>>>
>>>> Hi Eric,
>>>>
>>>> Thanks for the quick response. I have downloaded ANTLRWorks 1.4.2 and
>>>> created a fresh grammar using some of the defaults that the tool generates.
>>>> Below is my grammar.
>>>>
>>>> This grammar successfully parses the following input:
>>>>
>>>> name String "value"
>>>>
>>>>
>>>> However, I want to modify this grammar so that it will successfully
>>>> parse the following input instead:
>>>>
>>>> "name" String "value"
>>>>
>>>>
>>>> In attempting to do this, I modified the grammar below by adding double
>>>> quotes around ID, like so:
>>>>
>>>> ID : '"' ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* '"'
>>>> ;
>>>>
>>>>
>>>> However, parsing fails (MissingTokenException) for the desired input.
>>>> I'm guessing that's because "value" is matched to ID rather than to STRING,
>>>> when I add the quotes around ID.
>>>>
>>>> Is there any way to get "value" to match STRING instead of matching ID
>>>> when I add quotes to ID? Will backtracking help? If so, how would I specify
>>>> it?
>>>>
>>>> Thanks,
>>>> Chuck
>>>>
>>>> --- BEGIN GRAMMAR ---
>>>> grammar Config;
>>>>
>>>> triplet : ID type STRING
>>>> ;
>>>> type : 'Boolean' | 'Integer' | 'String'
>>>> ;
>>>>
>>>> ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
>>>> ;
>>>>
>>>> COMMENT
>>>> : '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>>>> | '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
>>>> ;
>>>>
>>>> WS : ( ' '
>>>> | '\t'
>>>> | '\r'
>>>> | '\n'
>>>> ) {$channel=HIDDEN;}
>>>> ;
>>>>
>>>> STRING
>>>> : '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
>>>> ;
>>>>
>>>> fragment
>>>> HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
>>>>
>>>> fragment
>>>> ESC_SEQ
>>>> : '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
>>>> | UNICODE_ESC
>>>> | OCTAL_ESC
>>>> ;
>>>>
>>>> fragment
>>>> OCTAL_ESC
>>>> : '\\' ('0'..'3') ('0'..'7') ('0'..'7')
>>>> | '\\' ('0'..'7') ('0'..'7')
>>>> | '\\' ('0'..'7')
>>>> ;
>>>>
>>>> fragment
>>>> UNICODE_ESC
>>>> : '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
>>>> ;
>>>> --- END GRAMMAR ---
>>>>
>>>>
>>>> On Sun, Mar 18, 2012 at 12:27 PM, Eric <researcher0x00 at gmail.com> wrote:
>>>>
>>>>> Hi Chuck,
>>>>>
>>>>> Off the top of my head I would guess that STRINGLITERAL is trumping
>>>>> IDENTIFIER. In other words, the lexer generates tokens, and the tokens
>>>>> are created based on the rules in the lexer. Since STRINGLITERAL comes
>>>>> before IDENTIFIER, anything that matches STRINGLITERAL will become a
>>>>> STRINGLITERAL token, even if STRINGLITERAL defines the same character
>>>>> string patterns as IDENTIFIER, i.e. '"' ( ~('\\'|'"') )* '"' trumps '"'
>>>>> IdentifierStart IdentifierPart* '"'
>>>>>
>>>>> Can you post your full grammar? I am having to guess at the parts marked
>>>>> (copied from Java.g) and believe I have something different.
>>>>>
>>>>> Also, I strongly recommend using ANTLRWorks 1.4.2 for a new user. Note
>>>>> this is not the latest version of ANTLRWorks, which is 1.4.3. I am not
>>>>> recommending ANTLRWorks 1.4.3 because it cannot draw the NFA and DFA
>>>>> diagrams due to a bug, and when a new user tries to do this and it
>>>>> doesn't work they think they did something wrong when in fact it is a
>>>>> bug from ANTLR 3.4, which is used by ANTLRWorks 1.4.3.
>>>>>
>>>>> Also, you can search previous posts to the list by using
>>>>> http://antlr.markmail.org/
>>>>>
>>>>> Hope that helps, Eric
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Mar 18, 2012 at 11:22 AM, Charles Daniels <
>>>>> cjdaniels4 at gmail.com> wrote:
>>>>>
>>>>>> I am very new to ANTLR and I am having trouble properly defining part
>>>>>> of a grammar that I am constructing to recognize a specialized
>>>>>> configuration file syntax (already defined, so I cannot change it).
>>>>>>
>>>>>> The trouble I am having is recognizing the following type of entry in
>>>>>> the config file:
>>>>>>
>>>>>> "name" type "value"
>>>>>>
>>>>>>
>>>>>> where the following rules apply:
>>>>>>
>>>>>> 1. The double quotes are a required part of the syntax, both for the
>>>>>> name and the value.
>>>>>> 2. A "name" is essentially a Java identifier
>>>>>> 3. A "value" is a string literal
>>>>>>
>>>>>>
>>>>>> I have the following grammar so far:
>>>>>>
>>>>>> triplet : IDENTIFIER type STRINGLITERAL ;
>>>>>>
>>>>>> type : 'Boolean' | 'Integer' | 'String' ;
>>>>>>
>>>>>> STRINGLITERAL : (copied from Java.g)
>>>>>>
>>>>>> IDENTIFIER : '"' IdentifierStart IdentifierPart* '"' ;
>>>>>>
>>>>>> IdentifierStart : (copied from Java.g)
>>>>>>
>>>>>> IdentifierPart : (copied from Java.g)
>>>>>>
>>>>>> When I compile this grammar, ANTLR hangs. When I remove the double
>>>>>> quotes from IDENTIFIER, it compiles successfully. I am guessing that
>>>>>> when I include the double quotes in IDENTIFIER they are somehow causing
>>>>>> the compilation to hang due to the double quotes that are also in
>>>>>> STRINGLITERAL.
>>>>>>
>>>>>> Does anybody have any suggestions on how to define this so that I can
>>>>>> use double quotes around names and values and the compiler doesn't hang?
>>>>>>
>>>>>> Thanks,
>>>>>> Chuck
>>>>>>