[antlr-interest] Learning the basics of ANTLR

Tue Oct 13 12:21:53 PDT 2009

Evan;

(Please also reply to the list so others can help as well.)

I think the problem here may be that the xmldecl parser rule is looking for the individual characters 'x', 'm' and 'l' as separate tokens, whereas the lexer has already absorbed them into a single GENERIC_ID token, so the parser never sees them individually.  Take a look at how other grammars handle keywords vs. ids, and also search for info on case insensitivity, which is a special problem in its own right.
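
To make the tokenization point concrete, here is a minimal Python sketch of a longest-match tokenizer. It is not ANTLR-generated code, and the token names PI_OPEN, PI_CLOSE and EQ are hypothetical stand-ins for the quoted literals in your grammar; VALUE is treated as an ordinary token purely for illustration. It only shows that "xml" comes out of a lexer like this as one GENERIC_ID, never as three one-character tokens.

```python
import re

# Hypothetical stand-ins for the lexer rules in the quoted grammar.
# This is an illustration of greedy token matching, not ANTLR output.
TOKEN_SPEC = [
    ("PI_OPEN", r"<\?"),
    ("PI_CLOSE", r"\?>"),
    ("GENERIC_ID", r"[A-Za-z_:][A-Za-z0-9._:\-]*"),
    # Assumption: VALUE is treated as a normal token here.
    ("VALUE", "'[^']*'|\"[^\"]*\""),
    ("EQ", r"="),
    ("WS", r"[ \t\r\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Split text into (token_name, matched_text) pairs, left to right."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"no token matches at offset {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize("<?xml version='1.0'?>")
for kind, text in tokens:
    print(kind, repr(text))
```

In this sketch the second token is ('GENERIC_ID', 'xml'). A parser rule that asks for 'x' followed by 'm' followed by 'l' has nothing to match against, which is consistent with the mismatch you are seeing.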

-- Graham

At 10/13/2009 11:50 AM, Evan Metheny wrote:
>INPUT
>------------------------------------------------------------------------
><?xml version='1.0'?>
><!DOCTYPE component [
><!ELEMENT component (PCDATA|sub)*>
><!ATTLIST component
>          attr CDATA #IMPLIED
>          attr2 CDATA #IMPLIED
>>
><!ELMENT sub EMPTY>
>
>]>
><component attr="val'ue" attr2='val"ue'>
><!-- This is a comment -->
>Text
><![CDATA[huhu]]>
>öäüß
>&amp;
>&lt;
><?xtal cursor='11'?>
><sub/>
><sub></sub>
></component>
>
>XML.g
>--------------------------------
>
>grammar XML;
>
>options {
>backtrack = true;
>}
>
>document
>	:	xmldecl WS? doctype
>	;
>
>doctype
>    :
>        '<!DOCTYPE' WS? GENERIC_ID
>
>        WS?
>        (
>            ( 'SYSTEM' WS? VALUE
>
>
>            | 'PUBLIC' WS? VALUE WS? VALUE
>
>
>            )
>            ( WS )?
>        )?
>        ( INTERNAL_DTD
>
>        )?
>		'>'
>	;
>
>INTERNAL_DTD : '[' (options {greedy=false;} : .)* ']' ;
>
>pi :
>        '<?' GENERIC_ID WS?
>
>        ( attribute WS? )*  '?>'
>	;
>
>xmldecl :
>        '<?' ('x'|'X') ('m'|'M') ('l'|'L') WS?
>
>        ( attribute WS? )*  '?>'
>	;
>
>
>element
>    : ( start_tag
>            (element
>            | PCDATA
>
>            | cdata
>
>            | comment
>
>            | pi
>            )*
>            end_tag
>        | emptyelement
>        )
>    ;
>
>start_tag
>    : '<' WS? GENERIC_ID WS?
>
>        ( attribute WS? )* '>'
>    ;
>
>emptyelement
>    : '<' WS? GENERIC_ID WS?
>
>        ( attribute WS? )* '/>'
>    ;
>
>attribute
>    : GENERIC_ID WS? '=' WS? VALUE
>
>    ;
>
>end_tag
>    : '</' WS? GENERIC_ID WS? '>'
>
>    ;
>
>comment
>	:	'<!--' (options {greedy=false;} : .)* '-->'
>	;
>
>cdata
>	:	'<![CDATA[' (options {greedy=false;} : .)* ']]>'
>	;
>
>
>
>GENERIC_ID
>    : ( LETTER | '_' | ':')
>        ( options {greedy=true;} :
>        LETTER | '0'..'9' | '.' | '-' | '_' | ':' )*
>	;
>
>fragment LETTER
>	: 'a'..'z'
>	| 'A'..'Z'
>	;
>
>
> WS  :
>        (   ' '
>        |   '\t'
>        |  ( '\n'
>            |	'\r\n'
>            |	'\r'
>            )
>        )+
>    ;
>
>fragment PCDATA : (~'<')+ ;
>
>fragment VALUE :
>        ( '\"' (~'\"')* '\"'
>        | '\'' (~'\'')* '\''
>        )
>	;
>
>
>On Tue, Oct 13, 2009 at 11:43 AM, Graham Wideman
><gwlist at grahamwideman.com> wrote:
>> You haven't shown your revised grammar. However, in the old grammar the
>> xmldecl rule says that an attribute is required, so feeding in just:
>>
>> xml
>>
>> will not satisfy xmldecl.
>>
>> -- Graham
>>
>> At 10/13/2009 11:24 AM, Evan Metheny wrote:
>>>Thanks for the response Graham
>>>
>>>> Fragments can only be part of another lexer rule; they are not stand-alone
>>>> token-producing lexer rules.  Hence the missing token exception.
>>>
>>>OK, that makes sense. When changing fragment GENERIC_ID to GENERIC_ID,
>>>the "xmldecl" rule breaks with a "mismatched set exception" when trying
>>>to recognize "xml". That's what I was trying to explain with:
>>>
>>>>
>>>> Also:
>>>>> I can't
>>>>>understand why it would break the recognition of "XML" when it's before
>>>>>the attribute call.
>>>>
>>>> So far as I know, the order in which the lexer and parser rules
>>>> appear in the .g file has no impact.
>>>
>>>Thanks
>>>
>>>-Evan
>>
>>