[antlr-interest] Learning the basics of ANTLR
Graham Wideman
gwlist at grahamwideman.com
Tue Oct 13 12:21:53 PDT 2009
Evan;
(Please also reply to the list so others can help as well.)
I think the problem here may be that the xmldecl parser rule is looking for the explicit characters 'x', 'm', 'l' as separate tokens, whereas the lexer has already absorbed "xml" into a single GENERIC_ID token. Take a look at how other grammars handle keywords vs. IDs, and also search for info on case insensitivity, which is a special problem.
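For instance, one way to sidestep the clash is to let the lexer hand back a GENERIC_ID and have the parser check that token's text. This is only an untested sketch, assuming ANTLR 3 with the default Java target (input.LT(1) and getText() are Java-runtime calls); everything else in the rule is copied from your grammar:

```antlr
// Accept GENERIC_ID after '<?' only if its text is "xml" in any letter case.
// The {...}? block is a semantic predicate evaluated against the next token.
xmldecl
    :   '<?' {input.LT(1).getText().equalsIgnoreCase("xml")}? GENERIC_ID WS?
        ( attribute WS? )* '?>'
    ;
```

Since the comparison ignores case, this would also cover the 'x'|'X' style alternatives. Another option might be a dedicated lexer rule for the whole '<?xml' opener, but then you would need to check how it interacts with the plain '<?' used by the pi rule.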
-- Graham
At 10/13/2009 11:50 AM, Evan Metheny wrote:
>INPUT
>------------------------------------------------------------------------
><?xml version='1.0'?>
><!DOCTYPE component [
><!ELEMENT component (PCDATA|sub)*>
><!ATTLIST component
> attr CDATA #IMPLIED
> attr2 CDATA #IMPLIED
>>
><!ELMENT sub EMPTY>
>
>]>
><component attr="val'ue" attr2='val"ue'>
><!-- This is a comment -->
>Text
><![CDATA[huhu]]>
>öäüß
>&amp;
>&lt;
><?xtal cursor='11'?>
><sub/>
><sub></sub>
></component>
>
>
>
>
>XML.g
>--------------------------------
>
>grammar XML;
>
>options {
>backtrack = true;
>}
>
>document
> : xmldecl WS? doctype
> ;
>
>doctype
> :
> '<!DOCTYPE' WS? GENERIC_ID
>
> WS?
> (
> ( 'SYSTEM' WS? VALUE
>
>
> | 'PUBLIC' WS? VALUE WS? VALUE
>
>
> )
> ( WS )?
> )?
> ( INTERNAL_DTD
>
> )?
> '>'
> ;
>
>INTERNAL_DTD : '[' (options {greedy=false;} : .)* ']' ;
>
>pi :
> '<?' GENERIC_ID WS?
>
> ( attribute WS? )* '?>'
> ;
>
>xmldecl :
> '<?' ('x'|'X') ('m'|'M') ('l'|'L') WS?
>
> ( attribute WS? )* '?>'
> ;
>
>
>element
> : ( start_tag
> (element
> | PCDATA
>
> | cdata
>
> | comment
>
> | pi
> )*
> end_tag
> | emptyelement
> )
> ;
>
>start_tag
> : '<' WS? GENERIC_ID WS?
>
> ( attribute WS? )* '>'
> ;
>
>emptyelement
> : '<' WS? GENERIC_ID WS?
>
> ( attribute WS? )* '/>'
> ;
>
>attribute
> : GENERIC_ID WS? '=' WS? VALUE
>
> ;
>
>end_tag
> : '</' WS? GENERIC_ID WS? '>'
>
> ;
>
>comment
> : '<!--' (options {greedy=false;} : .)* '-->'
> ;
>
>cdata
> : '<![CDATA[' (options {greedy=false;} : .)* ']]>'
> ;
>
>
>
>GENERIC_ID
> : ( LETTER | '_' | ':')
> ( options {greedy=true;} :
> LETTER | '0'..'9' | '.' | '-' | '_' | ':' )*
> ;
>
>fragment LETTER
> : 'a'..'z'
> | 'A'..'Z'
> ;
>
>
> WS :
> ( ' '
> | '\t'
> | ( '\n'
> | '\r\n'
> | '\r'
> )
> )+
> ;
>
>fragment PCDATA : (~'<')+ ;
>
>fragment VALUE :
> ( '\"' (~'\"')* '\"'
> | '\'' (~'\'')* '\''
> )
> ;
>
>
>On Tue, Oct 13, 2009 at 11:43 AM, Graham Wideman
><gwlist at grahamwideman.com> wrote:
>> You haven't shown your revised grammar. However, in the old grammar the
>xmldecl rule says that an attribute is required, so feeding in just:
>>
>> xml
>>
>> will not satisfy xmldecl.
>>
>> -- Graham
>>
>> At 10/13/2009 11:24 AM, Evan Metheny wrote:
>>>Thanks for the response Graham
>>>
>>>> Fragments can only be part of another lexer rule, they are not stand-alone
>>>token-producing lexer rules. Hence the missing token exception.
>>>
>>>OK that makes sense, when changing fragment GENERIC_ID to GENERIC_ID.
>>>The "xmldecl" rule breaks with "mismatched set exception" when trying
>>>to recognize "xml". That's what I was trying to explain with:
>>>
>>>>
>>>> Also:
>>>>> I can't
>>>>>understand why it would break the recognition of "XML" when it's before
>>>>>the attribute call.
>>>>
>>>> So far as I know, there is no impact of order in which the lexer and
>>>parser rules appear in the .g file.
>>>
>>>Thanks
>>>
>>>-Evan
>>
>>