[antlr-interest] Fwd: Mismatched token problem

Thu Jan 15 08:28:51 PST 2009

Just realized that when I hit "Reply" it sent the message to Kevin
instead of the list.  Can't we change the mailman configuration to set
the Reply-To: header in the email to always be to the mailing list?

Rich

---------- Forwarded message ----------
From: Richard Wallace <rwallace at thewallacepack.net>
Date: Wed, Jan 14, 2009 at 7:19 PM
Subject: Re: [antlr-interest] Mismatched token problem
To: "Kevin J. Cummings" <cummings at kjchome.homeip.net>

So, rather than continuing to talk about it in the abstract and showing
you only bits, I've put the project I'm working on up on Google
Code <http://code.google.com/p/cssselectors/>.  It's a library for
using CSS selectors to get elements out of XML documents.  I'm hoping
to be able to use it in integration tests of web applications rather
than having to use XPath, which I've never really liked.  The ANTLR
grammar can be found at
<http://code.google.com/p/cssselectors/source/browse/trunk/src/main/antlr/com/threelevers/css/CssSelectors.g>.

On Wed, Jan 14, 2009 at 4:51 PM, Kevin J. Cummings
<cummings at kjchome.homeip.net> wrote:
> Richard Wallace wrote:
>
>> Ok, I'm feeling really dense right now.  I put the rules in as follows:
>>
>> fragment IDENTFRAGMENT
>>    : ('_' | 'a'..'z'| 'A'..'Z' | '\u0100'..'\ufffe' )
>>    ;
>>
>> fragment IDENTNUMFRAGMENT
>>    : IDENTFRAGMENT | '0' .. '9'
>>    ;
>>
>> IDENT
>>    : IDENTFRAGMENT ( DASH | IDENTNUMFRAGMENT )*
>>    ;
>>
>> DASH
>>    : '-' ( options{greedy=true;} : IDENTFRAGMENT { $type = IDENT; } )?
>>    ;
>>
>> And I even understand what it means (I think), but I'm still running
>> into the problem that in the expression 4n-1, n-1 is still being
>> considered an expression.  I had to change protected to fragment to
>
> Sorry, I thought you were using ANTLR 2.7.7; that must have been someone
> else I was chatting with.  Yes, fragment is correct for ANTLR 3.x.
>
>> get the lexer to not try and match 4 as an IDENTNUMFRAGMENT and the
>> IDENT rule to match the language.  But I don't think that should cause
>> this not to work, should it?  I must be missing something.  Any ideas?
>
> In your expr rule you specify S* as possible whitespace separators. Also, if
> you need to match n-1 as an IDENT, then it's possible that you need
> another fragment to catch the 'n' and what follows as an IDENT.
>

Sorry, in this case I don't want n-1 to be an IDENT.  It should be in
most cases, but in this case, when inside a :nth-child() function it
shouldn't be considered an IDENT.  In CSS it is perfectly valid to
have something like
   #n-1
where n-1 is the id of the element we want to find.
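
To make the two readings concrete, here is a minimal sketch in Python
(not ANTLR, and not taken from CssSelectors.g) of one possible approach:
treat the argument of :nth-child() with its own small pattern, and keep
the ordinary identifier pattern everywhere else.  The names IDENT, NTH,
parse_nth and read_id are hypothetical, and the an+b pattern is
deliberately partial (it ignores forms such as "odd", "even" or a bare
number).

```python
import re

# Hypothetical patterns for illustration only; not taken from CssSelectors.g.
IDENT = re.compile(r"[_a-zA-Z][-_a-zA-Z0-9]*")
# an+b form: optional sign and digits, then 'n', then an optional signed offset.
NTH = re.compile(r"\s*([+-]?\d*)n\s*(?:([+-])\s*(\d+))?\s*")


def parse_nth(text):
    """Return (a, b) for an 'an+b' expression, or None if it does not match."""
    m = NTH.fullmatch(text)
    if m is None:
        return None
    coeff, sign, offset = m.groups()
    if coeff in ("", "+"):
        a = 1            # a bare 'n' or '+n' means a coefficient of 1
    elif coeff == "-":
        a = -1           # '-n' means a coefficient of -1
    else:
        a = int(coeff)
    b = int(offset) if offset else 0
    if sign == "-":
        b = -b
    return a, b


def read_id(selector):
    """Return the id name from '#name', treating '-' as part of the name."""
    if not selector.startswith("#"):
        return None
    m = IDENT.fullmatch(selector[1:])
    return m.group(0) if m else None
```

With these definitions, parse_nth("4n-1") reads the text as a coefficient
and an offset, while read_id("#n-1") keeps n-1 as a single name.  The
point is only that the same characters need different handling depending
on context.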

The reason I include whitespace explicitly in some places rather than
ignoring it is because it is important in one context in CSS.  In the
selector
   #a .b
the space between #a and .b is significant because it indicates
that we are looking for an element with a class of "b" that is a
descendant of an element with an id of "a".  I couldn't figure out a
way to make the spaces everywhere else be ignored but still have this
one be recognized properly.  If the space isn't recognized properly,
"#a .b" is treated the same as "#a.b" which has a completely different
meaning.
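
One way to picture the problem is a two-pass sketch in Python (an
illustration only, under assumed token names, not how the grammar above
works): lex whitespace as a token first, then keep it only where it sits
between two simple selectors and drop it everywhere else.  TOKEN,
tokenize and mark_descendants are hypothetical names, and the token set
is far smaller than real CSS.

```python
import re

# Hypothetical token set for illustration; a real grammar has many more tokens.
TOKEN = re.compile(
    r"(?P<HASH>#[-\w]+)|(?P<CLASS>\.[-\w]+)|(?P<COMMA>,)|(?P<GT>>)|(?P<S>\s+)"
)


def tokenize(selector):
    """Split a selector into (type, text) pairs, keeping whitespace tokens."""
    tokens, pos = [], 0
    while pos < len(selector):
        m = TOKEN.match(selector, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}: {selector[pos]!r}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def mark_descendants(tokens):
    """Turn whitespace between two simple selectors into DESCENDANT; drop the rest."""
    simple = {"HASH", "CLASS"}
    out = []
    for i, (kind, text) in enumerate(tokens):
        if kind != "S":
            out.append((kind, text))
            continue
        before = tokens[i - 1][0] if i > 0 else None
        after = tokens[i + 1][0] if i + 1 < len(tokens) else None
        if before in simple and after in simple:
            out.append(("DESCENDANT", " "))
    return out
```

Here mark_descendants(tokenize("#a .b")) keeps a DESCENDANT token between
the two selectors, "#a.b" produces none, and the spaces in "#a > .b" are
dropped because they sit next to an explicit combinator.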

> By default, ANTLR does greedy matching of tokens. In other words, it tries
> to match as much as possible based on your rules.  It also tokenizes before
> it parses.  So, if you don't want 4n-1 to be NUMBER IDENT, then you need to
> have a lexer rule to catch something different.  Does it help if you try a
> lexer rule that catches NUMBER 'n' as a TOKEN? and then use *that* in your
> expr rule?
>

I'm not sure exactly what you mean here.  I've looked at a bunch of
examples and can't figure it out.  I tried adding a

tokens {
   MAGN;
}

but then I'm not sure where to put the lexer rule.  I tried

ATERM : ( NUMBER? 'n' ) -> MAGN

but ANTLR claims MAGN is an unexpected token so obviously I'm doing
something wrong.
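
As a toy model of the effect Kevin describes (tokens are matched
greedily, before any parsing happens), the following Python sketch picks
the longest match among a list of rules at each position.  ANTLR's real
lexer chooses between rules differently, so this is only an assumption-laden
illustration; the rule names mirror the thread, but the regexes and
the tokenize helper are made up for the example.

```python
import re

# Toy rule sets; the names mirror the thread, the regexes are assumptions.
BASE_RULES = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("IDENT", re.compile(r"[_a-zA-Z][-_a-zA-Z0-9]*")),
    ("DASH", re.compile(r"-")),
]
# A dedicated token for NUMBER followed by 'n', as suggested in the thread.
ATERM = ("ATERM", re.compile(r"[0-9]+n"))


def tokenize(text, rules):
    """Longest-match tokenizer: at each position the longest lexeme wins."""
    tokens, pos = [], 0
    while pos < len(text):
        candidates = []
        for name, rx in rules:
            m = rx.match(text, pos)
            if m:
                candidates.append((name, m.group()))
        if not candidates:
            raise ValueError(f"no rule matches at {pos}: {text[pos]!r}")
        # max() returns the first of several equally long candidates,
        # so earlier rules win ties.
        name, lexeme = max(candidates, key=lambda c: len(c[1]))
        tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens
```

In this model, tokenize("4n-1", BASE_RULES) yields NUMBER then IDENT,
because n-1 is a valid identifier.  With [ATERM] + BASE_RULES the same
input becomes ATERM, DASH, NUMBER, since "4n" is longer than "4".  A
bare "n-1" still comes out as a single IDENT under either rule set.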

> Also, when I code expression parsers that don't care about whitespace, I
> just set whitespace to be ignored in the lexer.  ANTLR will still stop
> lexing tokens when it finds a whitespace.  So, in general, I never reference
> whitespace in the parser.  You need to fix your token stream so that the
> parser does the right thing with what it finds.
>
> Make a lexer rule for:  DASH? NUMBER? 'n'  Or maybe just for NUMBER 'n'
>

I tried a rule called ATERM that looked like

ATERM : DASH? NUMBER? 'n' ;

and tried putting that in the nth_child_expr as

nth_child_expr : ATERM S* ('+' | DASH) S* NUMBER

and that didn't help either.

> Sorry for being vague, but I hope it's helpful.
>

Hopefully, now that my full grammar is out there you can take a better
look at it and see what's going on.  I appreciate all the help, it's
been really valuable and I'm learning a lot (mostly how much I have to
learn about antlr ;)).

>> Rich
>
> --
> Kevin J. Cummings
> kjchome at rcn.com
> cummings at kjchome.homeip.net
> cummings at kjc386.framingham.ma.us
> Registered Linux User #1232 (http://counter.li.org)
>