[antlr-interest] Mismatched token problem

Fri Jan 16 16:37:52 PST 2009

Well, I've decided to punt on this issue for now.  I'd like to solve
it such that I can have antlr parse the nth expression, but just can't
seem to wrap my head around how.  For now I'm just going to grab the
text of what's in the parens and parse it out with regex.  It's not
pretty and I'd much rather figure out the right way to do it in antlr,
but right now I just want something that works.  If anyone wants to
take a look at the grammar at
<http://code.google.com/p/cssselectors/source/browse/trunk/src/main/antlr/com/threelevers/css/CssSelectors.g>
and offer suggestions for improvement on this issue and anything else
you see, I'd greatly appreciate it.
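
For anyone taking the same stopgap route, here is a minimal sketch of the regex fallback described above, written in Python purely for illustration. The names NTH_RE and parse_nth are hypothetical and not part of the cssselectors project. It assumes the text inside the parens has the CSS form an+b (or the keywords odd/even) and returns the pair (a, b).

```python
import re

# Hypothetical helper, not taken from the cssselectors project.
# Matches "odd", "even", "an+b" style expressions, or a bare integer.
NTH_RE = re.compile(
    r"""^\s*(?:
        (?P<keyword>odd|even)
      | (?P<a>[+-]?\d*)n(?:\s*(?P<sign>[+-])\s*(?P<b>\d+))?
      | (?P<only_b>[+-]?\d+)
    )\s*$""",
    re.VERBOSE | re.IGNORECASE,
)


def parse_nth(text):
    """Return (a, b) for an nth-child argument of the form an+b."""
    m = NTH_RE.match(text)
    if m is None:
        raise ValueError(f"not an nth expression: {text!r}")
    if m.group("keyword"):
        # In CSS, odd is 2n+1 and even is 2n.
        return (2, 1) if m.group("keyword").lower() == "odd" else (2, 0)
    if m.group("only_b") is not None:
        # A bare integer selects exactly that position: 0n+b.
        return (0, int(m.group("only_b")))
    a_text = m.group("a")
    if a_text in ("", "+"):
        a = 1           # "n" and "+n" mean 1n
    elif a_text == "-":
        a = -1          # "-n" means -1n
    else:
        a = int(a_text)
    b = int(m.group("b")) if m.group("b") is not None else 0
    if m.group("sign") == "-":
        b = -b
    return (a, b)
```

With this sketch, parse_nth("4n-1") gives (4, -1) and parse_nth("odd") gives (2, 1). Whitespace around the sign is tolerated, so "2n + 1" is accepted as well.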

Thanks again for the help and the great tool!

Rich

On Wed, Jan 14, 2009 at 7:19 PM, Richard Wallace
<rwallace at thewallacepack.net> wrote:
> So, rather than continuing to talk about it all in an abstract way and
> showing you just bits, I've put the project I'm working on up on Google
> Code <http://code.google.com/p/cssselectors/>.  It's a library for
> using CSS selectors to get elements out of XML documents.  I'm hoping
> to be able to use it in integration tests of web applications rather
> than having to use XPath, which I've never really liked.  The ANTLR
> grammar can be found at
> <http://code.google.com/p/cssselectors/source/browse/trunk/src/main/antlr/com/threelevers/css/CssSelectors.g>.
>
> On Wed, Jan 14, 2009 at 4:51 PM, Kevin J. Cummings
> <cummings at kjchome.homeip.net> wrote:
>> Richard Wallace wrote:
>>
>>> Ok, I'm feeling really dense right now.  I put the rules in as follows:
>>>
>>> fragment IDENTFRAGMENT
>>>    : ('_' | 'a'..'z'| 'A'..'Z' | '\u0100'..'\ufffe' )
>>>    ;
>>>
>>> fragment IDENTNUMFRAGMENT
>>>    : IDENTFRAGMENT | '0' .. '9'
>>>    ;
>>>
>>> IDENT
>>>    : IDENTFRAGMENT ( DASH | IDENTNUMFRAGMENT )*
>>>    ;
>>>
>>> DASH
>>>    : '-' ( options{greedy=true;} : IDENTFRAGMENT { $type = IDENT; } )?
>>>    ;
>>>
>>> And I even understand what it means (I think), but I'm still running
>>> into the problem that in the expression 4n-1, n-1 is still being
>>> considered an expression.  I had to change protected to fragment to
>>
>> Sorry, I thought you were using ANTLR 2.7.7; that must have been someone else I
>> was chatting with.  Yes, fragment is correct for ANTLR 3.x.
>>
>>> get the lexer to not try and match 4 as an IDENTNUMFRAGMENT and the
>>> IDENT rule to match the language.  But I don't think that should cause
>>> this not to work, should it?  I must be missing something.  Any ideas?
>>
>> In your expr rule you specify S* as possible whitespace separators.  Also, if
>> you need to match n-1 as an IDENT, then it's possible that you need to add
>> another fragment to catch the 'n' and what follows as an IDENT.
>>
>
> Sorry, in this case I don't want n-1 to be an IDENT.  It should be in
> most cases, but in this case, when inside a :nth-child() function it
> shouldn't be considered an IDENT.  In CSS it is perfectly valid to
> have something like
>    #n-1
> where n-1 is the id of the element we want to find.
>
> The reason I include whitespace explicitly in some places rather than
> ignoring it is because it is important in one context in CSS.  In the
> selector
>    #a .b
> the space between the #a and .b is significant because it indicates
> that we are looking for an element with a class of "b" that is a
> descendant of an element with an id of "a".  I couldn't figure out a
> way to make the spaces everywhere else be ignored but still have this
> one be recognized properly.  If the space isn't recognized properly,
> "#a .b" is treated the same as "#a.b" which has a completely different
> meaning.
>
>> By default, ANTLR does greedy matching of tokens. In other words, it tries
>> to match as much as possible based on your rules.  It also tokenizes before
>> it parses.  So, if you don't want 4n-1 to be NUMBER IDENT, then you need to
>> have a lexer rule to catch something different.  Does it help if you try a
>> lexer rule that catches NUMBER 'n' as a TOKEN? and then use *that* in your
>> expr rule?
>>
>
> I'm not sure exactly what you mean here.  I've looked at a bunch of
> examples and can't figure it out.  I tried adding a
>
> tokens {
>    MAGN;
> }
>
> but then I'm not sure where to put the lexer rule.  I tried
>
> ATERM : ( NUMBER? 'n' ) -> MAGN
>
> but ANTLR claims MAGN is an unexpected token so obviously I'm doing
> something wrong.
>
>> Also, when I code expression parsers that don't care about whitespace, I
>> just set whitespace to be ignored in the lexer.  ANTLR will still stop
>> lexing tokens when it finds a whitespace.  So, in general, I never reference
>> whitespace in the parser.  You need to fix your token stream so that the
>> parser does the right thing with what it finds.
>>
>> Make a lexer rule for:  DASH? NUMBER? 'n'  Or maybe just for NUMBER 'n'
>>
>
> I tried a rule called ATERM that looked like
>
> ATERM : DASH? NUMBER? 'n' ;
>
> and tried putting that in the nth_child_expr as
>
> nth_child_expr : ATERM S* ('+' | DASH) S* NUMBER
>
> and that didn't help either.
>
>> Sorry for being vague, but I hope it's helpful.
>>
>
> Hopefully, now that my full grammar is out there you can take a better
> look at it and see what's going on.  I appreciate all the help, it's
> been really valuable and I'm learning a lot (mostly how much I have to
> learn about antlr ;)).
>
>>> Rich
>>
>> --
>> Kevin J. Cummings
>> kjchome at rcn.com
>> cummings at kjchome.homeip.net
>> cummings at kjc386.framingham.ma.us
>> Registered Linux User #1232 (http://counter.li.org)
>>
>
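
To illustrate the tokenization issue discussed in the quoted thread, here is a rough Python sketch of a greedy tokenizer. The patterns are simplified stand-ins for the NUMBER, IDENT, DASH and S rules, not the real grammar, and TOKEN_SPEC and tokenize are hypothetical names. Because the lexer runs before the parser and knows nothing about being inside :nth-child(), it folds "n-1" into a single IDENT.

```python
import re

# Simplified approximations of the lexer rules quoted above (assumption:
# IDENT may contain dashes and digits after its first character).
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("IDENT", r"[_a-zA-Z\u0100-\ufffe][-_a-zA-Z0-9\u0100-\ufffe]*"),
    ("PLUS", r"\+"),
    ("DASH", r"-"),
    ("S", r"[ \t\r\n]+"),
]
# The alternatives start with disjoint characters, so at each position at
# most one of them applies and it consumes as much input as it can.
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Split text into (token_type, lexeme) pairs, with no parser context."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Here tokenize("4n-1") yields NUMBER "4" followed by IDENT "n-1", which is why a parser rule expecting a separate DASH and NUMBER never sees them. With spaces, as in "4n - 1", the dash and the 1 come out as their own tokens.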