[antlr-interest] Re: Antlr 3.0 spaces between tokens

Terence Parr parrt at cs.usfca.edu
Thu Nov 11 08:00:17 PST 2004



On Nov 11, 2004, at 2:25 AM, matthew ford wrote:

>
> Yes I did something much the same
>
> But what about a  SKIP  in the parser side?
> I cannot see any conceptual problem providing it.

None.  Mitchell yelled at me until I put it in.  Already working great.
The new mechanism has all tokens coming to the parser.  You can set a
token's channel via a simple action in the lexer.  The parser tunes to
whatever channel it wants.  All tokens are always available and all
chars are always available in the new scheme.  You can implement your
own less-buffered stuff if you're reading 30G of data.  The default
system will be very nice.  The default token match creates a token
object, but does not copy chars or allocate a string; it just records
the start/stop index into the char stream.  Trees will just point at a
token; no copy there either. :)
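
A minimal sketch of the channel idea described above, in plain Java rather than the real ANTLR 3 API (which this message does not show).  The names Tok, ChannelStream, DEFAULT and HIDDEN are hypothetical, and the "lexer" only separates whitespace runs from non-whitespace runs.  Every token stays in the buffer, each one stores only start/stop indexes into the input, and the consumer chooses which channel to read.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical token: records indexes into the input, copies no text. */
final class Tok {
    static final int DEFAULT = 0, HIDDEN = 1; // assumed channel numbers
    final int start, stop, channel;           // stop is inclusive

    Tok(int start, int stop, int channel) {
        this.start = start;
        this.stop = stop;
        this.channel = channel;
    }

    /** The text is materialized only when somebody asks for it. */
    String text(String input) {
        return input.substring(start, stop + 1);
    }
}

/** Hypothetical buffer holding all tokens; the consumer picks a channel. */
final class ChannelStream {
    final String input;
    final List<Tok> all = new ArrayList<>();

    ChannelStream(String input) {
        this.input = input;
        int i = 0;
        while (i < input.length()) {
            int start = i;
            boolean ws = Character.isWhitespace(input.charAt(i));
            // Extend the token while the character class stays the same.
            while (i < input.length()
                    && Character.isWhitespace(input.charAt(i)) == ws) {
                i++;
            }
            all.add(new Tok(start, i - 1, ws ? Tok.HIDDEN : Tok.DEFAULT));
        }
    }

    /** Texts of the tokens on one channel; the others remain buffered. */
    List<String> onChannel(int channel) {
        List<String> out = new ArrayList<>();
        for (Tok t : all) {
            if (t.channel == channel) {
                out.add(t.text(input));
            }
        }
        return out;
    }
}
```

A parser that ignores whitespace would read only Tok.DEFAULT; a rule that cares about whitespace can still look at the full buffer in ChannelStream.all.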

Anyway, I'm off to Montreal to meet with Etienne Gagnon, the SableCC
guy; giving a talk tomorrow.  Racing off to the airport in a few
minutes so I'm afraid I won't be able to contribute much until I get
back on Sunday.  Damn, this discussion and the one on trees are getting
interesting!  You can be sure I'll be figuring out his tree visitor
generation stuff (like TreeDL?).

Woohoo!

Ter

>
> matthew
>
> ----- Original Message -----
> From: "lgcraymer" <lgc at mail1.jpl.nasa.gov>
> To: <antlr-interest at yahoogroups.com>
> Sent: Thursday, November 11, 2004 8:19 PM
> Subject: [antlr-interest] Re: Antlr 3.0 spaces between tokens
>
>
>>
>>
>> Context-dependent lexing is a nasty problem.  ANTLR 3 probably won't
>> solve it.  I ran into exactly the same problem in an expression
>> grammar for spacecraft sequencing.  The cleanest approach I could come
>> up with was to have a counter that was incremented by LBRACKET and
>> decremented by RBRACKET.  If the counter was zero, then whitespace
>> tokens were marked "SKIP"; if it was positive, then they were "WS" and
>> recognized by the parser.  That helped simplify the grammar.
>>
>> --Loring
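
The counter approach described above can be sketched as follows.  This is plain Java, not an ANTLR grammar, and the class name BracketLexer and the token type names WORD and WS are assumptions made for the illustration.  Whitespace outside brackets is skipped; whitespace inside [ ] is emitted as a WS token for the parser to see.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical lexer sketch: whitespace is a real token only inside [ ]. */
final class BracketLexer {
    /** Returns tokens as "TYPE:text" strings, omitting skipped whitespace. */
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int depth = 0; // incremented by LBRACKET, decremented by RBRACKET
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '[') {
                depth++;
                out.add("LBRACKET:[");
                i++;
            } else if (c == ']') {
                if (depth > 0) {
                    depth--; // guard against unbalanced input
                }
                out.add("RBRACKET:]");
                i++;
            } else if (Character.isWhitespace(c)) {
                int start = i;
                while (i < input.length()
                        && Character.isWhitespace(input.charAt(i))) {
                    i++;
                }
                // depth == 0: treat as SKIP (emit nothing); depth > 0: WS.
                if (depth > 0) {
                    out.add("WS:" + input.substring(start, i));
                }
            } else {
                int start = i;
                while (i < input.length()
                        && !Character.isWhitespace(input.charAt(i))
                        && input.charAt(i) != '['
                        && input.charAt(i) != ']') {
                    i++;
                }
                out.add("WORD:" + input.substring(start, i));
            }
        }
        return out;
    }
}
```

For the input "a = m[1 2]" only the space between 1 and 2 reaches the parser; the spaces around = are dropped.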
>>
>>
>> --- In antlr-interest at yahoogroups.com, "matthew ford"
>> <Matthew.Ford at f...> wrote:
>>> That is what I am talking about:
>>> whitespace as a syntax feature and not just a token separator.
>>> This is usually only in a small number of rules.
>>> One example I had was a math language where whitespace was
>>> significant inside [ ] when indexing matrices, but elsewhere it was
>>> just a token separator.
>>>
>>> matthew
>>>
>>> ----- Original Message -----
>>> From: "lgcraymer" <lgc at m...>
>>> To: <antlr-interest at yahoogroups.com>
>>> Sent: Thursday, November 11, 2004 6:13 PM
>>> Subject: [antlr-interest] Re: Antlr 3.0 spaces between tokens
>>>
>>>
>>>>
>>>>
>>>> As usual--you ignore whitespace during parsing.  Then when you need
>>>> the whitespace around a token, you peek into the token stream around
>>>> the point of interest.  It doesn't help if whitespace is really a
>>>> syntax feature and not just a token separator.
>>>>
>>>> --Loring
>>>>
>>>>
>>>> --- In antlr-interest at yahoogroups.com, "matthew ford"
>>>> <Matthew.Ford at f...> wrote:
>>>>> A bit too clever for me.  How do you write the parser rules?
>>>>> matthew
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "lgcraymer" <lgc at m...>
>>>>> To: <antlr-interest at yahoogroups.com>
>>>>> Sent: Thursday, November 11, 2004 5:51 PM
>>>>> Subject: [antlr-interest] Re: Antlr 3.0 spaces between tokens
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> The min/max of ASTMinMax gives you an index into the token stream.
>>>>>> Look for neighboring whitespace tokens.  By carrying the token
>>>>>> stream index around, you carry around references to associated
>>>>>> whitespace.  It's a rather clever trick for solving the whitespace
>>>>>> tracking problem.
>>>>>>
>>>>>> --Loring
>>>>>>
>>>>>> --- In antlr-interest at yahoogroups.com, "matthew ford"
>>>>>> <Matthew.Ford at f...> wrote:
>>>>>>> Perhaps I am missing the point of that article, but in my case I
>>>>>>> don't want to just keep the whitespace for printing.
>>>>>>>
>>>>>>> For some (not all) parser rules, whitespace is actually important
>>>>>>> for the parsing.
>>>>>>> So I want the parser to see all the whitespace for some rules and
>>>>>>> not others.
>>>>>>>
>>>>>>> So what I want is the Token.SKIP option on the parser side instead
>>>>>>> of on the lexer side, and controlled on a rule basis.
>>>>>>>
>>>>>>> matthew
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: "lgcraymer" <lgc at m...>
>>>>>>> To: <antlr-interest at yahoogroups.com>
>>>>>>> Sent: Thursday, November 11, 2004 5:32 PM
>>>>>>> Subject: [antlr-interest] Re: Antlr 3.0 spaces between tokens
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Take a look at
>>>>>>>>
>>>>>>>> <http://www.antlr.org/article/preserving.token.order/preserving.token.order.tml>
>>>>>>>>
>>>>>>>> It's hard to see how ANTLR 3 could do better.
>>>>>>>>
>>>>>>>> --Loring
>>>>>>>>
>>>>>>>> --- In antlr-interest at yahoogroups.com, "matthew ford"
>>>>>>>> <Matthew.Ford at f...> wrote:
>>>>>>>>> Hi Ter,
>>>>>>>>>
>>>>>>>>> Perhaps for Antlr 3.0 we can have a better means of handling
>>>>>>>>> white space.
>>>>>>>>>
>>>>>>>>> Antlr provides an ignore-whitespace capability that is appealing:
>>>>>>>>>
>>>>>>>>> WS : ( ' ' | '\t' | '\n' { newline(); } | '\r' )+
>>>>>>>>>      { $setType(Token.SKIP); }
>>>>>>>>>    ;
>>>>>>>>>
>>>>>>>>> but every time I try to use it I come across a situation where I
>>>>>>>>> really want/need the white space in the parser.
>>>>>>>>>
>>>>>>>>> So I end up having the lexer pass it back to the parser
>>>>>>>>> (or have a switch in the lexer that the parser uses to control
>>>>>>>>> the return of whitespace.  I know this is a no-no, but it has
>>>>>>>>> worked for me in some cases).
>>>>>>>>>
>>>>>>>>> The parser usually only needs to know about the whitespace in a
>>>>>>>>> few rules, but now has (WS)* all over the place to handle
>>>>>>>>> whitespace everywhere.
>>>>>>>>>
>>>>>>>>> Basically what I would like is to have the lexer pass all the
>>>>>>>>> whitespace back to the parser, and then in the parser be able to
>>>>>>>>> say
>>>>>>>>> a) for this rule ignore white space,
>>>>>>>>> or
>>>>>>>>> b) for this rule whitespace is important.
>>>>>>>>>
>>>>>>>>> Actually the second option is more likely.
>>>>>>>>>
>>>>>>>>> matthew
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Monty Zukowski" <monty at c...>
>>>>>>>>> To: <antlr-interest at yahoogroups.com>
>>>>>>>>> Sent: Thursday, November 11, 2004 3:38 AM
>>>>>>>>> Subject: Re: [antlr-interest] spaces between tokens
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Nov 10, 2004, at 7:39 AM, Anakreon wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> silverio.di at q... wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> I've a big problem.
>>>>>>>>>>>>
>>>>>>>>>>>> In my grammar, as in many others, whitespace is skipped in
>>>>>>>>>>>> the lexer, but I have some circumstances in which I need to
>>>>>>>>>>>> check that no spaces are present between tokens.
>>>>>>>>>>>>
>>>>>>>>>>>> Example:
>>>>>>>>>>>> WeekJobHour@Monday = 8
>>>>>>>>>>>>
>>>>>>>>>>>> would mean: assign 8 (hours) to parameter Monday of structure
>>>>>>>>>>>> WeekJobHour.
>>>>>>>>>>>> I would like my lexer to extract the following tokens:
>>>>>>>>>>>>
>>>>>>>>>>>> IDENT ATSIGN IDENT
>>>>>>>>>>>>
>>>>>>>>>>>> but my problem is to check that no WS is present between
>>>>>>>>>>>> IDENT and ATSIGN and between ATSIGN and IDENT, so
>>>>>>>>>>>>
>>>>>>>>>>>> WeekJobHour@Monday = 8        // is OK
>>>>>>>>>>>> WeekJobHour @Monday = 8       // is BAD
>>>>>>>>>>>> WeekJobHour@ Monday = 8       // is BAD
>>>>>>>>>>>> WeekJobHour  @ Monday = 8     // is BAD too!
>>>>>>>>>>>>
>>>>>>>>>>>> I could use the following lexer rule:
>>>>>>>>>>>>
>>>>>>>>>>>> STRUCT_PARAMETER
>>>>>>>>>>>>       :     ('A'..'Z' | 'a'..'z')+
>>>>>>>>>>>>             '@'
>>>>>>>>>>>>             ('A'..'Z' | 'a'..'z')+
>>>>>>>>>>>>       ;
>>>>>>>>>>>>
>>>>>>>>>>>> but in the parser, how can I extract the structure name
>>>>>>>>>>>> (WeekJobHour) and the structure parameter (Monday) from the
>>>>>>>>>>>> STRUCT_PARAMETER token?
>>>>>>>>>>>>
>>>>>>>>>>>> I think a similar issue is present in the C/C++ structure
>>>>>>>>>>>> construct.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you for your suggestions,
>>>>>>>>>>>> Silverio Diquigiovanni
>>>>>>>>>>> Make a class which implements TokenStream and uses the Lexer.
>>>>>>>>>>> In the nextToken method, if the lexer returns a token of type
>>>>>>>>>>> STRUCT_PARAMETER, split the token into 3 tokens, where the
>>>>>>>>>>> first would be of type STRUCT_NAME, the second STRUCT_AT and
>>>>>>>>>>> the third STRUCT_DAY, with the text of the tokens being
>>>>>>>>>>> WeekJobHour, @ and Monday respectively.
>>>>>>>>>>> Return the first token from the method and store the other 2.
>>>>>>>>>>> In the next 2 calls of nextToken, return the stored ones.
>>>>>>>>>>>
>>>>>>>>>>> Pass the implementor of TokenStream instead of your Lexer to
>>>>>>>>>>> the parser.
>>>>>>>>>>>
>>>>>>>>>>> Anakreon
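
The splitting stream described above might look roughly like this.  To stay self-contained the sketch does not use the ANTLR classes: SimpleToken and SimpleTokenStream are hypothetical stand-ins for ANTLR's Token and TokenStream, token types are plain strings, and end of input is signalled with null.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a token: just a type name and its text. */
final class SimpleToken {
    final String type;
    final String text;

    SimpleToken(String type, String text) {
        this.type = type;
        this.text = text;
    }

    @Override
    public String toString() {
        return type + ":" + text;
    }
}

/** Hypothetical stand-in for ANTLR's TokenStream interface. */
interface SimpleTokenStream {
    SimpleToken nextToken(); // returns null at end of input in this sketch
}

/** Wraps another stream and splits each STRUCT_PARAMETER token into three. */
final class SplittingStream implements SimpleTokenStream {
    private final SimpleTokenStream lexer;
    private final Deque<SimpleToken> pending = new ArrayDeque<>();

    SplittingStream(SimpleTokenStream lexer) {
        this.lexer = lexer;
    }

    @Override
    public SimpleToken nextToken() {
        // Hand out tokens stored by an earlier split first.
        if (!pending.isEmpty()) {
            return pending.poll();
        }
        SimpleToken t = lexer.nextToken();
        if (t == null || !t.type.equals("STRUCT_PARAMETER")) {
            return t; // pass everything else through unchanged
        }
        // The lexer rule guarantees exactly one '@' inside the text.
        int at = t.text.indexOf('@');
        pending.add(new SimpleToken("STRUCT_AT", "@"));
        pending.add(new SimpleToken("STRUCT_DAY", t.text.substring(at + 1)));
        return new SimpleToken("STRUCT_NAME", t.text.substring(0, at));
    }
}
```

The parser is then constructed with the SplittingStream instead of the lexer, so it only ever sees STRUCT_NAME, STRUCT_AT and STRUCT_DAY.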
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I agree with the above approach, and also read my ParserFilter
>>>>>>>>>> paper on my website,
>>>>>>>>>> http://www.codetransform.com/filterexample.html
>>>>>>>>>>
>>>>>>>>>> I would recommend an alternative approach, which would be to not
>>>>>>>>>> skip whitespace in the lexer.  Instead, discard it in the parser
>>>>>>>>>> filter.  That filter can still check that no whitespace occurs
>>>>>>>>>> before or after an @ between IDENTs.
>>>>>>>>>>
>>>>>>>>>> Alternately you could keep track of state in the lexer.  Set a
>>>>>>>>>> boolean variable in the makeToken() method if the token made was
>>>>>>>>>> WS.  To see what is coming after, inspect LA(1).  Assuming @ is
>>>>>>>>>> not used in any other way, you would have a rule similar to
>>>>>>>>>> this, where previousWasWhitespace is the variable set in
>>>>>>>>>> makeToken().
>>>>>>>>>>
>>>>>>>>>> AT: '@' { !previousWasWhitespace && LA(1)!=' ' && LA(1)!='\t' }? ;
>>>>>>>>>>
>>>>>>>>>> Monty
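
The first alternative above (keep whitespace in the lexer, discard it in a filter, but check the neighbours of @ first) can be sketched as below.  This is not the ParserFilter code from the paper; AtFilter is a hypothetical name, tokens are plain strings, and a whitespace token is any string made up only of whitespace.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical filter: drops whitespace but rejects whitespace next to '@'. */
final class AtFilter {
    static boolean isWs(String t) {
        return !t.isEmpty() && t.trim().isEmpty();
    }

    /**
     * Returns the tokens with whitespace removed.  Throws if an "@" token
     * has a whitespace token directly before or after it.
     */
    static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (isWs(t)) {
                continue; // discarded here instead of in the lexer
            }
            if (t.equals("@")) {
                boolean wsBefore = i > 0 && isWs(tokens.get(i - 1));
                boolean wsAfter = i + 1 < tokens.size()
                        && isWs(tokens.get(i + 1));
                if (wsBefore || wsAfter) {
                    throw new IllegalArgumentException(
                            "whitespace next to '@' at token " + i);
                }
            }
            out.add(t);
        }
        return out;
    }
}
```

With this in place the parser grammar needs no (WS)* anywhere, and "WeekJobHour @Monday" is rejected before the parser sees it.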
>>>>>>>>>>
>>>>>>>>>> ANTLR & Java Consultant -- http://www.codetransform.com
>>>>>>>>>> ANSI C/GCC transformation toolkit -- http://www.codetransform.com/gcc.html
>>>>>>>>>> Embrace the Decay -- http://www.codetransform.com/EmbraceDecay.html
>>>>>>>>>>





 
Yahoo! Groups Links

<*> To visit your group on the web, go to:
    http://groups.yahoo.com/group/antlr-interest/

<*> To unsubscribe from this group, send an email to:
    antlr-interest-unsubscribe at yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
    http://docs.yahoo.com/info/terms/
 





More information about the antlr-interest mailing list