[antlr-interest] antlr v4 wish list

Sat Mar 26 10:21:24 PDT 2011

Re: Scanner-less parsing

The Parse::RecDescent module for Perl5 implements parsers without a 
separate scanner and is what comes to mind when I hear the phrase 
scanner-less. If you were to retain a scanner, I think the 
characteristic that could provide the same function is to provide 
context to the scanner so that when you go to get the next token, the 
scanner only considers the type of token next expected in the current 
alternative in the production. This way one could have multiple tokens 
that might all match some text (but not others) and have the context of 
the production resolve which one it was, (so long as it matched, of course.)

For instance, I might want to have separate token types for binary, hex 
and decimal digits, but a scanner can not tell which of the three it is 
if the input is '0' or '1'. Hex overlaps with decimal for 0-9 and 
overlaps with binary for 0-1 and potentially 'x', 'X', 'z' and 'Z' for 
some implementations. There absolutely are other ways to handle this, 
but there is a great deal of flexibility that comes from permitting 
context to guide the scanner.

Having to work through the unambiguity of lexer patterns was something 
that was unexpected when I recently began working with ANTLR. I suspect 
that this would not be the case for someone who is more accustomed to 
using Lex/Yacc or comes from a more traditional or academic 
parser-building background.

Best regards,
Jason Doege

On 3/25/2011 9:19 AM, The Researcher wrote:
> 
>
> On Thu, Mar 24, 2011 at 2:32 PM, The Researcher<researcher0x00 at gmail.com>wrote:
>
>>
>> On Thu, Mar 24, 2011 at 1:23 PM, Terence Parr<parrt at cs.usfca.edu>  wrote:
>>
>>> added
>>>
>>> * Tree parser error handling should skip subtrees not nodes; these are
>>> programming errors not input errors.  The flat stream makes it hard to
>>> resync.
>>>
>>> Ter
>>>   On Mar 24, 2011, at 2:07 AM, Iztok Kavkler wrote:
>>>
>>>>> Howdy, I'm going to start augmenting ANTLR v3 significantly to create
>>> v4. The goal is backward compatibility; any new functionality, of course,
>>> will require altering or augmenting your grammars to take advantage of it.
>>> Here is my potential list of updates:
>>>>> http://www.antlr.org/wiki/display/ANTLR4/ANTLR+v4+Wish+list
>>>>>
>>>>> Anything to add or comment on?
>>>>>
>>>>> Ter
>>>>>
>>>>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>>>>> Unsubscribe:
>>> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>>>> A new error recovery mode for tree parsing:
>>>> When parsing ASTs, the ordinary error recovery strategies based on token
>>>> deletion/insertion are completely useless, because there are no man-made
>>>> syntax errors. In my experience, what you really want to do is the
>>>> following: assume that you have an error handler attached to some rule
>>>> and an error happens somewhere in the subtree of the node parsed by that
>>>> rule. When the handler catches an error, the parser must skip the
>>>> remainder of that subtree, otherwise the parser position is not
>>>> consistent with the grammar position anymore. In AST implementations
>>>> that are based on pointers between nodes this happens automatically, but
>>>> Antlr's representation as a flat list of nodes with UP and DOWN tokens
>>>> makes it requires some work - the parser has to keep track of the
>>>> current node's depth and skip the appropriate number of UP nodes
>>>> whenever an error is caught.
>>>>
>>>> Iztok
>>>>
>>>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>>>> Unsubscribe:
>>> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>>>
>>>
>>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>>> Unsubscribe:
>>> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>>>
>>
>> 1. If my concept of scannerless parsing is the same as yours, then in the
>> generated code for a rule allow the true for "do {<rule code>  }
>> while(true)" to be an attribute of the rule, i.e exit. Obviously the value
>> would be true unless changed by a user.This would allow the user to have
>> control of when to exit the rule. By turning true into a attribute of the
>> rule, this allows for more control than gated semantic predicates.
>>
>> Based on by concept of scannerless parsing, there is no lexer and the
>> parser drives the reading of the tokens from the intput stream. The input
>> stream does not generate the tokens ahead of time but only when needed. In a
>> quick proof of concept I had the token type passed from the parser as a
>> generic parameter, allowing the redefinition of the token returned by the
>> token stream. There were no pre-defined tokens values; they were dynamically
>> generated.To get the proof of concept to work required having a
>> cross-reference table between token types and token values.
>>
>> 2. If ANTLR 4 will allow the reading of binary data streams, then please
>> don't put char and line pos in a base class. There could be one inherited
>> classes that defines line and char pos, and another inherited class that
>> defines offset.
>>
>> Thanks
>>
>> Eric
>>
>>
> After finding Scannerless Generalized LR (SGLR),  which I believe is closer
> to your meaning, my concept of scannerless parsing is different enough that
> the reference should should be disregarded. I still submit the request for a
> rule to have an exit attribute.
>
> Thanks, Eric
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>