[antlr-interest] JavaScript grammar

Tue Apr 1 07:41:27 PDT 2008

Patrick,

Thanks for the suggestion. I'll give it a try. Could you please elaborate on
how to ensure that there are no line terminators in those rules which
disallow them? It's a bit disappointing that these kinds of manipulation tie
the grammar to a specific language, i.e. Java, but I think that's
unavoidable if I want to eliminate the excessive LT!* elements from all of
the rules. It's a simple trade-off, I guess.

What problem are you having with unicode identifiers?  I don't think there
was any problem with Greg's original grammar (apart from the ANTLRWorks
debugger issue I mentioned). As for regular expression literals, I'm
inclined to simply treat them as separate Regex tokens without any further
treatment, and leave their analysis to a separate grammar. Interestingly
enough, whilst the ECMAScript spec has a whole section on the composition of
regular expression literals, it doesn't appear to incorporate them into the
rest of the grammar - not that I could see, anyway. I think they can be
included as an alternative in the literal rule, which then becomes

literal : 'null' | 'true' | 'false' | StringLiteral | NumericLiteral | Regex

Regards,

Chris

On 31/03/2008, Patrick Hulsmeijer <phulsmeijer at xebic.com> wrote:
>
>  Chris,
>
>
>
> I've written an complete ECMAScript compiant grammar recently.
>
> I've struggled with the following issues:
>
> -          Unicode identifiers
>
> -          Regular expression literals
>
> -          Semicolon insertion
>
>
>
> For the semicolon insertion issue I've taken the following approach.  The
> line terminators are left on the hidden channel. In the parser the semicolon
> (e.g. in statements) is defined as a rule. In this rule I scan the token
> stream for line terminators and promote the first encountered to the default
> channel. Line terminator are also an alternative in this rule. Something
> like this:
>
>
>
> semic:
>
> @init
>
> {
>
>                 int marker = input.mark();
>
>                 promoteEOL();
>
> }
>
>                 : SEMIC
>
>                 | EOL
>
>                 | RBRACE { input.rewind(marker); }
>
>                 | EOF
>
>                 ;
>
>
>
> The grammar also scans for line terminators in places where the
> specification states there are none allowed (e.g. between "return" and the
> following optional expression in the return statement).
>
>
>
> Hope this helps.
>
>
>
> Regards,
>
> patrick.
>
>
>
> *From:* antlr-interest-bounces at antlr.org [mailto:
> antlr-interest-bounces at antlr.org] *On Behalf Of *Chris Lambrou
> *Sent:* zondag 30 maart 2008 7:43
> *To:* antlr-interest at antlr.org
> *Subject:* Re: [antlr-interest] JavaScript grammar
>
>
>
>
> The approach I took was to keep LT tokens on the default channel, for the
> benefit of the few rules where line terminators are important. For example,
> the ECMAScript spec defines a return statement as follows:
>
> 'return' [no LineTerminator here] Expression[optional] ';'
>
> In the ANTLR grammar, this becomes:
>
> returnStatement : 'return' expression (LT | ';')! ;
>
> You can see that the statement may end either with a semicolon or with a
> line terminator, which the ECMAScript spec permits. This appears to work
> just fine. The only part I find annoying is that because the LT tokens are
> not on the hidden channel, all of the other rules need to deal with them
> too. For example, the ifStatement rule looks like this:
>
> ifStatement : 'if' LT!* '(' LT!* expression LT!* ')' LT!* statement (LT!*
> 'else' LT!* statement)? ;
>
> when it would be much clearer if it looked like this:
>
> ifStatement : 'if' '(' expression ')' statement ('else' statement)? ;
>
> This affects all of the parser rules, which makes the grammar less
> readable. I did think about performing some filtering of the token stream
> between the lexing and parsing phases, but the rules for automatic semicolon
> insertion defined by the ECMAScript are a bit nasty. It's just too difficult
> to actually determine where the virtual semicolons should be, without the
> grammatical context that only the parsing stage can provide. In any case, I
> think the grammar would be less useful if it required any special runtime
> tweaks to make it work.
>
> I think what would really solve the problem would be to have the LT tokens
> on the hidden channel by default, and then dynamically switch them to the
> default channel only for those rules that require it. I'm afraid I'm not yet
> familiar enough with channels and token streams to know if this is even
> possible. My guess is that it probably isn't, but it can't hurt to ask.
>
> Chris
>
> P.S. I've just remembered that although the grammar compiles just fine, I
> couldn't get it to work with the ANTLRWorks debugger. It seems like it's the
> huge Identifier lexer rule that causes the problem - I had to temporarily
> replace it with something simpler in order to persuade the debugger to work.
> I presume this is a bug in ANTLRWorks - if anyone is interested in this, I
> can provide more information, perhaps off-line.
>
>  On 30/03/2008, *Benjamin Shropshire* <shro8822 at vandals.uidaho.edu> wrote:
>
> Chris Lambrou wrote:
>
>
> [snip]
>
>
>
> >    1. Unlike other whitespace characters, line separators (represented
>
> >       by my LT token type) are important in JavaScript, as you're
> >       allowed to use them to terminate statements instead of the usual
> >       terminating semicolon character. As a result, I cannot 'hide'
> >       line separators like other whitespace characters, and my grammar
> >       is peppered with LT!* sequences. Is there a way to place the LT
> >       tokens on the hidden channel, and then optionally reveal them
> >       only in the few rules that require it?
>
>
> [snip]
>
>
>
> It is most likely not kosher, but if you can look at an LT in a sequence
> of tokens test if it is a virtual semicolon (without knowing anything
> but the adjoining tokens) then some sort of preprocessor (I'm thinking:
> lex, filter tokens into new lex stream, parse) might be able to convert
> what is needed. You might call the filter a TokenSedStream or something
> like that. I did something like that (but with the text) to deal with
> indentation sensitivity in my only attempt with ANTLR. As I said, not
> kosher, but if all else fails "You gotta go with what works." (Law #37)
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20080401/9d264420/attachment-0001.html