[antlr-interest] JavaScript grammar

Sat Mar 29 22:43:00 PDT 2008

The approach I took was to keep LT tokens on the default channel, for the
benefit of the few rules where line terminators are important. For example,
the ECMAScript spec defines a return statement as follows:

'return' [no LineTerminator here] Expression[optional] ';'

In the ANTLR grammar, this becomes:

returnStatement : 'return' expression? (LT | ';')! ;

You can see that the statement may end either with a semicolon or with a
line terminator, which the ECMAScript spec permits. This appears to work
just fine. The only part I find annoying is that because the LT tokens are
not on the hidden channel, all of the other rules need to deal with them
too. For example, the ifStatement rule looks like this:

ifStatement : 'if' LT!* '(' LT!* expression LT!* ')' LT!* statement
              (LT!* 'else' LT!* statement)? ;

when it would be much clearer if it looked like this:

ifStatement : 'if' '(' expression ')' statement ('else' statement)? ;

This affects all of the parser rules, which makes the grammar less readable.
I did think about performing some filtering of the token stream between the
lexing and parsing phases, but the rules for automatic semicolon insertion
defined by the ECMAScript spec are a bit nasty. It's just too difficult to
actually determine where the virtual semicolons should be, without the
grammatical context that only the parsing stage can provide. In any case, I
think the grammar would be less useful if it required any special runtime
tweaks to make it work.

I think what would really solve the problem would be to have the LT tokens
on the hidden channel by default, and then dynamically switch them to the
default channel only for those rules that require it. I'm afraid I'm not yet
familiar enough with channels and token streams to know if this is even
possible. My guess is that it probably isn't, but it can't hurt to ask.
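
To make the idea concrete, here is a minimal sketch in plain Python rather
than ANTLR. Everything in it (Token, ChannelTokenStream, lex, parse_return,
the channel numbers) is a hypothetical stand-in, not ANTLR's API. The LT
tokens stay on a hidden channel, and only the return rule asks the stream
whether a hidden LT precedes the next visible token:

```python
import re
from dataclasses import dataclass

DEFAULT, HIDDEN = 0, 1  # arbitrary channel numbers chosen for this sketch


@dataclass
class Token:
    type: str
    text: str
    channel: int = DEFAULT


class ChannelTokenStream:
    """Hands only default-channel tokens to the parser but keeps hidden ones."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = self._skip_hidden(0)

    def _skip_hidden(self, i):
        while i < len(self.tokens) and self.tokens[i].channel != DEFAULT:
            i += 1
        return i

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def consume(self):
        tok = self.peek()
        self.pos = self._skip_hidden(self.pos + 1)
        return tok

    def line_terminator_ahead(self):
        """True if a hidden LT sits just before the next visible token."""
        i = self.pos - 1
        while i >= 0 and self.tokens[i].channel != DEFAULT:
            if self.tokens[i].type == "LT":
                return True
            i -= 1
        return False


def lex(source):
    """Toy lexer: newlines become hidden LT tokens, other whitespace is dropped."""
    tokens = []
    for piece in re.findall(r"\n|[^\s;]+|;", source):
        if piece == "\n":
            tokens.append(Token("LT", piece, HIDDEN))
        else:
            tokens.append(Token("WORD", piece))
    return tokens


def parse_return(stream):
    """'return' [no LineTerminator here] expression? (LT | ';')"""
    if stream.consume().text != "return":
        raise SyntaxError("expected 'return'")
    expr = None
    nxt = stream.peek()
    # Only this rule looks at the hidden channel; other rules never see LT.
    if nxt is not None and nxt.text != ";" and not stream.line_terminator_ahead():
        expr = stream.consume().text  # stand-in for a real expression rule
    nxt = stream.peek()
    if nxt is not None and nxt.text == ";":
        stream.consume()
    elif nxt is not None and not stream.line_terminator_ahead():
        raise SyntaxError("expected ';' or a line terminator")
    return expr
```

With this, parsing "return x;" yields "x", while parsing "return\nx;" yields
None and leaves x as the next token, because the hidden LT ends the
statement. A rule like ifStatement would never call line_terminator_ahead(),
so it would not need any LT!* noise.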

Chris

P.S. I've just remembered that although the grammar compiles just fine, I
couldn't get it to work with the ANTLRWorks debugger. It seems like it's the
huge Identifier lexer rule that causes the problem - I had to temporarily
replace it with something simpler in order to persuade the debugger to work.
I presume this is a bug in ANTLRWorks - if anyone is interested in this, I
can provide more information, perhaps off-line.

On 30/03/2008, Benjamin Shropshire <shro8822 at vandals.uidaho.edu> wrote:
>
> Chris Lambrou wrote:

[snip]

> >    1. Unlike other whitespace characters, line separators (represented
> >       by my LT token type) are important in JavaScript, as you're
> >       allowed to use them to terminate statements instead of the usual
> >       terminating semicolon character. As a result, I cannot 'hide'
> >       line separators like other whitespace characters, and my grammar
> >       is peppered with LT!* sequences. Is there a way to place the LT
> >       tokens on the hidden channel, and then optionally reveal them
> >       only in the few rules that require it?

[snip]

> It is most likely not kosher, but if you can look at an LT in a sequence
> of tokens and test whether it is a virtual semicolon (without knowing
> anything but the adjoining tokens), then some sort of preprocessor (I'm
> thinking: lex, filter tokens into a new lex stream, parse) might be able
> to convert what is needed. You might call the filter a TokenSedStream or
> something like that. I did something like that (but with the text) to
> deal with indentation sensitivity in my only attempt with ANTLR. As I
> said, not kosher, but if all else fails "You gotta go with what works."
> (Law #37)
>
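
A rough sketch of the lex / filter / parse idea described above, in plain
Python rather than ANTLR. The Token class, the RESTRICTED set and
filter_line_terminators are hypothetical names invented for illustration. It
only handles the keyword productions that the ECMAScript spec marks with
[no LineTerminator here]; it does not attempt full automatic semicolon
insertion, which needs grammatical context that adjoining tokens alone do
not provide:

```python
from dataclasses import dataclass


@dataclass
class Token:
    type: str
    text: str


# Keywords whose productions are marked "[no LineTerminator here]".
RESTRICTED = {"return", "break", "continue", "throw"}


def filter_line_terminators(tokens):
    """Yield tokens with each LT either turned into a virtual ';' or dropped."""
    previous = None
    for tok in tokens:
        if tok.type == "LT":
            # Decide using only the adjoining (previous) visible token.
            if previous is not None and previous.text in RESTRICTED:
                previous = Token("SEMI", ";")  # virtual semicolon
                yield previous
            continue  # every other LT is simply removed
        previous = tok
        yield tok
```

The parser would then consume the filtered stream and could be written
without any LT references. A case such as two expression statements
separated only by a newline is not covered by this filter.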