[antlr-interest] distinction between newline and ws

Sun Oct 21 03:10:52 PDT 2007

Hi,

thanks for the answers so far. I found something out, too: if I just swap
the lines for NEWLINE and WS, the grammar behaves differently and NEWLINE
is no longer matched. So I guess the order in which the rules appear in
the grammar makes a difference. Still, I would have thought that the
correct way to define NEWLINE and WS would be:

NEWLINE     :     '\r'? '\n';
WS    :     (' '|'\t')+ {skip();};

That is, I do not put \n and \r in WS.
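
To illustrate why keeping \n and \r out of WS removes the dependence on rule order, here is a small Python sketch. It is only a toy model, assuming a lexer that tries the rules in declaration order and takes the first one that matches; it is not ANTLR's actual prediction algorithm. The names tokenize, SKIPPED, and the regex stand-ins for the rules are hypothetical.

```python
import re

# Hypothetical regex stand-ins for the lexer rules (not generated by ANTLR).
INT = ("INT", re.compile(r"[0-9]+"))
NEWLINE = ("NEWLINE", re.compile(r"\r?\n"))
WS = ("WS", re.compile(r"[ \t]+"))  # no \n or \r in here

SKIPPED = {"WS"}  # models the {skip();} action on WS


def tokenize(text, rules):
    """Toy lexer: try rules in order, first match wins, skip WS tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in rules:
            match = pattern.match(text, pos)
            if match:
                if name not in SKIPPED:
                    tokens.append(name)
                pos = match.end()
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return tokens


sample = "12 \t\r\n"
newline_first = tokenize(sample, [INT, NEWLINE, WS])
ws_first = tokenize(sample, [INT, WS, NEWLINE])
print(newline_first)
print(ws_first)
```

Because the two rules no longer share any characters, both orderings should yield the same token names in this model.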

Perhaps I should simply read on in the book ;o)

________________________________________
From: Joseph Gentle [mailto:josephg at cse.unsw.edu.au]
Sent: Sunday, October 21, 2007 02:58
To: antlr-interest at antlr.org
Subject: Re: [antlr-interest] distinction between newline and ws

[forgot to reply all]

I can't find the documentation for it, but ANTLR does seem to have token
matching precedence rules.

Have a play with it - write a tokeniser like this:

test : ( TEXT | NEWLINE | WS )*;
TEXT : 'x'+;

NEWLINE     :     '\r'? '\n';
WS    :     (' '|'\t'|'\n'|'\r')+;

and pass it some strings with newlines and whitespace and whatnot. Have a
look at the token stream generated. I've got a feeling that ANTLR prefers
token rules that appear earlier in the grammar over later ones. Using your
rules, I expect that a line of text followed immediately by a newline will
become TEXT NEWLINE, whereas a line of text followed by whitespace and then
a newline will be TEXT WS. This is because by default the + in the WS rule
is greedy and will consume the newline as well, if it can.
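
The expectation above can be sketched with a toy lexer in Python. This is only a simplified model, assuming the rules are tried in declaration order and the first match wins; it is not ANTLR's real prediction algorithm, and the names RULES and tokenize are hypothetical.

```python
import re

# Toy model only: regex stand-ins for TEXT, NEWLINE and WS, in grammar order.
RULES = [
    ("TEXT", re.compile(r"x+")),
    ("NEWLINE", re.compile(r"\r?\n")),
    ("WS", re.compile(r"[ \t\n\r]+")),  # greedy: swallows newlines too
]


def tokenize(text, rules=RULES):
    """Return the token names for text; earlier rules take precedence."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in rules:
            match = pattern.match(text, pos)
            if match:
                tokens.append(name)
                pos = match.end()
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return tokens


print(tokenize("xxx\n"))    # text directly followed by a newline
print(tokenize("xxx  \n"))  # text, then spaces, then a newline

# Same input with WS declared before NEWLINE.
swapped = [RULES[0], RULES[2], RULES[1]]
print(tokenize("xxx\n", swapped))
```

In this model the first call gives TEXT NEWLINE and the second TEXT WS, and with the two rules swapped the newline ends up inside a WS token, so NEWLINE is never produced.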

Have a play!

-J

Sven Busse wrote: 
Hello,

I am very new to ANTLR and language recognition. I bought the book
by Terence Parr and am currently working through the first
example, the calculator. Unfortunately, there is already something I
don’t understand. The grammar looks like this:

grammar Expr;

prog  :     stat+ ;

stat  :     expr NEWLINE
      |     ID '=' expr NEWLINE
      |     NEWLINE
      ;

expr  :     multExpr (('+'|'-') multExpr)* ;

multExpr:   atom ('*' atom)* ;

atom  :     INT
      |     ID
      |     '(' expr ')'
      ;

ID    :     ('a'..'z'|'A'..'Z')+;
INT   :     '0'..'9'+;
NEWLINE     :     '\r'? '\n';
WS    :     (' '|'\t'|'\n'|'\r')+ {skip();};

My question now is: how does ANTLR know that “\n” should match NEWLINE
instead of WS (which would mean it gets skipped)? I would have thought this
grammar is ambiguous, but apparently it isn’t. Why not?

Thank you
Sven