[antlr-interest] ANTLR equivalent of JavaCC Lexer behaviour?

Mon Mar 27 16:05:03 PST 2006

I'm afraid your only options are to increase k to 2, so that the parser
can distinguish "is" from "is" "not" in separate rules, or to left-factor
as you've suggested. I think you can specify k=2 as a local option on
particular rules.
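
To illustrate why one token of lookahead cannot separate the two rules,
here is a minimal Java sketch. It is not ANTLR's generated code: it
assumes the input has already been split into word tokens, the names
LookaheadSketch, Rule and predict are made up for illustration, and it
assumes that with k=1 the first alternative is simply taken.

```java
import java.util.Arrays;
import java.util.List;

final class LookaheadSketch {
    // Hypothetical stand-ins for the two parser rules discussed in the thread.
    enum Rule { EQUALS_CONDITION, NOT_EQUALS_CONDITION, NO_MATCH }

    // Predict which rule to enter, inspecting at most k tokens starting at pos.
    static Rule predict(List<String> tokens, int pos, int k) {
        if (pos >= tokens.size() || !tokens.get(pos).equals("is")) {
            return Rule.NO_MATCH;
        }
        if (k < 2) {
            // Both rules begin with "is", so one token cannot tell them apart.
            // Assumption: the first alternative is chosen in that case.
            return Rule.EQUALS_CONDITION;
        }
        // With two tokens of lookahead, the token after "is" decides.
        boolean nextIsNot = pos + 1 < tokens.size()
                && tokens.get(pos + 1).equals("not");
        return nextIsNot ? Rule.NOT_EQUALS_CONDITION : Rule.EQUALS_CONDITION;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("if", "Item", "is", "not", "blank");
        System.out.println(predict(input, 2, 1)); // EQUALS_CONDITION
        System.out.println(predict(input, 2, 2)); // NOT_EQUALS_CONDITION
    }
}
```

With k=1 the sketch commits to the equals rule and would then trip over
"not", which matches the "unexpected token: not" error quoted below.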

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of
Richard.Kennard at mail.thomson.com
Sent: Tuesday, 28 March 2006 11:02 AM
To: antlr-interest at antlr.org
Subject: RE: [antlr-interest] ANTLR equivalent of JavaCC Lexer
behaviour?

Robert,

Thank you for your response! I tried your suggestion, but I'm afraid it
only succeeded in moving the problem :(

My grammar now looks like...

	equalsCondition :
		"is" ("a")? relativeParameter
		{
			addEqualsCondition();
		}
	;

	notEqualsCondition :
		"is" "not" relativeParameter
		{
			addNotEqualsCondition();
		}		
	;

...and when parsing something like...

	if Item is not blank

...it gets confused saying...

	line 5:12: unexpected token: not

...because, presumably, it is going down the path of the first
equalsCondition upon seeing the "is", not the second notEqualsCondition.
I'd really rather avoid having to do something like...

	equalsOrNotEqualsCondition :
		"is" (
			( "a" )? relativeParameter
			{
				addEqualsCondition();
			}
			|
			"not" relativeParameter
			{
				addNotEqualsCondition();
			}
			)
	;

...especially when JavaCC handles this so much more compactly.

Your suggestions are much appreciated,

Richard.

-----Original Message-----
From: PATERSON, Robert [mailto:r.paterson at ioof.com.au]
Sent: Tuesday, 28 March 2006 10:37 AM
To: Kennard, Richard
Subject: RE: [antlr-interest] ANTLR equivalent of JavaCC Lexer
behaviour?

The way to do this in ANTLR would be to leave only your IDENTIFIER token
in the Lexer and then refer to your literal tokens directly in the
parser. You may need to set testLiterals=true in the options, but I think
this is the default. What ANTLR does is maintain a list of any string
literals used in the grammar, and whenever it finds a token it tests the
text against this list of literals to see if it's just a plain old
identifier or if it's a special keyword.

Something like:

	class BusinessParser extends Parser;

	options
	{
		k=1;
	}

	equals :
		"is" ("the" "same" "as")?
		|
		"the" "same" "as"
		|
		"are" ("the" "same" "as")?
		|
		"of"
	;

	class BusinessLexer extends Lexer;

	options
	{
		k=1;
	}

	IDENTIFIER: ('a'..'z'|'A'..'Z'|'_'|'$')
		('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'$')*;
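
The literal test described above can be pictured with a small Java
sketch. This is only an illustration of the idea, not ANTLR's actual
implementation: the class LiteralTableSketch, its methods and the token
type numbers are all hypothetical, and matching is assumed to be exact
(case-sensitive).

```java
import java.util.HashMap;
import java.util.Map;

final class LiteralTableSketch {
    // Hypothetical token type number for a plain identifier.
    static final int IDENTIFIER = 1;

    // Maps the text of each string literal used in the parser to its own type.
    private final Map<String, Integer> literals = new HashMap<>();
    private int nextType = IDENTIFIER + 1;

    // Register a literal such as "is" or "not"; repeats keep their first type.
    void addLiteral(String text) {
        if (!literals.containsKey(text)) {
            literals.put(text, nextType++);
        }
    }

    // After the lexer has matched an identifier-shaped word, decide whether
    // it is really a keyword by looking its text up in the literals table.
    int tokenType(String text) {
        Integer type = literals.get(text);
        return type != null ? type : IDENTIFIER;
    }

    public static void main(String[] args) {
        LiteralTableSketch table = new LiteralTableSketch();
        for (String word : new String[] {"is", "the", "same", "as", "are", "of"}) {
            table.addLiteral(word);
        }
        System.out.println(table.tokenType("is") != IDENTIFIER);     // true
        System.out.println(table.tokenType("Status") == IDENTIFIER); // true
    }
}
```

The point of the design is that the lexer needs only one rule with k=1;
deciding how the keywords combine is left to the parser.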

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of
Richard.Kennard at mail.thomson.com
Sent: Tuesday, 28 March 2006 10:21 AM
To: antlr-interest at antlr.org
Subject: [antlr-interest] ANTLR equivalent of JavaCC Lexer behaviour?

Dear All,

I am looking to migrate an existing grammar from JavaCC to ANTLR, but am
having difficulty with the Lexer.

Specifically, my grammar is very 'English-y', and while JavaCC appears
to employ (I'm guessing here) a rather forgiving 'longest match' Lexer,
ANTLR warns me to specify an actual 'k=x' lookahead number. I have found
this number needs to be pretty large (17) to stop the warning, at which
point ANTLR seems to crash (and besides,
http://www.antlr.org/doc/options.html warns against it, saying 'at large
depths will include almost everything').

Here is a snippet of my working JavaCC grammar...

	PARSER_END( BusinessLanguage )

	TOKEN :
	{
		< EQUALS: "is" | "is the same as" | "the same as" |
			"are" | "are the same as" | "of" >
	|	< NOT_EQUALS: "is not" | "is not the same as" >
	|	< LESS_THAN: "is less than" >
	|	< IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
	}  	

...and the sort of thing it parses...

	if Status is "Closed" then error "Already closed"
	if Version is less than 1 then error "Version cannot be less than 1"
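
For readers unfamiliar with it, a 'longest match' lexer can be sketched
in a few lines of Java. This is a rough illustration under stated
assumptions, not JavaCC's implementation: the class LongestMatchSketch
is hypothetical, LETTER and DIGIT are assumed to be ASCII letters and
digits, ties are assumed to go to the token declared first, and numbers
and quoted strings are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongestMatchSketch {
    // Each row is a token kind followed by its literal alternatives,
    // in the same order as the TOKEN declarations above.
    private static final String[][] LITERAL_KINDS = {
        {"EQUALS", "is", "is the same as", "the same as", "are", "are the same as", "of"},
        {"NOT_EQUALS", "is not", "is not the same as"},
        {"LESS_THAN", "is less than"},
    };
    // Assumed definition of <LETTER> (<LETTER>|<DIGIT>)*.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z][A-Za-z0-9]*");

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) { pos++; continue; }
            String bestKind = null;
            int bestLen = 0;
            for (String[] kind : LITERAL_KINDS) {
                for (int i = 1; i < kind.length; i++) {
                    // Only a strictly longer match replaces the current best,
                    // so earlier declarations win ties.
                    if (input.startsWith(kind[i], pos) && kind[i].length() > bestLen) {
                        bestKind = kind[0];
                        bestLen = kind[i].length();
                    }
                }
            }
            Matcher m = IDENTIFIER.matcher(input).region(pos, input.length());
            if (m.lookingAt() && m.end() - pos > bestLen) {
                bestKind = "IDENTIFIER";
                bestLen = m.end() - pos;
            }
            if (bestKind == null) {
                throw new IllegalArgumentException("No token matches at offset " + pos);
            }
            out.add(bestKind + "(" + input.substring(pos, pos + bestLen) + ")");
            pos += bestLen;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Status is not Closed"));
        System.out.println(tokenize("Version is less than Limit"));
    }
}
```

Every candidate is tried at each position and the longest one wins,
which is why no lookahead depth has to be declared up front.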

...and here is what I tried in ANTLR...

	class BusinessLexer extends Lexer;

	options
	{
		k=17;
	}

	EQUALS: "is" | "is the same as" | "the same as" | "are" |
		"are the same as" | "of";
	NOT_EQUALS: "is not" | "is not the same as";
	LESS_THAN: "is less than";
	IDENTIFIER: ('a'..'z'|'A'..'Z'|'_'|'$')
		('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'$')*;

Clearly there is a lot of contention in this grammar, but is there a way
to get the equivalent JavaCC behaviour? I would rather not have to code
something along the lines of...

	("is" ("not" | "less than")) | ("are" ( "the same as" ))

Your wisdom is most appreciated :)

Regards,

Richard.
