[antlr-interest] Antlr first time user, help requested

Tue Jul 6 06:47:41 PDT 2010

Greetings!

On Mon, 2010-07-05 at 22:18 -0600, Andrew Robinson wrote:
> Sorry to say that ANTLR is driving me nuts, starting to really hate
> the tool, so I'd really appreciate some help on it before I give up on
> it.

Sorry for your frustration. I hope you hang in there. I think ANTLR is
worth its steep learning curve...

> 
> I am trying to parse a simple bit of text that looks something like this:
> 
> PageMetaData:
> name: This is a test name
> categories: category1, category2,
>   category3
> notes: These are notes
>   that the newlines are important, but not the leading whitespace
> 
> So the idea is the script always starts with "PageMetaData:\n"
> The name section should ignore leading whitespace after the colon, and
> take in any text to the end of the line, including white space
> The categories section is a comma separated set of camel-cased words
> that can be on one or more lines. Subsequent lines should lead with
> one or more spaces
> The notes section should allow multiple lines as long as they all
> start with leading white space.
> This is going to get a bit more complex, but you get the idea.
> 
> My grammar file is at the bottom of this email (not sure if this ML
> supports attachments). 

It does seem to support attachments.

> It fails miserably (keep running into
> mismatched token exceptions on the testName matching). Here is my
> input text:
> PageMetaData:
> name: This is a test name
> categories: category1, category2,
>   category3
> notes: These are notes
>   that the newlines are important, but not the leading whitespace
> 
> So after trying many different variations I tried a very simple
> grammar to step back to basics (or so I thought). Grammar:
> grammar Test;
> 
> prog
> : 'name:' NONBREAK NEWLINE? EOF!;
> 
> NONBREAK
> : (~('\n'|'\r'))+ ;
> 
> NEWLINE:'\r'? '\n' ;
> 
> Input (quotes included to show that there is a new line):
> "name: test
> "
> 
> In the 1.4 ANTLRWorks Interpreter I get a
> MismatchedTokenException(4!=6) with this setup. What the heck this is
> pretty basic?
> 

your NONBREAK lexer rule is gobbling up too much.

recall that ANTLR lexers are greedy: lexer rules match the longest
viable input string without any regard to the parsing context.

so your NONBREAK token happily eats up the entire input line, including
the leading "name:" portion. the parser never sees a 'name:' token; it
gets a NONBREAK token where it expected 'name:', hence the
MismatchedTokenException. (I agree that the ANTLRWorks Interpreter's
error message is less than helpful here...)
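
to make the greedy matching concrete, here is a rough Python sketch of a
longest-match lexer. it is only a simplified model (ANTLR's generated
lexer works differently inside), and the rule table and the tokenize
function are made up for illustration; the rule names just mirror the
small Test grammar:

```python
import re

# Simplified model of a greedy lexer (an assumption, not ANTLR's real
# algorithm): at each position every rule is tried, the longest match
# wins, and a tie goes to the rule listed first.
RULES = [
    ("NAME_LABEL", re.compile(r"name:")),      # the 'name:' literal
    ("NONBREAK", re.compile(r"[^\r\n]+")),     # (~('\n'|'\r'))+
    ("NEWLINE", re.compile(r"\r?\n")),         # '\r'? '\n'
]

def tokenize(text, rules=RULES):
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # keep only strictly longer matches, so earlier rules win ties
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens

print(tokenize("name: test\n"))
# prints [('NONBREAK', 'name: test'), ('NEWLINE', '\n')]
```

the 'name:' rule does match at offset 0, but NONBREAK matches more
characters at the same spot, so no NAME_LABEL token is ever produced.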

> I am also seeing problems with the windows EOL matching but not the
> unix matching in ANTLRWorks when I add a newline (using that above
> newline token), but I am on Ubuntu Linux, not sure what is going on
> there.

I do not use ANTLRWorks, so I cannot help much with that. I too use Ubuntu,
and I just use the command-line org.antlr.Tool (and of course emacs and
the antlr-mode).

> Would really appreciate some hints here.

Basically I think you should make your TEXT lexer rule below be a parser
rule, which also requires one additional lexer rule. so we would have
(all of this is pretty much un-tested...):

text : (~NEWLINE)+ ;

OTHER : . ;

the + is there because ~NEWLINE in a parser rule matches just one token
that is not NEWLINE, and OTHER only produces one-character tokens.

be SURE that the OTHER lexer rule is the very LAST rule in the .g file.
change all references to TEXT to be text.
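
here is a rough Python sketch of the idea behind the text/OTHER
suggestion, again using a simplified longest-match model rather than
ANTLR itself. it assumes text is written as a loop, (~NEWLINE)+, and the
names tokenize and parse_test_name are hypothetical. the label and
NEWLINE rules come first and the catch-all OTHER rule comes last, so
OTHER only picks up characters that nothing else claims:

```python
import re

# Hypothetical stand-ins for the grammar's tokens; OTHER is deliberately last.
RULES = [
    ("NAME_LABEL", re.compile(r"name:")),
    ("NEWLINE", re.compile(r"\r\n|\n|\r")),
    ("OTHER", re.compile(r".", re.DOTALL)),    # OTHER : . ;
]

def tokenize(text):
    """Longest match wins; on a tie the earlier rule wins."""
    pos, tokens = 0, []
    while pos < len(text):
        name, end = None, pos
        for rule, pattern in RULES:
            m = pattern.match(text, pos)
            if m and m.end() > end:
                name, end = rule, m.end()
        tokens.append((name, text[pos:end]))
        pos = end
    return tokens

def parse_test_name(tokens):
    """Mimics: testName : NAME_LABEL text ;  text : (~NEWLINE)+ ;"""
    if not tokens or tokens[0][0] != "NAME_LABEL":
        raise SyntaxError("expected NAME_LABEL")
    i, parts = 1, []
    # collect every token up to (not including) the next NEWLINE
    while i < len(tokens) and tokens[i][0] != "NEWLINE":
        parts.append(tokens[i][1])
        i += 1
    if not parts:
        raise SyntaxError("expected at least one non-NEWLINE token")
    # stripping the leading blanks here is a convenience of this sketch
    return "".join(parts).strip()

print(parse_test_name(tokenize("name: This is a test name\n")))
```

now the label survives as its own token, and the parser decides where
the free text ends, because it is the parser that knows the context.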

also, usually single line comments are given by:

// single-line comments
SL_COMMENT :
      '//'
      ( options { greedy=false; } : . )*
      ( '\r' | '\r\n' | '\n' ) // EOL untested under MS-Windows
      { $channel=HIDDEN; }
   ;

you might also need to make your COMMA rule a parser rule...

> Thank you
> 

Hope this helps
   -jbb

> 
> 
> Grammar file from above:
> grammar PageMetaData;
> 
> options {
>   output = AST;
> }
> 
> tokens {
> 	HEADER_TEXT = 'PageMetaData:' ;
> 	NAME_LABEL = 'name:' ;
> 	CATEGORIES_LABEL = 'categories:' ;
> 	TAGS_LABEL = 'tags:' ;
>   NOTE_LABEL = 'note:' ;
>   AUTOMATED_TESTS_LABEL = 'automated-tests:' ;
>   AUTOMATED_TEST_LABEL = 'automated-test:' ;
>   COMMENT;
>   CAMELCASE;
>   FILE;
>   COMMA;
>   TEXT;
> }
> 
> COMMENT
>   :	'//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>   ;
> 
> NEWLINE	: ('\r' '\n' | '\n' | '\r' );
> 
> CAMELCASE
> 	:	('A'..'Z'|'a'..'z'|'0'..'9')+;
> 
> FILE
> 	:	('A'..'Z'|'a'..'z'|'0'..'9'| '_' | '-' | '.' | '/')+;
> 
> COMMA
> 	:	',' (' '+ | NEWLINE ' '+)?;
> 
> TEXT : (~('\r'|'\n')+);
> 
> definition
> 	: (NEWLINE | ' ')* header NEWLINE
>   testName NEWLINE
>   categories NEWLINE
>   (tags NEWLINE)?
>   (note NEWLINE)?
>   (automatedTests NEWLINE)?
>   (automatedTest NEWLINE)?
>   EOF! ;
> 	
> header
> 	: HEADER_TEXT
> 	;
> 
> testName : NAME_LABEL TEXT;
> 
> categories
> 	: CATEGORIES_LABEL CAMELCASE (COMMA CAMELCASE)* ;
> 
> tags
> 	:	'tags:' (' '*)! CAMELCASE (COMMA CAMELCASE)* ;
> 
> note
> 	: 	'note:' (' '*)! TEXT (NEWLINE+ TEXT)* ;
> 
> automatedTests
> 	:	'tests:' (' '*)! FILE (COMMA FILE)* ;
> 
> automatedTest
> 	:	'test:' (' '*)! TEXT ;
>