[antlr-interest] Missing something basic about lexer tokens

Fri Nov 19 17:56:06 PST 2010

Greetings! 

On Fri, 2010-11-19 at 18:58 -0500, Sheila M. Morrissey wrote:
> Hello,
> 
> I am working on a recognizer that processes a text file, each line of which starts with one of a short list of about 20 characters (mostly either upper case or lower case letters, a few special chars), immediately followed by a "name" (chars or dash), a space or 2, and then various space-delimited stretches of text made up of any ASCII characters except newline, followed by newline.
> 
> The first letter is significant - it indicates what sort of "command" each line is.
> 
> Here's a simplified version of the grammar, with just one of these "commands" specified:
> 
> grammar ElementAttributes;
> 
> options {
>   language = Java;
> }
> @parser::header {}
> @lexer::header {}
> 
> elementAttributes : elementAttributeCommand+ EOF;
> 
> /**
> e.g.
> Aname IMPLIED
> */
> 
> elementAttributeCommand : ACMD NAME SPACE+ ATTRTYPE NEWLINE;
> 
> ATTRTYPE : ('IMPLIED'|'CDATA'|'NOTATION'|'ENTITY'|'TOKEN'|'ID'|'DATA');
> ACMD : 'A';
> NEWLINE:    '\r'? '\n';
> SPACE:      ' ';
> NAME : (NAMESTARTCHAR NAMECHAR*);
> 
> fragment LOWERCASELETTER : ('a'..'z');
> fragment UPPERCASELETTER : ('A'..'Z');
> fragment DIGIT : ('0'..'9');
> fragment DASH  : ('-');
> fragment NAMESTARTCHAR : (LOWERCASELETTER | UPPERCASELETTER);
> fragment NAMECHAR : (NAMESTARTCHAR | DIGIT | DASH);
> 
> 
> If run on a file consisting only of the line (terminated with NEWLINE)
> Aname IMPLIED
> 
> I get the following error:
> line 1:0 required (...)+ loop did not match anything at input 'Aname'
> 
>  How should I be declaring the lexer rules so that 'A' at start of line is recognized as a command token, and yet still make it possible for the "NAME" immediately following it to be unambiguously recognized?
> 

Please recall 3 facts about current ANTLR lexers:
1) they recognize tokens independently of any parsing context;
2) they do not back-track (once committed to recognizing a prefix of a
token, the rest of the input must match that token); and
3) they are greedy and happily recognize the longest valid string
possible.

(i suspect you already know the above facts, but i repeat them in case
someone in the future searches the mailing-list archive at
markmail.antlr.org and finds this message without that knowledge)

and so, as you have observed, when the input word "Aname" is seen by
your lexer it will produce the token NAME because that single token
greedily matches all of the characters in that input word.
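
The effect can be reproduced outside ANTLR with a small model. The
sketch below is only an approximation under stated assumptions: the
class name GreedyLexerDemo is hypothetical, the regular expressions are
hand transcriptions of the lexer rules quoted above, and the "longest
match wins, ties go to the earlier rule" policy is a simplification
rather than ANTLR's actual prediction machinery.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of a context-free, greedy lexer (not ANTLR itself).
final class GreedyLexerDemo {
    // Rules in grammar order; on equal length the earlier rule wins.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ATTRTYPE",
                Pattern.compile("IMPLIED|CDATA|NOTATION|ENTITY|TOKEN|ID|DATA"));
        RULES.put("ACMD", Pattern.compile("A"));
        RULES.put("NEWLINE", Pattern.compile("\\r?\\n"));
        RULES.put("SPACE", Pattern.compile(" "));
        RULES.put("NAME", Pattern.compile("[a-zA-Z][a-zA-Z0-9-]*"));
    }

    /** Returns the token type names produced for the given input. */
    static List<String> tokenize(String input) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestType = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Keep only a strictly longer match, so ties favor earlier rules.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestType = rule.getKey();
                }
            }
            if (bestType == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            types.add(bestType);
            pos += bestLen;
        }
        return types;
    }

    public static void main(String[] args) {
        // prints [NAME, SPACE, ATTRTYPE, NEWLINE]
        System.out.println(tokenize("Aname IMPLIED\n"));
    }
}
```

In this model ACMD never appears in the output: at index 0 the ACMD
rule matches one character but NAME matches five, so NAME wins and the
parser rule never sees the ACMD token it expects first.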

and so your requirement that 'A' be a command only "at the start of the
line" must be, somehow, encoded into your lexer rule(s) for command(s)
like ACMD.
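
To illustrate what encoding the line position into the lexer means,
here is a minimal hand-written sketch in Java. It is not ANTLR output
and not necessarily how an ANTLR predicate would be written; the class
name LineStartLexerSketch and the atLineStart flag are hypothetical,
and only the single 'A' command from the quoted grammar is handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical hand-written lexer: 'A' is a command only at column 0.
final class LineStartLexerSketch {
    private static final Set<String> ATTR_TYPES = new HashSet<>(Arrays.asList(
            "IMPLIED", "CDATA", "NOTATION", "ENTITY", "TOKEN", "ID", "DATA"));

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isNameChar(char c) {
        return isLetter(c) || (c >= '0' && c <= '9') || c == '-';
    }

    /** Returns the token type names produced for the given input. */
    static List<String> tokenize(String input) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        boolean atLineStart = true; // true at index 0 and right after a NEWLINE
        while (pos < input.length()) {
            char c = input.charAt(pos);
            boolean crlf = c == '\r' && pos + 1 < input.length()
                    && input.charAt(pos + 1) == '\n';
            if (atLineStart && c == 'A') {
                // The line-start test takes priority over the greedy NAME rule.
                types.add("ACMD");
                pos++;
                atLineStart = false;
            } else if (c == '\n' || crlf) {
                types.add("NEWLINE");
                pos += crlf ? 2 : 1;
                atLineStart = true;
            } else if (c == ' ') {
                types.add("SPACE");
                pos++;
                atLineStart = false;
            } else if (isLetter(c)) {
                int end = pos + 1;
                while (end < input.length() && isNameChar(input.charAt(end))) {
                    end++;
                }
                String text = input.substring(pos, end);
                types.add(ATTR_TYPES.contains(text) ? "ATTRTYPE" : "NAME");
                pos = end;
                atLineStart = false;
            } else {
                throw new IllegalArgumentException(
                        "unexpected character at index " + pos);
            }
        }
        return types;
    }
}
```

With this sketch, tokenize("Aname IMPLIED\n") yields ACMD, NAME, SPACE,
ATTRTYPE, NEWLINE, while an 'A' that begins a word later in the line is
still folded into NAME.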

i believe you can read a discussion of this issue by searching the
archives at markmail.antlr.org for messages about special tokens at the
beginning of a line.

i seem to remember (i haven't reviewed the archives) that it boils down
to 3 possibilities:

1) add one or more predicates that test the start character index of
the token to ensure that it is at the beginning of a line

2) use a rule of the form ACMD : NEWLINE 'A' ; which works for the
second and subsequent lines of input, but requires creating a special
sub-class of the input reader that always delivers a NEWLINE as the very
first character and then delivers the characters of the actual input
after it. and then of course your parser rules should not insist upon a
NEWLINE at the end of a command (because that NEWLINE is part of the
verb that starts the next command).

3) use a Lex-based lexer rather than an ANTLR-based lexer. search the
archives for jLex. Lex-based lexers are more oriented around regular
expressions -- so start-of-line and end-of-line are more easily
detected/used.
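
For possibility 2, the special input reader could look roughly like the
following sketch. The class name LeadingNewlineReader and the readAll
helper are hypothetical and exist only for illustration; wiring the
reader into a generated lexer (presumably through one of the ANTLR
runtime's reader-based character streams) is assumed and not shown or
tested here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical Reader that delivers one '\n' before the wrapped input.
final class LeadingNewlineReader extends Reader {
    private final Reader in;
    private boolean newlinePending = true;

    LeadingNewlineReader(Reader in) {
        this.in = in;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (newlinePending) {
            // First call: hand out the synthetic newline on its own.
            newlinePending = false;
            buf[off] = '\n';
            return 1;
        }
        return in.read(buf, off, len);
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Illustration helper: drains a reader into a String. */
    static String readAll(Reader r) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Reader r = new LeadingNewlineReader(new StringReader("Aname IMPLIED\n"));
        // The first character delivered is the synthetic newline.
        System.out.println(readAll(r).startsWith("\nAname"));
    }
}
```

When the whole input is already in memory, prepending "\n" to the
string before lexing would have the same effect with less code; the
wrapper is mainly of interest for streamed input.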

I also believe that Dr. Parr is looking at version 3 lexer issues such
as this one and is trying to improve things for version 4. search the
mailing list archives for Dr. Parr's posts regarding version 4.

(as an aside i think most of my mailing list search suggestions will
actually result in pointers to pages in the wiki -- i am too lazy to
actually give you those links directly, sorry)

Hope this helps...
   -jbb