[antlr-interest] help requested for selective whitespace

Fri Feb 3 13:33:07 PST 2006

Martin Probst wrote:
> Hi,
>
> maybe I should write a bit more about my other email. As far as I
> understand it, your problem is that you want single identifiers like "a"
> or "b", always length == 1, to be separated by whitespace. This doesn't
> work if you have a rule that consumes exactly one character, as you
> cannot be sure whether whitespace was between the identifiers. So my idea
> is to just parse longer identifiers like this:
> IDENT: ('a' .. 'b')+;
> and then check in that lexer rule if the token was actually longer than
> one character, in which case you throw an exception with the error
> message:
> IDENT: ('a' .. 'b')+ { if ($getText.length() > 1) throw ... };
> Does that work for you?
>
> Martin
>
>   
Hi Martin,

Thanks very much for the response.  I have modified the grammar a fair 
bit since that first message, and I was trying to use some simple 
examples to explain my point.  Here is the more complete and detailed 
version:

The lexer needs to recognize musical note names, upper or lower case a 
through h, as well as an alternative method of identifying the musical 
note through solfege (i.e. do, re, mi, etc.).  Then, there is a variety 
of optional data that may be appended to that note name to make up the 
full note description.  For simplicity's sake, I'll leave out the 
solfege options.  So, roughly, I have in the lexer:

NOTENAME
  : 'a'..'h' | 'A'..'H'
  ;

OCTAVE
  : ('0'..'9')+
  ;

DURATION
  : '*' ('0'..'9')+ ( '/' ('0'..'9')+ )?
  ;

DOT
  : '.'
  ;

And then in the parser:

note_desc
  : NOTENAME (OCTAVE)? (DURATION)? (DOT)?
  ;

The lexer ignores whitespace.  Now, the problem is, I require that there 
be no whitespace between any of the tokens making up note_desc, but 
currently there is no distinction made between the correct input:

a8*1/4

and the incorrect input:

a 8 *1/4
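
To make the problem concrete, here is a rough hand-written Java sketch of a whitespace-skipping lexer for the four rules above. It is not the ANTLR-generated lexer; the class name NoteLexerSketch and the "TYPE:text" token strings are hypothetical, chosen only for illustration. Because whitespace is dropped before the parser sees anything, both inputs yield the same token list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the generated lexer: it skips whitespace and
// emits NOTENAME, OCTAVE, DURATION and DOT tokens as "TYPE:text" strings.
final class NoteLexerSketch {

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace is discarded, so the parser never learns it was there
            } else if ((c >= 'a' && c <= 'h') || (c >= 'A' && c <= 'H')) {
                tokens.add("NOTENAME:" + c);
                i++;
            } else if (isDigit(c)) {
                int start = i;
                i = skipDigits(input, i);
                tokens.add("OCTAVE:" + input.substring(start, i));
            } else if (c == '*') {
                int start = i;
                i = requireDigits(input, i + 1);
                if (i < input.length() && input.charAt(i) == '/') {
                    i = requireDigits(input, i + 1);
                }
                tokens.add("DURATION:" + input.substring(start, i));
            } else if (c == '.') {
                tokens.add("DOT:.");
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character at index " + i);
            }
        }
        return tokens;
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Returns the index of the first non-digit at or after 'from'.
    private static int skipDigits(String s, int from) {
        int i = from;
        while (i < s.length() && isDigit(s.charAt(i))) {
            i++;
        }
        return i;
    }

    // Like skipDigits, but insists on at least one digit.
    private static int requireDigits(String s, int from) {
        int end = skipDigits(s, from);
        if (end == from) {
            throw new IllegalArgumentException("digits expected at index " + from);
        }
        return end;
    }
}
```

Calling NoteLexerSketch.tokenize on "a8*1/4" and on "a 8 *1/4" gives equal lists, which is exactly why the parser rule cannot tell the two apart.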

Now, I realise that I could bring the note_desc portion of the parser 
into the lexer, but then I lose the ability to easily form an AST with a 
NOTENAME node and the subsequent data as child nodes, which is very 
helpful for later transformations.  The other option is to allow the 
lexer to send WS tokens to the parser, but in all other instances it is 
safe to ignore whitespace, and I really don't want (WS)? tokens 
cluttering things up in the parser.
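
For comparison, the first option (matching the whole note description as one lexer-level unit) can be sketched in Java with a single regular expression. This is only an illustration under the simplified rules above; the names FlatNoteDescSketch and isNoteDesc are hypothetical and not part of my grammar. It does reject embedded whitespace, but it also shows the drawback: the match is one flat string rather than separate NOTENAME, OCTAVE, DURATION and DOT tokens.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: the whole note description treated as a single token.
final class FlatNoteDescSketch {

    // NOTENAME, then optional OCTAVE, optional DURATION, optional DOT,
    // with nothing (in particular no whitespace) allowed in between.
    static final Pattern NOTE_DESC =
            Pattern.compile("[a-hA-H]([0-9]+)?(\\*[0-9]+(/[0-9]+)?)?\\.?");

    // True only if the entire text is one well-formed note description.
    static boolean isNoteDesc(String text) {
        return NOTE_DESC.matcher(text).matches();
    }
}
```

Here isNoteDesc accepts "a8*1/4" and rejects "a 8 *1/4", since matches() requires the pattern to cover the whole input.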

Looking at your suggestion, Martin, I see that it would work fine if I 
had a fixed length note_desc, but it can be of variable length depending 
on the appearance of some of the optional data.

So, you can see my dilemma!  What is the best way to approach this 
problem?  Thanks very much for your assistance!

Best,
Scott