[antlr-interest] help requested for selective whitespace
Scott Amort
jsamort at sympatico.ca
Fri Feb 3 13:33:07 PST 2006
Martin Probst wrote:
> Hi,
>
> maybe I should write a bit more about my other email. As far as I
> understand it, your problem is that you want single identifiers like "a"
> or "b", always length == 1, to be separated by whitespace. This doesn't
> work if you have a rule that consumes exactly one character, as you
> cannot be sure if whitespace was between the identifiers. So my idea
> is to just parse longer identifiers like this:
> IDENT: ('a' .. 'b')+;
> and then check in that lexer rule if the token was actually longer than
> one character, in which case you throw an exception with the error
> message:
> IDENT: ('a' .. 'b')+ { if ($getText.length() > 1) throw ... };
> Does that work for you?
>
> Martin
>
>
Hi Martin,
Thanks very much for the response. I have modified the grammar a fair
bit since that first message, and I was trying to use some simple
examples to explain my point. Here is the more complete and detailed
version:
The lexer needs to recognize musical note names, upper- or lower-case a
through h, as well as an alternative way of identifying the note through
solfege (do, re, mi, etc.). A variety of optional data may then be
appended to the note name to make up the full note description. For
simplicity's sake, I'll leave out the solfege options. So, roughly, I
have in the lexer:
NOTENAME
    : 'a'..'h' | 'A'..'H'
    ;
OCTAVE
    : ('0'..'9')+
    ;
DURATION
    : '*' ('0'..'9')+ ( '/' ('0'..'9')+ )?
    ;
DOT
    : '.'
    ;
And then in the parser:
note_desc
    : NOTENAME (OCTAVE)? (DURATION)? (DOT)?
    ;
The lexer ignores whitespace. Now, the problem is, I require that there
be no whitespace between any of the tokens making up note_desc, but
currently there is no distinction made between the correct input:
a8*1/4
and the incorrect input:
a 8 *1/4
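To make the problem concrete, here is a small stand-alone sketch in Python. It is not ANTLR-generated code: the regular expression and the tokenize function are hypothetical stand-ins for the four lexer rules above plus a whitespace rule whose tokens are skipped. It should show that both inputs produce the same token sequence, so the parser has nothing to tell them apart:

```python
import re

# Hypothetical stand-in for the lexer rules above (an assumption for
# illustration only, not what ANTLR actually generates).
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+)
  | (?P<NOTENAME>[a-hA-H])
  | (?P<OCTAVE>[0-9]+)
  | (?P<DURATION>\*[0-9]+(?:/[0-9]+)?)
  | (?P<DOT>\.)
""", re.VERBOSE)


def tokenize(text):
    """Return (type, text) pairs, silently dropping whitespace tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        # Whitespace is matched but never handed to the "parser".
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize("a8*1/4"))
print(tokenize("a 8 *1/4"))
```

Both calls should print the same three (type, text) pairs, which is why note_desc as written accepts the second input.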
Now, I realise that I could bring the note_desc portion of the parser
into the lexer, but then I lose the ability to easily form an AST with a
NOTENAME node and the subsequent data as child nodes, which is very
helpful for later transformations. The other option is to allow the
lexer to send WS tokens to the parser, but in all other instances it is
safe to ignore whitespace, and I really don't want (WS)? tokens
cluttering things up in the parser.
Looking at your suggestion, Martin, I see that it would work fine if I
had a fixed length note_desc, but it can be of variable length depending
on the appearance of some of the optional data.
So, you can see my dilemma! What is the best way to approach this
problem? Thanks very much for your assistance!
Best,
Scott