[antlr-interest] Identifiers with Spaces

Mon Nov 29 14:33:23 PST 2010

Hi William!

On Fri, 2010-11-26 at 21:42 -0700, William Clodius wrote:
> There are workarounds for your specific problem, but in general I would suggest a complete revision of your approach.

Which other workarounds are there?  Can you give me some pointers?

Does this mean that there is no simple solution with ANTLR?

I played around with it some more and noticed that my lexer rules
are actually just regular expressions.  This is probably the usual
case for lexers.  So I just threw my problem at GNU sed, and
it solves my tokenization problem perfectly:

command: sed 's/\(a\+\( \+a\+\)*\| \|=\)/[\1]/g'
input: a aa = aa
output: [a aa][ ][=][ ][aa]

Granted, the syntax is ugly and I would have to somehow put this into
code.  But it gave me the idea of creating a simple preprocessor
that frames the identifiers with \u0002 and \u0003, so that
ANTLR can recognize them without trouble.
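
A minimal sketch of such a preprocessor, written in Python only for
brevity.  It reuses the identifier pattern from the sed command above
(runs of "a" joined by spaces); the function name and constants are
made up for illustration, and a real grammar would substitute its own
identifier character class.

```python
import re

# Hypothetical markers: the STX/ETX control characters proposed above.
START, END = "\u0002", "\u0003"

# Same identifier shape as in the sed command: runs of 'a' joined by spaces.
# This pattern is an assumption; replace it with the real identifier syntax.
IDENT = re.compile(r"a+(?: +a+)*")

def frame_identifiers(text: str) -> str:
    """Wrap each identifier (inner spaces included) in START/END markers."""
    return IDENT.sub(lambda m: START + m.group(0) + END, text)

framed = frame_identifiers("a aa = aa")
print(repr(framed))  # '\x02a aa\x03 = \x02aa\x03'
```

The lexer could then treat everything between the two markers as one
identifier token.  This assumes the raw input never contains \u0002 or
\u0003 itself; if it might, the preprocessor would have to reject or
escape them first.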

> What you are trying to do is generally better addressed during the semantic analysis than during the lexical construction. I suggest the following approach:
> 
> id_sequence : ID ID*
> 
> where ID is whatever you allow in an identifier between spaces. Then during the semantic analysis, wherever you find an id_sequence, in effect treat the first ID as a function that takes the rest of the id_sequence as an argument, returning an "identifier". This analysis can be performed recursively for each ID in the sequence. The implementation is straightforward, but tedious, and of course left to the student.

Actually the spaces are part of the identifier and are significant.
That means I would have to know how many spaces were between
consecutive IDs of an id_sequence.  I saw somebody mention that you could
somehow access the hidden channel used to ignore spaces, but I did
not find any good explanation of how to do that.
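
To make the hidden-channel idea concrete, here is a rough sketch in
plain Python, not using the ANTLR runtime.  The tokenizer keeps the
whitespace tokens but tags them as hidden; the parser would only see
the visible ones, and a later pass uses token indexes to recover the
exact spacing between two IDs.  Every name here (Token, HIDDEN,
join_id_sequence) is hypothetical and not ANTLR API.

```python
import re
from dataclasses import dataclass

DEFAULT, HIDDEN = 0, 1  # hypothetical channel numbers for this sketch

@dataclass
class Token:
    kind: str
    text: str
    channel: int
    index: int  # position in the full token list, hidden tokens included

# Toy token set matching the example input: IDs made of 'a', spaces, '='.
TOKEN_RE = re.compile(r"(?P<ID>a+)|(?P<WS> +)|(?P<EQ>=)")

def tokenize(text):
    """Split text into tokens, keeping whitespace on the hidden channel."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        channel = HIDDEN if m.lastgroup == "WS" else DEFAULT
        tokens.append(Token(m.lastgroup, m.group(0), channel, len(tokens)))
        pos = m.end()
    return tokens

def join_id_sequence(tokens, first, last):
    """Rebuild the identifier text from token index first..last inclusive,
    including the hidden whitespace tokens that lie in between."""
    return "".join(t.text for t in tokens[first:last + 1])

tokens = tokenize("a  aa = aa")
visible = [t for t in tokens if t.channel == DEFAULT]  # what a parser sees

# Suppose the parser matched visible[0] and visible[1] as one id_sequence.
name = join_id_sequence(tokens, visible[0].index, visible[1].index)
print(repr(name))  # 'a  aa'
```

The point of the sketch is only that the spaces are never thrown away:
as long as each token remembers its index in the full stream, the
original text of an id_sequence can be reassembled after parsing.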

Michael