[antlr-interest] Customizing token separators without recompiling

Steve Cooper steve at stevecooper.org
Sun Jun 7 17:14:19 PDT 2009


I don't know if this is any closer, but I had this idea.

Your problem seems to be getting a lexer that will give you the
right stream of tokens, not writing the parser that feeds off
them. You could write your own lexer to split the strings, and use
ANTLR to write the parser. ANTLR parsers don't feed directly off a
string, but off an ITokenSource object:

    public interface ITokenSource
    {
        string SourceName { get; }
        IToken NextToken();
    }

You could create your own token source that does the separation
by hand and returns a stream of tokens. Something like this:

    public class UnEdifactLexer: ITokenSource
    {
        // token types
        public const int EOF = -1;
        public const int ID = 0;
        public const int NUMBER = 1;
        public const int COLON = 2;
        ...

        // all the tokens in the input
        private Queue<IToken> tokens;

        // ITokenSource also requires a SourceName property
        public string SourceName { get { return "UN/EDIFACT input"; } }

        public UnEdifactLexer(string input, char userSeparator)
        {
            this.tokens = new Queue<IToken>();
            foreach(var line in input.Split('\n'))
            {
                foreach(var piece in CustomSplit(line, userSeparator))
                {
                    // custom code to convert a line
                    // into a set of tokens
                    tokens.Enqueue(new Token(...));
                }
            }
        }

        public IToken NextToken()
        {
            if (tokens.Count > 0)
                return tokens.Dequeue();
            else
                return new Token(EOF,...);
        }
    }
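
For the CustomSplit helper, here's a minimal sketch of what I mean. The
name and behaviour are just my assumptions, not anything from the ANTLR
runtime: it splits one line on a single user-chosen separator character,
and honours a UN/EDIFACT-style release (escape) character so an escaped
separator stays literal.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SegmentSplitter
{
    // Hypothetical sketch of the CustomSplit helper used above.
    // Assumes a single-character separator plus a release (escape)
    // character; a real UN/EDIFACT splitter may need more than this.
    public static List<string> CustomSplit(string line, char separator, char release)
    {
        var pieces = new List<string>();
        var current = new StringBuilder();
        for (int i = 0; i < line.Length; i++)
        {
            char ch = line[i];
            if (ch == release && i + 1 < line.Length)
            {
                // release character: take the next character literally
                current.Append(line[++i]);
            }
            else if (ch == separator)
            {
                // end of the current piece
                pieces.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(ch);
            }
        }
        pieces.Add(current.ToString());
        return pieces;
    }
}
```

So CustomSplit("NAD+BY+123?+456", '+', '?') would give you "NAD", "BY",
and "123+456", with the escaped '+' kept inside the last piece.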

Then you write a parser grammar in ANTLR which does the parsing and
tree-building.
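
For illustration only, such a parser grammar might look something like
this (ANTLR 3 syntax; the rule and token names here are invented to
match the constants above, not taken from any real UN/EDIFACT grammar):

```antlr
// Hypothetical parser-only grammar driven by UnEdifactLexer's tokens.
parser grammar UnEdifact;

tokens { ID; NUMBER; COLON; }

message : segment+ EOF ;
segment : ID element* ;
element : COLON ( ID | NUMBER ) ;
```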

Anyway, the benefit of this approach is that you have full power over
splitting up the strings and converting them into tokens. After that,
the parser takes up the strain.

    Steve
