[antlr-interest] Re: Still having problems with the lexer code

Mon May 13 02:11:45 PDT 2002

Hi,

On Thu, May 09, 2002 at 10:56:18PM -0000, johnclarke72 wrote:
> I hope that I am not asking much how could I get this to work ?  
> Would it also be possible to explain why it would work in the 
> ammended version ?

There are a number of things that can go wrong in the setup of a project
with multiple parsers/lexers:

- Depending on how you organized the files and set up the import/export
  vocabs, and depending on the order of compilation, the different
  parsers/lexers in your project might end up with a different understanding
  of the tokens.
- Mistakes in begin/end tokens and lexer switching.

The files you sent in probably have both of these problems, in addition to
the mistake Terence noted.

DISCLAIMER: I did not really try to understand the exact parsing problem
you have. I'm assuming you're parsing HTML and want to treat comments
differently, or something like that.

Your HTMLParserApp.java looks good.

Suppose:

<htmlstuff>
<!-- a comment -->
</htmlstuff>

Just after the parser is started, the selector is using the textlexer, so
this one starts recognizing 'words' until the begin tag "<!-" is seen. This
tag is read from the input stream and then the taglexer is started. At that
point the file pointer is after the "<!-", so in the taglexer you have to
remove the "<!--", as Terence noted. (Note also the difference in tokens,
"<!-" vs. "<!--". That is probably a typo; antlr does very little to protect
you from typos like this :( )

Ok. Now we're inside the taglexer. Here you'd want to keep on reading with
the taglexer up to and including the close tag "-->". After that you need to
switch back to the textlexer, since we have more file to parse and we're
definitely not in an HTML comment anymore. E.g. add an action to the close
token:

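HTMLCOMMENT :
   (options { greedy=false; }: .) *
   // Select the *text* lexer here, by the name it was registered under with
   // the selector; "HTMLTextLexer" is an assumed name.
   "-->" { selector.select("HTMLTextLexer"); }
;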

So now you have switched back to the textlexer (note that we now enter the
textlexer with the input stream positioned after the close tag). With that,
the mechanics of switching between lexers should be ok. (Never try to switch
lexers from the parser unless you really know what you are doing; in general
this does not work.)
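
To make the switching mechanics a bit more concrete, here is a small
stand-alone Java sketch. It does not use antlr at all: the class name, the
Mode enum and the "TEXT:"/"COMMENT:" token strings are made up for
illustration. It only models the point above, that the lexer which sees a
delimiter consumes it, so the other lexer starts reading right after it.

```java
import java.util.ArrayList;
import java.util.List;

public class LexerSwitchSketch {
    // Hypothetical mode names standing in for the textlexer and the taglexer.
    enum Mode { TEXT, COMMENT }

    static final String BEGIN = "<!--";
    static final String END = "-->";

    /** Splits input into "TEXT:..." and "COMMENT:..." tokens, switching modes on the delimiters. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Mode mode = Mode.TEXT;
        int pos = 0;
        while (pos < input.length()) {
            if (mode == Mode.TEXT) {
                int begin = input.indexOf(BEGIN, pos);
                int stop = (begin < 0) ? input.length() : begin;
                if (stop > pos) {
                    tokens.add("TEXT:" + input.substring(pos, stop));
                }
                if (begin < 0) {
                    break; // no more comments, the rest was plain text
                }
                // The begin tag is consumed in text mode, so comment mode
                // starts *after* it and must not expect to see it again.
                pos = begin + BEGIN.length();
                mode = Mode.COMMENT;
            } else {
                int end = input.indexOf(END, pos);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated comment");
                }
                tokens.add("COMMENT:" + input.substring(pos, end));
                // The close tag is consumed in comment mode; then switch back.
                pos = end + END.length();
                mode = Mode.TEXT;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<p><!-- a comment --></p>"));
    }
}
```

Running main prints [TEXT:<p>, COMMENT: a comment , TEXT:</p>]. If the
switch back to text mode is left out, everything after the first comment is
read in the wrong mode, which is the kind of mistake described above.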

Next problem. You may have noticed that in the multiLexer example a Common
vocabulary is imported in several places. It is used to make sure that all
lexers/parsers have the same understanding of the tokens (e.g. that both
lexers return the same values for common tokens to the parser).

You can do this by A) using the CommonTokenTypes import trick as in the
multiLexer example, or B) chaining the lexers/parsers and observing the right
order in processing them with antlr.

Option B could look like this:

In the textlexer:

   exportVocab = Text;

In the taglexer:

   importVocab = Text;     // get previous definitions
   exportVocab = HTMLTags; // Export the Vocabulary to HTMLTags

In the parser:

   importVocab = HTMLTags; // The Vocabulary to import

Then *always* observe the following order in processing the files with antlr:
1: textlexer 2: taglexer 3: parser.

You can always check the generated xxxTokenTypes.txt files to see what
numbers are given to tokens.
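
Why the chaining matters can be shown with a toy model in plain Java. This
is only a sketch, not how antlr implements vocabularies: the Vocab class,
the starting number 4 and the token names WORD and COMMENT_BEGIN are
assumptions made up for the example (HTMLCOMMENT is the rule from above).
Each vocabulary hands out the next free number to a new token name, and
"importing" simply copies the earlier definitions first.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class VocabSketch {
    // Arbitrary first token number for this toy model.
    static final int FIRST_TYPE = 4;

    /** Toy vocabulary: token names mapped to sequentially assigned numbers. */
    static class Vocab {
        final Map<String, Integer> types = new LinkedHashMap<>();

        Vocab() {}

        // "Importing" copies the earlier definitions so their numbers are kept.
        Vocab(Vocab imported) {
            types.putAll(imported.types);
        }

        /** Returns the existing number for name, or assigns the next free one. */
        int define(String name) {
            Integer existing = types.get(name);
            if (existing != null) {
                return existing;
            }
            int next = FIRST_TYPE + types.size();
            types.put(name, next);
            return next;
        }
    }

    public static void main(String[] args) {
        Vocab text = new Vocab();
        text.define("WORD");
        text.define("COMMENT_BEGIN");

        // Without importing, the second lexer numbers from scratch and
        // HTMLCOMMENT gets the same number as WORD in the first lexer.
        Vocab unchained = new Vocab();
        System.out.println("unchained HTMLCOMMENT = " + unchained.define("HTMLCOMMENT"));

        // With importing, the earlier numbers are kept and new ones follow.
        Vocab tags = new Vocab(text);
        System.out.println("chained HTMLCOMMENT = " + tags.define("HTMLCOMMENT"));
        System.out.println("chained WORD = " + tags.define("WORD"));
    }
}
```

In the unchained case a parser that was built against one numbering would
misread the tokens coming from the other lexer. This is the situation the
processing order above is meant to avoid.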

Hope this helps,

Ric
-- 
-----+++++*****************************************************+++++++++-------
    ---- Ric Klaren ----- klaren at cs.utwente.nl ----- +31 53 4893722  ----
-----+++++*****************************************************+++++++++-------
     Human beings, who are almost unique in having the ability to learn
   from the experience of others, are also remarkable for their apparent
         disinclination to do so. --- Douglas Adams, Last Chance to See
