[antlr-interest] Lexer Predicates?

Sun Aug 3 15:28:56 PDT 2008

On August 03, 2008 1:46 PM, Gavin Lambert wrote:

> At 06:34 4/08/2008, Foust wrote:
>  >Yes... it started out that way. But to allow spaces to be part
>  >of a config value (read up to EOL), the Lexer needs to honor
>  >state. (Place spaces in the HIDDEN channel for all other cases
>  >- outside of a special config/preprocessor rule).
>
> Are you hiding the EOLs as well?  (Usually they're lumped in with
> whitespace.)
>
> If so, then you'll have to match everything in the lexer anyway,
> since the parser won't be able to see the EOL.

Good point. I had to change the syntax to read up to a parser-visible
terminator to get it to even partially work (but the whitespace was still
missing). Gathering up the tokens with += and calling toString(from, to)
returned a single value, including the original whitespace, but mysteriously
stopped the lexer from returning any more input (the rest of the file seemed
to be discarded), so I abandoned that method altogether:

       /**
        * Restores stripped whitespace to a range of tokens.
        * (Only the first and last entries of the input are used.)
        * @return String representing given range of tokens with
        *         reconstituted whitespace in between.
        */
       private String concatTokens (List matchedTokens)
       {
              int from = ((CommonToken) matchedTokens.get(0)).getTokenIndex();
              int to   = ((CommonToken) matchedTokens.get(matchedTokens.size() - 1)).getTokenIndex();
              return ((CommonTokenStream) input).toString(from, to);
       }
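
To illustrate why asking the token stream for a range of token indexes brings the whitespace back, here is a minimal, self-contained sketch. It does not use the ANTLR runtime: the Tok class, the tokenize() helper and the sample input are all hypothetical stand-ins. It only assumes that hidden tokens stay in the buffer between the visible ones, which is what the behaviour described above suggests.

```java
import java.util.ArrayList;
import java.util.List;

class TokenRangeSketch {
    /** Hypothetical stand-in for a token: its text plus a hidden-channel flag. */
    static class Tok {
        final String text;
        final boolean hidden;
        Tok(String text, boolean hidden) { this.text = text; this.hidden = hidden; }
    }

    /** Splits the source into alternating runs; whitespace runs are marked hidden. */
    static List<Tok> tokenize(String src) {
        List<Tok> toks = new ArrayList<Tok>();
        int i = 0;
        while (i < src.length()) {
            boolean ws = Character.isWhitespace(src.charAt(i));
            int j = i + 1;
            while (j < src.length() && Character.isWhitespace(src.charAt(j)) == ws) j++;
            toks.add(new Tok(src.substring(i, j), ws));
            i = j;
        }
        return toks;
    }

    /** Concatenates every token from index 'from' to 'to' inclusive, hidden ones included. */
    static String rangeText(List<Tok> toks, int from, int to) {
        StringBuilder sb = new StringBuilder();
        for (int k = from; k <= to; k++) sb.append(toks.get(k).text);
        return sb.toString();
    }

    /** What a parser rule sees: visible tokens only, so the spacing is lost. */
    static String visibleText(List<Tok> toks) {
        StringBuilder sb = new StringBuilder();
        for (Tok t : toks) if (!t.hidden) sb.append(t.text);
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Tok> toks = tokenize("title = My  Config Value");
        System.out.println(visibleText(toks));                    // spacing gone
        System.out.println(rangeText(toks, 0, toks.size() - 1));  // spacing restored
    }
}
```

The sketch says nothing about why the real lexer stopped returning input afterwards; it only shows the range-to-text idea.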

> As long as you
> have something fairly distinctive to start matching on, this
> shouldn't be hard, and you shouldn't need to do any parser->lexer
> contortions.  See how line and block comments are implemented in
> the examples.

Yes, I use something similar, if not identical, to the demos to parse
comments, and it works quite well:

LINE_COMMENT                              : '//' ~('\r' | '\n')* NEWLINE+ { skip(); };

BLOCK_COMMENT options { greedy = false; } : '/*' .* '*/'                  { skip(); };
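
For anyone unsure what the non-greedy option buys in the block comment rule, the same idea can be sketched with plain Java regular expressions. This is only an analogy under my own assumptions: the class and method names are made up, and unlike the rules above it leaves the newline after a line comment in place and knows nothing about string literals.

```java
import java.util.regex.Pattern;

class CommentStripSketch {
    // Reluctant ".*?" stops at the first "*/"; DOTALL lets '.' cross line breaks.
    static final Pattern BLOCK = Pattern.compile("/\\*.*?\\*/", Pattern.DOTALL);
    // Greedy ".*" runs on to the LAST "*/" in the input.
    static final Pattern BLOCK_GREEDY = Pattern.compile("/\\*.*\\*/", Pattern.DOTALL);
    // A line comment runs up to, but not including, the end of the line.
    static final Pattern LINE = Pattern.compile("//[^\\r\\n]*");

    /** Removes block comments first, then line comments. */
    static String strip(String src) {
        String noBlocks = BLOCK.matcher(src).replaceAll("");
        return LINE.matcher(noBlocks).replaceAll("");
    }

    /** Same as strip() for block comments, but with the greedy pattern. */
    static String stripGreedy(String src) {
        return BLOCK_GREEDY.matcher(src).replaceAll("");
    }

    public static void main(String[] args) {
        String src = "a /* x */ b /* y */ c // z\nd";
        System.out.println(strip(src));        // 'b' survives between the two comments
        System.out.println(stripGreedy(src));  // 'b' is swallowed along with both comments
    }
}
```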

> (And if you can modify the language you're parsing, now would be a
> good time to make it use a quoted string or similar instead of
> simply reading to EOL.)

Yes, thank you for the suggestion. I did have to resort to changing the
terminator, but even that didn't solve the whitespace problem.

I thought the whole point of a Domain Specific Language was to make the task
easy on the user - not on the parser-generator. It seems that the issue is
that what is intuitive to a human may in fact be some chimera of two or more
formal syntaxes. ANTLR does not handle this very well, forcing tokens to be
interpreted the same in every context. But since it allows interaction with
the target language, there are likely several ways to solve the problem.

I thought that the cleanest way to read in a free-form config {...} block (not
requiring quotes, or other syntax that might, in fact, be intended to be part
of the config setting) is to treat it as a separate language. I want to keep
the syntax as simple as possible and have no possibility of conflicting with
any other part of the language. So I solved this particular problem by:

-    using parser states

-    adding a predicate on the 'config' rule so that it is only recognized
     in the correct state

-    implementing a simple parser for just the block in question, using
     regex in the target language:

@members {
            /** config values specified in the config {} block */
            HashMap<String, String> config = new HashMap<String, String>();

            /** parse "name : value" lines from the raw text of the config block */
            private void parseConfig (String configDefs)
            {
                  String[] lines = configDefs.split("[\\r\\n]+");
                  for (String line : lines)
                  {
                        line = line.trim();
                        if (line.length() == 0) continue;  // skip blank lines
                        // split on the first colon only, so the value may contain colons
                        String[] part = line.split("\\s*:\\s*", 2);
                        String name = part[0];
                        String value = (part.length > 1) ? part[1] : "";
                        config.put(name, value);
                  }
            }
}
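
To try the line-splitting logic outside the grammar, here is a minimal standalone sketch. The class name, the sample input, the trimming of each line and the split limit of 2 (so that a value such as a Windows path can contain colons) are my assumptions for illustration, not part of the grammar above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ConfigBlockSketch {
    /** Parses "name : value" lines; everything after the first colon is the value. */
    static Map<String, String> parse(String configDefs) {
        Map<String, String> config = new LinkedHashMap<String, String>();
        for (String line : configDefs.split("[\\r\\n]+")) {
            line = line.trim();
            if (line.length() == 0) continue;            // ignore blank lines
            String[] part = line.split("\\s*:\\s*", 2);  // limit 2 keeps later colons in the value
            config.put(part[0], part.length > 1 ? part[1] : "");
        }
        return config;
    }

    public static void main(String[] args) {
        // Hypothetical block text, as it might appear between '{' and '}'.
        String block = "\n  title : My Config Value\n  path  : C:/temp/out dir\n  debug\n";
        for (Map.Entry<String, String> e : parse(block).entrySet()) {
            System.out.println("[" + e.getKey() + "] = [" + e.getValue() + "]");
        }
    }
}
```

A name with no colon (such as "debug" here) ends up with an empty value, matching the fallback in parseConfig.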

The grammar rules to 1) recognize the "config" block first, and 2) make sure
"config" is not a keyword and can be used elsewhere in the grammar, look
like this:

  start     @init {allowConfig = true;}
            : config? objectDefinitions EOF ;

  // only recognize config block before Object Definitions
  config    : {allowConfig && input.LT(1).getText().equalsIgnoreCase("config")}?=>
            NAME '{' configBlockText '}'        // block of config settings (possibly empty)
            { parseConfig($configBlockText.text); }
            ;

  configBlockText : ~'}'* ;

  objectDefinitions @init {allowConfig = false;} // config block (and keyword) no longer recognized
                ...
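
The gating idea in the rules above (a state flag combined with a case-insensitive check of the next token's text) can be sketched in plain Java, without ANTLR. Everything here is hypothetical scaffolding: the class, the counters and the pre-split token list exist only to show that "config" opens a block at the start and is an ordinary name afterwards.

```java
import java.util.Arrays;
import java.util.List;

class ConfigGateSketch {
    boolean allowConfig;   // parser state: true only before the object definitions
    int configBlocks = 0;  // times "config" was treated as the block keyword
    int plainNames = 0;    // times "config" was treated as an ordinary name

    /** Mirrors the predicate: state flag AND case-insensitive lookahead text. */
    boolean startsConfig(List<String> tokens, int pos) {
        return allowConfig && pos < tokens.size()
                && tokens.get(pos).equalsIgnoreCase("config");
    }

    void parse(List<String> tokens) {
        allowConfig = true;
        int pos = 0;
        if (startsConfig(tokens, pos)) {
            configBlocks++;
            int close = tokens.indexOf("}");                  // end of the config block
            pos = (close < 0) ? tokens.size() : close + 1;
        }
        allowConfig = false;                                  // entering object definitions
        for (; pos < tokens.size(); pos++) {
            if (startsConfig(tokens, pos)) {
                configBlocks++;                               // unreachable: the flag is off
            } else if (tokens.get(pos).equalsIgnoreCase("config")) {
                plainNames++;
            }
        }
    }

    public static void main(String[] args) {
        ConfigGateSketch g = new ConfigGateSketch();
        g.parse(Arrays.asList("Config", "{", "a", ":", "b", "}", "object", "config"));
        System.out.println(g.configBlocks + " config block(s), " + g.plainNames + " plain name(s)");
    }
}
```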

That was a lot simpler than struggling with the ANTLR lexer.

Brent
