[antlr-interest] nested parsing (BSDL)

Harald Mueller harald_m_mueller at gmx.de
Thu Jan 3 05:47:24 PST 2008


Hi - 

You wrote a very intersting (and very long) email about your experience and frustrations with ANTLR (and AntlrWorks) - I would heartily invite anyone (especially the core ANTLR team) to read through this, at least your final "tree of bugs" ...

... even though I do not agree with you on many points (or maybe only on a few central points). I'll try to address these items - and then let the ANTLR people find out what should be done ...

Yesterday, I've posted another solution to your problem - you may or may not have read it - which actually does have only a single grammar file. It is maybe not completely fair, but I'll use that solution to compare what we did.
First, I think we both ended up with the same concept - having that rule (structured_string | structure) somewhere which allows both a string-enclosed and a non-enclosed version of the same syntactic structure.

-----
Here's a short comparison of your approaches:

>    - Single grammar file
"Me too."

>    - Makes good AST trees if you want them
"Me too," I'd say - the AST looks exactly as if the text in the string was written verbatim in the file.

>    - Use of AST trees is not necessary
>      Can use one assignment to a data structure or a function call
>      for more complicated assignments in each major rule.
Me not really: The separate parser might make it necessary to share some state over both of them by passing the "parent parser" to the "child parser"; however, this is a standard delegation pattern which costs only a few lines.

>    - doesn't have lex/yacc anachronisms like predefining
>      lexical tokens.
"Me too."

>    - extending to support other structured strings does not require
>      anything more than new grammar rules
Here, I need a new method which sets up a lexer+parser to run inside the string; however, even this method can be generalized (nicely in languages with generics) so that only 2 lines or so of calling code remain.
(I assume that the lexical syntax of all structured strings in one grammar is the same - e.g., they all are combined with & and can have nested comments, but nothing else).

>    - If it wasn't for a confounding series of bugs, the single
>      grammar file would have been conducive to using antlrworks
>      debugger.
See first item.

>    - yours uses InnerParser.  I can't find that documented anywhere.
??? It's a parser class created from the grammar I defined in Inner.g in the two-grammar (+lexer grammar) version.

> Thus it keeps my options open.
I hope mine (the single grammar file version) would also.

> Yours has an advantage if the island grammar is lexically incompatable
> with the parent grammar.   However, I thought I was pretty clear about
> wanting to do this with one grammar.
Yes - these are definitely two different kind of fish ...

> 
> 
> The basic form is:
> 
>     rule: blah blah (ugly_string | parse_inside_ugly_string) blah blah;
> 
> Where "ugly_string" is a lexer rule that matches the string and
> then embeds the text  and "parse_inside_ugly_string" is a parser rule that
> parses inside the string.   You end up with a grammar that will parse
> a file with or without quotes on the subordinate grammar.

...

Re Antlrworks, I must say that I dont use it ... I'm just much faster setting up a new (or copying on of my) small environment for running something; re gramamr debugging, up to now my knowledge of LL(k) parsing and the (not always obvious) ANTLR rules have allowed me to find grammar troubles in acceptable time ... but this is a very personal judgment.

> [BUG]Unfortunately, trying to define a psuedo-global or even global
> in antlr is a frustrating experience, particularly since I am
> using antlrworks and everthing must be in the parser file.

I'm not sure what you mean here. But defining something global in Java could be done by just creating a static inner class, which would go into the @members section - but that's not an ANTLR issue.

Or do you mean you want to share something between lexer and parser? why don't you just pass them a common context which holds the data you want to exchange (except that ... sigh ... any such behavior will most probably not be called in AntlrWorks, and hence you lose some debugging capability ... probably exactly for the nerve-racking issues you're trying to solve ...).


The next items you write essentially want to control the lexer from the parser: As far as I understand ANTLR 3, it is simply not designed for this; rather, the idea is that the tokization happens *completely up front*. I - as you - can cite numerous cases where this is not acceptable (e.g. reading from an endless stream - pipe, network, ...), but at the moment, this seems to be frozen.
E.g.:
> [BUG]ANTLR doesn't appear to be smart enough yet to automatically gate the
> lexar based on the parser context.   Seems that this could potentially
> reslove a lot of grammar issues.   One example would be the use of
> a keyword as an identifier.
>    - Pass a bitmap into the lexer indicating the valid tokens
>      at this level of grammar.

Yes. The alternative is to move up such decisions into the parser (I wrote a posting about "lexing parsers" some time ago). It helps to solve such issues - whether it is "beautiful" ... well ...

[...]
> I have spent an entire day trying to do something (communicate
> a single bit of information to the lexer) that should have been
> trivial.

A trivial answer to this is: You sholdn't have tried so long ... simply moving to the parser level would have solved that issue as I see it - but this obviously said with hindsight (and after I also learned the issue of up-front tokenization the hard way; ANTLR 2, e.g. could switch lexers and parsers on the same input stream - I concept I used in a few places, and where I'm now scrathcing my head how I should replace this in ANTLR 3 [currently, I simply don't upgrade those projects ...]).

> [BUG]Now because antlrworks only handles Java, I am going to need C
> and Java code for the dirty tricks.   And since antlr lacks support
> for conditionally including multiple languages, ...

As I said, I don't use AntlrWorks, so I can't say why this support is not (or not fully) there ...


> STRING_LITERAL:
>       '"' STRING_CONTENTS_FRAGMENT '"'
>         (WHITESPACE_COMMENT_FRAGMENT* '&' WHITESPACE_COMMENT_FRAGMENT* 
> '"'STRING_CONTENTS_FRAGMENT '"')*
>        {
>          int i;
>          boolean instring=false;
>          boolean incomment=false;
>          String s;
>          StringBuffer d = new StringBuffer(65536);
>          s=getText();
>          for(i=0; i<s.length(); i++) {
>             // if(s.charAt(i) != '"') d.append(s.charAt(i));
>             if(!incomment && s.charAt(i) == '"') {
>                 instring = !instring;
>                 continue;  // don't want to add character to string below
>             }
>             if(!instring && s.charAt(i) == '-') { // since it has already 
> been lexed, one dash is enough
>                incomment=true;
>             }
>             if(s.charAt(i) == '\r' || s.charAt(i) == '\n') {
>                incomment=false;
>             }
>             if(instring) d.append(s.charAt(i));
> 
>          }
>          setText(d.toString());
>          // char x[] = { 'a', 'b', 'c', 'd' };
>          // PUSHSTREAM(new pANTLR3_INPUT_STREAM(new 
> ANTLRStringStream(x,4)));
>          // PUSHSTREAM(new pANTLR3_INPUT_STREAM(new 
> ANTLRStringStream(d.toString())));
> 
>          // Adapted from ANTLR FAQ: How do implement include files?
>           SaveStruct ss = new SaveStruct(input);
>           includes.push(ss);
> 
>           // switch on new input stream
>           setCharStream(new ANTLRStringStream(d.toString()));
>           reset();
>       }
>       ;

I would really hope that any solution which does not do such manual parsing is to be preferred (like mine [the last one]???). Yet, I see that such things are possible and even FAQ'd ... not happy with this.

> In response to one of your comments that what I was trying to do,
> parse inside strings, is unusual, I think that is far from the case.
> >From a programming language perspective, it is unusual.   For data
> files, though, it is very common.

Granted! - I have not done much work with such (for whatever reasons - just happens to be so).

> One example would be SVG where they made the huge mistake of

... well, badly design languages are certainly an issue - although the question up to which point a tool must support which strange things is probably a very open-ended one.

> [BUG]Not parsing inside strings

Bug ... well: I think the 10 lines or so I need to set up an additional lexer+parser which then runs through the string with the grammar already there isn't that much - so whether ANTLR needs to support anything more here, I doubt (except if you want to use AntlrWorks on such nested things: Then, an ANTLR-language-level notation is needed ... AntlrWorks people: What would you say?)

> Some examples of structured stings that are likely to be common
> in data files:
[...]

Oh - I must confess that only now I understand your view - and I must agree: All those file formats which declare "string escapes" = say that they might or might not enclose something in a string only to protect some text from "another semantics" (e.g., when a comma means "column separator" in a file; but should mean "list separator" for a single-column entry at some point) are candidates for such optional in-string/directly-in-text parsing switching.
-> you are right, this is much more frequent than I thought.

> 
> [BUG]Thus, in the following example (where a string has all the baggage 
> of
> the BSDL example), a very simple notation such as
>      LEXER_TOKEN --> STRIPANDPARSE( LEXER_FRAGMENT, root_rule)
> could be used to say:
>      When you encounter a LEXER_TOKEN, discard the portion that
>      does not match one or more occurances of LEXER_FRAGMENT in the
>      original lexer rule and then parse the results inline starting
>      with root_rule.
> 
> Simple notation, [...]

I still would not call it a bug ... still a suggestion to be thought about - at least one (they - the ANTLR people; we - the ANTLR users) must provide easy idioms how to do this if it is not supported internally by ANTLR.


> In your example, you wrote:
>     : ATTRIBUTE^ ID OF! ID COLON! ENTITY! IS! expression SEMI! ;
> took me a minute, then I realized you were stripping away the
> unnecessary cruft using "!".   Although I was aware of "!", I hadn't
> thought of using it that aggressively, yet.    

... and probably Terence kills me for this - instead of writing

: ATTRIBUTE i1=ID OF! i2=ID COLON ENTITY IS expression SEMI
   -> ^(ATTRIBUTE $i1 $i2 expression)

The many !s grew "historically" - first, I thought I'd keep everything except of and is, then I say the SEMI, then I removed ENTITY also ...

> I would have been done by now if I had written a recursive descent
> parser by hand.   I have been trying to learn antlr in the hope
> that, having done so, I would be able to do similar things faster.
> In light of my experiences so far, that hope is fading fast.

To some degree, I have to agree with you: Even though Terence has written a great book, it is much too "main-stream compiler oriented." Many many aspects of what one can do with languages (and what languages can do to one) are not covered anywhere - even the Wiki is sparse on such issues. In the last 2 weeks, I have written maybe 20 smaller ANTLR grammars in trying to answer various questions on this mailing list. Maybe I should just throw them into the Wiki as-is (but of course with some comments) ... with the double risk that people like you are even more overwhelmed; and that people more in the knowing than I tear my "solutions" apart - but at least there'd be somewhat more to find ...

> 
> [bug] The problem isn't so much the individual bugs but the fact
> that every most attempts to workaround an existing bug is
> compounded by additional bugs.   

In one of my first postings here, I got the (IMHO somewhat harsh) answer (I paraphrase) "you will see that it can be done in ANTLR 3 if you just learn to think the ANTLR 3 way ..." - and I just were inclined to answer the same way. However, this is not a solution. When someone (I, maybe) finds time, we could/should add your list of "bugs" to the Wiki and give some useful explanation on how you could have proceeded differently (and, in the best of all worlds, also explain the troubles yo might have encountered then ...); I'll at least try to do this a little bit.

> The tree of compounding bugs below is
> only a partial list:
>    - bug: antlr doesn't handle parsing inside strings
>      workaround: switch the input by messing with the internals
>      - bug: antlrworks doesn't suport debugging C code
>      - workaround: code the dirty tricks in Java as well as C
>        - bug: Java runtime lacks InnerParser like functionality
>          (not that innerparser itself is quite right for the job
>          since it requires separate file).
>          workaround: try to duplicate the functionaly using the
>          include file example.
>           - bug: the include file example only works in the lexer
>               and you can't pass arguments to lexer rules.   Need
>               parser context to enable and disable dirty tricks.
>             workaround: try to set a variable
>               - bug: variables aren't shared between lexer and parser
>                 workaround: use scope
>                   - bug: that doesn't work
>                 workaround: use @members
>                   - bug: that doesn't work
>                 workaround: use a global variable (not reentrant)
>                    - bug: straightjacketed java language doesn't allow
>                      globals:
>                      workaround: declare a global class in @header
>                          - bug: it puts the header before the imports
>                            and Java won't let you declare a class before
>                             the imports
>                workaround: try to put the variable in the lexer and
>                access it from the parser by following the class
>                chain from lexer to parser
>                   - bug: antlr doesn't provide ready access to
>                     the current parser and lexer.  PARSER->, LEXER->
>                     Workaround: follow the class chain
>                      - bug: antlr runtime doesn't preserve class chain
>                        linkages, for example in TokenSource so
>                        you can't get from the parser class to the
>                        lexer class public members.
>                        workaround: try to go from lexer to parser
>                           - bug: doesn't look like that will work, either.
>            Workaround: try to use rule arguments
>               - bug: you can't pass rule arguments to lexer rules.
> 
>        - bug: antlrworks doesn't support multiple target language actions
>            or conditional inclusion
>          workaround: preprocess grammar file
>             - bug: antlrworks doesn't play nicely with make
>              workaround:  use a weird syntax
>      - workaround: define two parser rules that parse the
>        same token with different actions based on context
>         - bug: can't communicate parser to lexer
>           workaround: use two lexer rules that match the same token
>           but are never called in the same parser context
>            - bug: antlr can't pick the lexer rule based on context
>    workaround: set parse_inside_string in token rules for tokens
>          which match before the string to be parsed, even though
>          this violates one of ANTLRs primary selling points of not
>          having to define tokens seperately.
>         - bug: this
>           PIN_MAP_STRING: 'PIN_MAP_STRING' {
>                parse_inside_string=true;
>           };
>           Generates this:
>             "mismatched input 'PIN_MAP_STRING' expecting PIN_MAP_STRING"
>           But 'PIN_MAP_STRING' in a parser rule does not.
>           Workaround: move IDENTIFIER to bottom of file since ANTLR
>           handles precedence by order in file rather than an explicit
>           precedence.
> 
> [bug]When parsing inside a string, line numbers printed are wrong and
> it doesn't print the number the string (or include file) was included
> from.

THIS is definitely a consequence of manually implemented nested lexers and parsers - but it should be easy to fix this (telling a lexer that it should start with file/line so-and-so) ... maybe someone knows how to do this???

> 
> Anyway, I finally got it to parse the entire file.

What can I say?

Best regards
Harald

-- 
GMX FreeMail: 1 GB Postfach, 5 E-Mail-Adressen, 10 Free SMS.
Alle Infos und kostenlose Anmeldung: http://www.gmx.net/de/go/freemail


More information about the antlr-interest mailing list