[antlr-interest] How to retrieve free-form text between delimiters?

Mon Jul 23 02:47:50 PDT 2007

On 7/23/07, Andrew Lentvorski <bsder at allcaps.org> wrote:
> Thomas Brandon wrote:
> > On 7/23/07, Andrew Lentvorski <bsder at allcaps.org> wrote:
> >> Any advice for solving this?
> > A . does match anything, but in a parser this means any token, not any
> > character. Since the only things your lexer matches are digits and
> > whitespace, anything else is an error. You either need to move your
> > date and comment rules to the lexer or make the lexer return tokens
> > for any input that can occur in dates and comments. If you add a lexer
> > rule after the other rules like:
> > ANY: .;
> > then your example should work. However, depending on how you want to
> > process the input, moving the rules to the lexer may be a better option.
>
> Are there any examples of this I could look at?
Well, you'd want something like:
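// Parser rule: each declaration command arrives as a single token.
declaration_command
       : DATE_DCMD | COMMENT_DCMD
       ;

// Lexer rules: match everything from the keyword to the closing '$end'.
DATE_DCMD:      '$date' (.)* '$end' {System.out.println("D:"+$text);} ;

COMMENT_DCMD:   '$comment' (.)* '$end' {System.out.println("C:"+$text);} ;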

>
> What are the up/downsides of using an ANY vs. moving this back further
> into the parser?
Adding an ANY rule means that each comment or date will consist of many
tokens. Moving it to the lexer means you will have a single token for the
entire date or comment, including the delimiters, which complicates any
processing of it.
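If you do go the single-token route, getting rid of the delimiters is
not much work. Here is a rough sketch, assuming the Java target; the
class and method names (DeclarationText, body) are made up for
illustration and are not part of ANTLR:

```java
// Hypothetical helper, not part of ANTLR: strips the opening keyword and
// the closing '$end' from the text of one DATE_DCMD or COMMENT_DCMD token.
final class DeclarationText {
    private DeclarationText() {}

    /**
     * Returns the free-form text between the two delimiters, trimmed.
     * Assumes tokenText starts with open and ends with close, which is
     * what the lexer rules above would produce.
     */
    static String body(String tokenText, String open, String close) {
        if (tokenText.length() < open.length() + close.length()
                || !tokenText.startsWith(open)
                || !tokenText.endsWith(close)) {
            throw new IllegalArgumentException(
                "text is not delimited by " + open + " ... " + close);
        }
        return tokenText
            .substring(open.length(), tokenText.length() - close.length())
            .trim();
    }

    public static void main(String[] args) {
        // prints "Mon Jul 23 2007"
        System.out.println(body("$date Mon Jul 23 2007 $end", "$date", "$end"));
        // prints "free-form text"
        System.out.println(body("$comment free-form text $end", "$comment", "$end"));
    }
}
```

You could call something like this from the parser on the token's text,
or from the lexer action itself if your ANTLR version lets you replace
the token text there.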
If the structure of a part of your language is not known, or you do not
need to analyse the structure, then handling it as a single lexer token
is a good path. For instance, actions in ANTLR are handled this way,
as ANTLR does not know much about their structure. Similarly,
comments are handled in the lexer as they have no relevant structure.
Given that the language defines no format for dates or comments, they
are probably best handled in the lexer. If you do want to add support
for possible date formats, then you might want to move it into the
parser, but using standard tokens rather than an ANY token (i.e. add ID,
';' and ',' rules to the lexer).
It's more a matter of personal preference, but I would avoid a catch-all
ANY rule, as it is not a meaningful tokenisation of the input and
will push some lexer errors into the parser (instead of an invalid
character in the lexer, you will have an invalid token in the parser
that could be in any rule).

Tom.
>
> If it helps, the format of the file (complete with kinda funky grammar
> description as an image, of all things) is here:
> http://www-ee.eng.hawaii.edu/~msmith/ASICs/HTML/Verilog/LRM/HTML/15/ch15.2.htm
>
> It's not that complicated.  I have built hand/regex parsers for it
> before, but I wanted to actually try out ANTLRWorks and ANTLR on a real
> problem rather than just toy stuff.
>
> -a
>