[antlr-interest] Reading block of arbitrary text delimited by curly braces

Fri Jul 20 11:18:21 PDT 2012

Don't use greedy = false, as you are using ~'}' anyway, which will exit
the loop at the first '}'.

Also this:

((' '|'\t'|'\r'|'\n')+)=>' '

isn't needed because you only accept a space anyway and there must be at
least one space. So, the rule is:

BLOCK
         : 'BLOCK'
              (' '|'\t'|'\r'|'\n')+
                (   ('{')=>'{'
                  | { error(MISSING_LBRACE); }
                )
                (~'}')*
                (   ('}')=>'}'
                  | { error(MISSING_RBRACE); }
                )
         ;

Make sure you use new lines and balance the parens so you can see what the
rule is doing visually.
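
If it helps to see the same logic procedurally, here is a rough hand-written
sketch in Python of what that rule does. It is not ANTLR output: scan_block,
the errors list and the message strings are names made up for illustration,
and it assumes the scan starts exactly at the 'BLOCK' keyword.

```python
WHITESPACE = " \t\r\n"


def scan_block(text, pos=0):
    """Scan one BLOCK token starting at pos.

    Returns (body, next_pos, errors). Mirrors the lexer rule: keyword,
    at least one whitespace character, '{', anything but '}', then '}'.
    """
    errors = []
    if not text.startswith("BLOCK", pos):
        raise ValueError("not at a BLOCK keyword")
    pos += len("BLOCK")

    # (' '|'\t'|'\r'|'\n')+
    # The grammar would fail to match here; this sketch just records it.
    ws_start = pos
    while pos < len(text) and text[pos] in WHITESPACE:
        pos += 1
    if pos == ws_start:
        errors.append("MISSING_WHITESPACE")

    # ( ('{')=>'{' | { error(MISSING_LBRACE); } )
    if pos < len(text) and text[pos] == "{":
        pos += 1
    else:
        errors.append("MISSING_LBRACE")

    # (~'}')*
    body_start = pos
    while pos < len(text) and text[pos] != "}":
        pos += 1
    body = text[body_start:pos]

    # ( ('}')=>'}' | { error(MISSING_RBRACE); } )
    if pos < len(text) and text[pos] == "}":
        pos += 1
    else:
        errors.append("MISSING_RBRACE")

    return body, pos, errors
```

Each optional brace is checked with one character of lookahead and either
consumed or reported, so the scan keeps going after an error instead of
giving up on the token.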

As I said earlier - trying to signal the parser from the lexer does not
work because the lexer will generally scan everything, then the parser
will run. You need the LEXER to issue the error message, not the parser
(where it is too late).
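
To make the ordering problem concrete, here is a toy model in Python (not
ANTLR; ToyLexer, error_flag and the one-block-per-line input are all invented
for this sketch). The lexer scans the whole input before any "parsing"
happens, so a shared flag already reflects the last bad block by the time
the first good block is examined.

```python
class ToyLexer:
    """Toy lexer: one BLOCK per input line, scanned eagerly."""

    def __init__(self, lines):
        self.lines = lines
        self.error_flag = False   # shared flag, like errorFlag in the question
        self.errors = []          # (line_number, message) pairs

    def tokenize(self):
        tokens = []
        for line_number, line in enumerate(self.lines, start=1):
            ok = line.startswith("BLOCK {") and line.endswith("}")
            if not ok:
                self.error_flag = True
                self.errors.append((line_number, "malformed BLOCK"))
            tokens.append(("BLOCK", line_number, line))
        return tokens


lexer = ToyLexer(["BLOCK { first }", "BLOCK { second }", "BLOCK { third"])
tokens = lexer.tokenize()        # the whole input is scanned here

# "Parsing" starts only now. The flag is already set while we look at
# the first, perfectly valid, block.
first_token = tokens[0]
flag_seen_at_first_block = lexer.error_flag
```

The errors list, by contrast, records the line of the block that actually
failed, which is the information the flag throws away.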

Generally, the lexer and parser should share the same error accumulation
code: the lexer adds any errors and warnings, then the parser runs and
adds its own, and then, if none of the errors are fatal, the tree walker
or whatever comes next runs and adds its own. When the run completes,
successfully or not, you iterate over the error and warning messages and
print them out.
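
A minimal sketch of that arrangement, in Python rather than any ANTLR target.
Every name here (Messages, lex, parse, walk, run, report) is hypothetical, and
the "phases" are stand-ins that only do enough work to produce messages.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    phase: str
    line: int
    text: str
    fatal: bool = False


@dataclass
class Messages:
    """One accumulator shared by every phase."""
    items: list = field(default_factory=list)

    def add(self, phase, line, text, fatal=False):
        self.items.append(Message(phase, line, text, fatal))

    def has_fatal(self):
        return any(m.fatal for m in self.items)


def lex(lines, messages):
    # Stand-in lexer: an opening brace without a closing one is fatal.
    for number, line in enumerate(lines, start=1):
        if "{" in line and "}" not in line:
            messages.add("lexer", number, "missing closing brace", fatal=True)
    return lines


def parse(tokens, messages):
    # Stand-in parser: runs even if the lexer reported errors.
    for number, token in enumerate(tokens, start=1):
        if not token.startswith("BLOCK"):
            messages.add("parser", number, "expected BLOCK")
    return tokens


def walk(tree, messages):
    # Stand-in tree walker: only adds a warning for empty blocks.
    for number, node in enumerate(tree, start=1):
        if node.strip() == "BLOCK {}":
            messages.add("walker", number, "empty block")


def run(lines):
    messages = Messages()
    tree = parse(lex(lines, messages), messages)
    walked = False
    if not messages.has_fatal():
        walk(tree, messages)
        walked = True
    return messages, walked


def report(messages):
    return [f"{m.phase}: line {m.line}: {m.text}" for m in messages.items]
```

Calling run() on a list of input lines and printing each entry of report()
gives lexer and parser messages together, in the order they were raised.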

To repeat, you cannot reliably signal the parser from the lexer - this is
an age old problem with recognizers.

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Burton Samograd
> Sent: Friday, July 20, 2012 10:29 AM
> To: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Reading block of arbitrary text delimited
> by curly braces
>
> I've gotten a bit further, but again I've hit a brick wall. Following
> your advice below I now have:
>
> BLOCK
>         : 'BLOCK' ((' '|'\t'|'\r'|'\n')+)=>' '
>         ( ('{')=>'{' | { errorFlag = true; } )
>         ( options { greedy = false; } : (~'}')*) '}'
>         ;
>
> In my parser, my grammar now looks like:
>
> job : JOB ID LCURLY BLOCK {
>     if (errorFlag) {
>         throw Exception;
>     }
> } RCURLY
>
> This technique works fine with a single JOB block, but with multiple
> JOB blocks it looks like the whole file is being slurped up by the
> lexer prior to being passed to the parser, causing my errorFlag to be
> set due to an invalid BLOCK at the end of the file before it even gets
> to the JOB parsing rule for the first job block.
>
> I'm used to bison where tokens are read one at a time as needed; does
> antlr not work this way?  Is throwing the exception directly from the
> lexer as it happens the only way around this? I am throwing the
> exception from the parser so I can give a better error message that
> includes the job name and line number of the BLOCK; is the current line
> number available from the lexer?
>
> --
> Burton Samograd
>
> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Jim Idle
> Sent: Wednesday, July 18, 2012 1:25 PM
> Cc: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Reading block of arbitrary text delimited
> by curly braces
>
> No, it is just saying that the next part of the rule can eat that too,
> but it will do the right thing.
>
> You can lose the warning:
>
>             (
>                 ('{')=>'{'
>               | { error("Missing opening brace for BLOCK"); }
>             )
>
>
> And you can do that with any other warnings in the rule.
>
> I use this technique all the time.
>
> Jim
>
>
> > -----Original Message-----
> > From: Burton Samograd [mailto:burton.samograd at markit.com]
> > Sent: Wednesday, July 18, 2012 11:44 AM
> > To: Jim Idle
> > Cc: antlr-interest at antlr.org
> > Subject: RE: [antlr-interest] Reading block of arbitrary text
> > delimited by curly braces
> >
> > Good idea but giving the ( '{' | ... ) alternative gives me multiple
> > alternative warnings/errors, possibly because we already have LCURLY
> > defined as a lexer token:
> >
> > warning(200): SDL.g:869:35: Decision can match input such as "'{'"
> > using multiple alternatives: 1, 2
> >
> > --
> > Burton Samograd
> >
> > -----Original Message-----
> > From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> > bounces at antlr.org] On Behalf Of Jim Idle
> > Sent: Wednesday, July 18, 2012 11:34 AM
> > Cc: antlr-interest at antlr.org
> > Subject: Re: [antlr-interest] Reading block of arbitrary text
> > delimited by curly braces
> >
> > You will have to handle this in the lexer - you are trying to perform
> > syntax driven lexing and this requires context and communication
> > between the parser and the lexer and is either not going to work at
> > all, or will fail in seemingly strange ways.
> >
> >
> > BLOCK: 'BLOCK'
> >        (
> >            (
> >                '{'
> >              | { error("Missing opening brace for BLOCK"); }
> >            )
> >
> > { /* Could set token start here */ }
> >
> >               (~'}')*
> >
> > { /* Could set token end here by calling emit(); */ }
> >
> >                  (   '}'  // Good
> >                    | { error("Missing closing brace"); }
> >                  )
> >        )
> > ;
> >
> > You might need to tweak the above for your needs, but you are not
> > going to make this work correctly from the parser. You could fake
> > lexer states so that you get more than one token in the stream, but
> > your errors are so simple, that I personally would not bother.
> >
> > Jim
> >
> > > -----Original Message-----
> > > From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> > > bounces at antlr.org] On Behalf Of Burton Samograd
> > > Sent: Wednesday, July 18, 2012 9:50 AM
> > > To: Stephen Siegel
> > > Cc: antlr-interest at antlr.org
> > > Subject: Re: [antlr-interest] Reading block of arbitrary text
> > > delimited by curly braces
> > >
> > > To clarify why pulling in the block as a whole token was not ideal:
> > > we did have it working that way, but an issue was presented where we
> > > would like to give a better error message when the curlies are
> > > forgotten.  Initially I tried to create another block matching rule
> > > that started with 'BLOCK' and searched for any character that was
> > > not a {, and used that in an alternate match rule, but it caused
> > > issues in other parts of the parser which made little sense.  This
> > > is why I am looking to break the block rule out of its single lexer
> > > token implementation if it's possible.
> > >
> > > --
> > > Burton Samograd
> > >
> > > -----Original Message-----
> > > From: Stephen Siegel [mailto:siegel at udel.edu]
> > > Sent: Wednesday, July 18, 2012 10:15 AM
> > > To: Burton Samograd
> > > Cc: antlr-interest at antlr.org
> > > Subject: Re: [antlr-interest] Reading block of arbitrary text
> > > delimited by curly braces
> > >
> > > Yeah, but maybe it can't work.  I think the fundamental problem is
> > > that what you consider to be a token depends on the state of the
> > > parser, so some communication has to take place from the parser to
> > > the lexer, which is weird.  It makes more sense to make the whole
> > > "BLOCK {...}" one token, as Mike wrote.  Here is a complete grammar
> > > which I ran on some examples and works fine:
> > >
> > > grammar exp;
> > >
> > > file    :       BLOCK* EOF;
> > >
> > > BLOCK   :       'BLOCK' WS* LCURLY ( options {greedy=false;} : . )*
> > > RCURLY
> > >         ;
> > >
> > > LCURLY  :       '{';
> > > RCURLY  :       '}';
> > >
> > > WS  :  (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN;}
> > >     ;
> > >
> > >
> > > The "BLOCK {" and "}" do appear in the token text but there is
> > > probably some way to get rid of them.
> > >
> > > On Jul 18, 2012, at 10:55 AM, Burton Samograd wrote:
> > >
> > > > Is this what you are suggesting?
> > > >
> > > > // Global
> > > > bool inBlockData = false;
> > > >
> > > > // Parser
> > > > block
> > > >    : BLOCK LCURLY { inBlockData = true; }  BLOCK_DATA RCURLY
> > > >        { inBlockData = false; }
> > > >        -> ^(BLOCK BLOCK_DATA)
> > > >    ;
> > > >
> > > > // Lexer
> > > > BLOCK : 'BLOCK' ;
> > > > BLOCK_DATA : { inBlockData }?=> (~'}')* ;
> > > >
> > > > Using this technique gets me a bit further, but I am getting a
> > > > recognition exception when I hit the BLOCK_DATA, as if it is still
> > > > going through my original lexer/parser and not collecting the
> > > > BLOCK_DATA like I would like it to.
> > > >
> > > > I did some reading on semantic predicates, but the literature just
> > > > gave examples for parser rules, so I am not sure if I applied the
> > > > concept to lexer rules properly.
> > > >
> > > > --
> > > > Burton Samograd
> > > >
> > > > -----Original Message-----
> > > > From: Stephen Siegel [mailto:siegel at udel.edu]
> > > > Sent: Tuesday, July 17, 2012 6:35 PM
> > > > To: Burton Samograd
> > > > Cc: antlr-interest at antlr.org
> > > > Subject: Re: [antlr-interest] Reading block of arbitrary text
> > > > delimited by curly braces
> > > >
> > > > You could use a boolean variable added to the lexer, "inBlock".
> > > > Initially it is false.  Set it to true just after matching the
> > > > LCURLY and back to false after matching RCURLY in the block rule.
> > > > Then you could define the BLOCK_DATA token using inBlock as a guard
> > > > (I think that's called a "semantic predicate").  BLOCK_DATA should
> > > > match anything EXCEPT RCURLY (I'm assuming you don't want to allow
> > > > RCURLY in the block data, or how would you know when the block
> > > > ends? -- just like a comment in C, for example.)
> > > > -Steve
> > > >
> > > > On Jul 17, 2012, at 3:57 PM, Burton Samograd wrote:
> > > >
> > > >> Hello,
> > > >>
> > > > >> We have a requirement where we need to include a block of
> > > > >> arbitrary text in a block, like so:
> > > >>
> > > >> BLOCK { some arbitrary text here }
> > > >>
> > > > >> We initially got around this by making the whole block a token,
> > > > >> like:
> > > >>
> > > > >> BLOCK : 'BLOCK' (' '|'\t'|'\r'|'\n')* '{' (~'}')*  '}' ;
> > > >>
> > > > >> but this is less than ideal.  I am thinking that we would use
> > > > >> something like:
> > > >>
> > > > >> block : BLOCK LCURLY BLOCK_DATA RCURLY
> > > >>
> > > >> with BLOCK : 'BLOCK' and LCURLY/RCURLY as { and }.
> > > >>
> > > > >> I am stuck on specifying BLOCK_DATA which is basically .* to the
> > > > >> lexer.  I have attempted to access the input stream from the
> > > > >> parser RECOGNIZER but have not been able to come up with a
> > > > >> solution.
> > > > >>
> > > > >> I am looking to basically hijack the input stream after seeing a
> > > > >> BLOCK token so I can read the arbitrary text; I can parse out the
> > > > >> { } as needed.
> > > >>
> > > >> Is this possible?
> > > >>
> > > >> --
> > > >> Burton Samograd
> > > >>
> > > >> ________________________________
> > > > >> This e-mail, including accompanying communications and
> > > > >> attachments, is strictly confidential and only for the intended
> > > > >> recipient. Any retention, use or disclosure not expressly
> > > > >> authorised by Markit is prohibited. This email is subject to all
> > > > >> waivers and other terms at the following link:
> > > > >> http://www.markit.com/en/about/legal/email-disclaimer.page
> > > > >>
> > > > >> Please visit http://www.markit.com/en/about/contact/contact-us.page?
> > > > >> for contact information on our offices worldwide.
> > > > >>
> > > > >> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> > > > >> Unsubscribe:
> > > > >> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
> > > >
> > > >
> > > > This e-mail, including accompanying communications and
> > > > attachments,
> > > is
> > > > strictly confidential and only for the intended recipient. Any
> > > > retention, use or disclosure not expressly authorised by Markit
> is
> > > > prohibited. This email is subject to all waivers and other terms
> > > > at the following link:
> > > > http://www.markit.com/en/about/legal/email-disclaimer.page
> > > >
> > > > Please visit http://www.markit.com/en/about/contact/contact-
> > us.page?
> > > for contact information on our offices worldwide.
> > >
> > >
> > > This e-mail, including accompanying communications and attachments,
> > is
> > > strictly confidential and only for the intended recipient. Any
> > > retention, use or disclosure not expressly authorised by Markit is
> > > prohibited. This email is subject to all waivers and other terms at
> > > the following link: http://www.markit.com/en/about/legal/email-
> > > disclaimer.page
> > >
> > > Please visit http://www.markit.com/en/about/contact/contact-
> us.page?
> > > for contact information on our offices worldwide.
> > >
> > > List: http://www.antlr.org/mailman/listinfo/antlr-interest
> > > Unsubscribe: http://www.antlr.org/mailman/options/antlr-
> > interest/your-
> > > email-address
> >
> > List: http://www.antlr.org/mailman/listinfo/antlr-interest
> > Unsubscribe: http://www.antlr.org/mailman/options/antlr-
> interest/your-
> > email-address
> >
> > This e-mail, including accompanying communications and attachments,
> is
> > strictly confidential and only for the intended recipient. Any
> > retention, use or disclosure not expressly authorised by Markit is
> > prohibited. This email is subject to all waivers and other terms at
> > the following link: http://www.markit.com/en/about/legal/email-
> > disclaimer.page
> >
> > Please visit http://www.markit.com/en/about/contact/contact-us.page?
> > for contact information on our offices worldwide.
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
> email-address
>
> This e-mail, including accompanying communications and attachments, is
> strictly confidential and only for the intended recipient. Any
> retention, use or disclosure not expressly authorised by Markit is
> prohibited. This email is subject to all waivers and other terms at the
> following link: http://www.markit.com/en/about/legal/email-
> disclaimer.page
>
> Please visit http://www.markit.com/en/about/contact/contact-us.page?
> for contact information on our offices worldwide.
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
> email-address