[antlr-interest] Reading block of arbitrary text delimited by curly braces

Burton Samograd burton.samograd at markit.com
Wed Jul 18 09:49:39 PDT 2012


To clarify why pulling in the block as a whole token was not ideal: we did have it working that way, but we would like to give a better error message when the curlies are forgotten.  Initially I tried adding another block-matching rule that started with 'BLOCK' and then matched any character that was not a {, and used it as an alternative in the match rule, but that caused problems in other parts of the parser that made little sense.  This is why I am looking to break the block rule out of its single-lexer-token implementation, if that is possible.
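
A rough sketch of the kind of check described above, written in plain Python outside ANTLR so it can be run on its own: after the BLOCK keyword it skips whitespace and reports a problem if the next character is not '{', or if no '}' follows. The names KEYWORD and check_blocks and the message wording are hypothetical, made up for this illustration; it is not the grammar from this thread and says nothing about how ANTLR itself would report the error.

```python
import re

# 'BLOCK' followed by the same whitespace set as the WS rule in the thread.
KEYWORD = re.compile(r"\bBLOCK\b[ \r\t\f\n]*")

def check_blocks(text):
    """Return (line_number, message) pairs for BLOCKs with missing curlies."""
    errors = []
    pos = 0
    while True:
        m = KEYWORD.search(text, pos)
        if m is None:
            return errors
        # 1-based line of the BLOCK keyword, for the error message.
        line = text.count("\n", 0, m.start()) + 1
        brace = m.end()
        if brace >= len(text) or text[brace] != "{":
            errors.append((line, "expected '{' after BLOCK"))
            pos = brace  # resume scanning right after the keyword
            continue
        close = text.find("}", brace + 1)
        if close == -1:
            errors.append((line, "missing '}' to close BLOCK"))
            return errors
        pos = close + 1  # skip the body so a BLOCK inside it is ignored
```

For example, check_blocks("BLOCK some text") returns [(1, "expected '{' after BLOCK")], while check_blocks("BLOCK { ok }") returns an empty list.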

--
Burton Samograd

-----Original Message-----
From: Stephen Siegel [mailto:siegel at udel.edu]
Sent: Wednesday, July 18, 2012 10:15 AM
To: Burton Samograd
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Reading block of arbitrary text delimited by curly braces

Yeah, but maybe it can't work.  I think the fundamental problem is that what you consider to be a token depends on the state of the parser, so some communication has to take place from the parser to the lexer, which is weird.  It makes more sense to make the whole "BLOCK {...}" one token, as Mike wrote.  Here is a complete grammar, which I ran on some examples and which works fine:

grammar exp;

file    :       BLOCK* EOF;

BLOCK   :       'BLOCK' WS* LCURLY ( options {greedy=false;} : . )* RCURLY
        ;

LCURLY  :       '{';
RCURLY  :       '}';

WS  :  (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN;}
    ;


The "BLOCK {" and "}" do appear in the token text but there is probably some way to get rid of them.
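
One way to see what dropping the delimiters amounts to, sketched in plain Python rather than in the grammar: the regular expression below mirrors the BLOCK rule above (keyword, optional whitespace, '{', non-greedy body, '}') and captures only the body. The names BLOCK_RE and block_bodies are hypothetical, introduced just for this illustration; inside an ANTLR lexer the equivalent trimming would presumably live in an action on the BLOCK rule, which is not shown here.

```python
import re

# Mirrors the lexer rule: 'BLOCK' WS* '{' (non-greedy anything) '}'.
# DOTALL lets the body span several lines, as '.' does in the ANTLR rule.
BLOCK_RE = re.compile(r"BLOCK[ \r\t\f\n]*\{(.*?)\}", re.DOTALL)

def block_bodies(text):
    """Return the text between the braces of every BLOCK, delimiters removed."""
    return [m.group(1) for m in BLOCK_RE.finditer(text)]
```

For example, block_bodies("BLOCK { some arbitrary text here }") returns [' some arbitrary text here ']. Whitespace inside the braces is kept on purpose, since the body is meant to be arbitrary text.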

On Jul 18, 2012, at 10:55 AM, Burton Samograd wrote:

> Is this what you are suggesting?
>
> // Global
> bool inBlockData = false;
>
> // Parser
> block
>    : BLOCK LCURLY { inBlockData = true; }  BLOCK_DATA RCURLY { inBlockData = false; }
>        -> ^(BLOCK BLOCK_DATA)
>    ;
>
> // Lexer
> BLOCK : 'BLOCK' ;
> BLOCK_DATA : { inBlockData }?=> (~'}')* ;
>
> Using this technique gets me a bit further, but I am getting a
> recognition exception when I hit the BLOCK_DATA like it is still going
> through my original lexer/parser and not collecting the BLOCK_DATA
> like I would like it to.
>
> I did some reading on semantic predicates, but the literature only gave
> examples for parser rules, so I am not sure whether I applied the concept to lexer rules properly.
>
> --
> Burton Samograd
>
> -----Original Message-----
> From: Stephen Siegel [mailto:siegel at udel.edu]
> Sent: Tuesday, July 17, 2012 6:35 PM
> To: Burton Samograd
> Cc: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Reading block of arbitrary text
> delimited by curly braces
>
> You could use a boolean variable added to the lexer, "inBlock".  Initially it is false.  Set it to true just after matching the LCURLY and back to false after matching RCURLY in the block rule.   Then you could define the BLOCK_DATA token using inBlock as a guard (I think that's called a "semantic predicate").  BLOCK_DATA should match anything EXCEPT RCURLY (I'm assuming you don't want to allow RCURLY in the block data, or how would you know when the block ends?  It is just like a comment in C, for example).
> -Steve
>
> On Jul 17, 2012, at 3:57 PM, Burton Samograd wrote:
>
>> Hello,
>>
>> We have a requirement where we need to include a block of arbitrary text in a block, like so:
>>
>> BLOCK { some arbitrary text here }
>>
>> We initially got around this by making the whole block a token, like:
>>
>> BLOCK : 'BLOCK' (' '|'\t'|'\r'|'\n')* '{' (~'}')*  '}' ;
>>
>> but this is less than ideal.  I am thinking that we would use something like:
>>
>> block : BLOCK LCURLY BLOCK_DATA RCURLY
>>
>> with BLOCK : 'BLOCK' and LCURLY/RCURLY as { and }.
>>
>> I am stuck on specifying BLOCK_DATA which is basically .* to the lexer.  I have attempted to access the input stream from the parser RECOGNIZER but have not been able to come up with a solution.
>>
>> I am looking to basically hijack the input stream after seeing a BLOCK token so I can read the arbitrary text; I can parse out the  { } as needed.
>>
>> Is this possible?
>>
>> --
>> Burton Samograd
>>
>> ________________________________
>> This e-mail, including accompanying communications and attachments,
>> is strictly confidential and only for the intended recipient. Any
>> retention, use or disclosure not expressly authorised by Markit is
>> prohibited. This email is subject to all waivers and other terms at
>> the following link:
>> http://www.markit.com/en/about/legal/email-disclaimer.page
>>
>> Please visit http://www.markit.com/en/about/contact/contact-us.page? for contact information on our offices worldwide.
>>
>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>
>



