[antlr-interest] antlr-interest Digest, Vol 79, Issue 19

Sat Jun 18 13:47:26 PDT 2011

Thinking out loud, I suppose you could write more than one ANTLR grammar file: one for the high-level structure that separates the file into, for example, Title and Body, then one grammar for the title section and another for the body. You would only have to write a simple class to integrate them. I actually think it would be nice if ANTLR supported that out of the box; it would simplify a lot of things. I also think it would not be hard to implement a generic solution for this kind of approach.
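
To make the idea a bit more concrete, here is a minimal sketch of such an integrating class in plain Java. It is only an illustration under assumptions: SectionDispatcher is a hypothetical name, the two Function arguments stand in for separately generated parsers (for instance one ANTLR grammar each), and the high-level split simply treats everything up to the first blank line as the title region.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical glue class: splits a file into regions and hands each to its own sub-parser. */
class SectionDispatcher {
    // Stand-ins for separately generated parsers (assumption: one grammar per region).
    private final Function<String, String> titleParser;
    private final Function<String, String> bodyParser;

    SectionDispatcher(Function<String, String> titleParser,
                      Function<String, String> bodyParser) {
        this.titleParser = titleParser;
        this.bodyParser = bodyParser;
    }

    /** High-level pass: lines up to the first blank line form the title, the rest the body. */
    List<String> parse(String source) {
        StringBuilder title = new StringBuilder();
        StringBuilder body = new StringBuilder();
        boolean inTitle = true;
        for (String line : source.split("\\r?\\n")) {
            if (inTitle && line.trim().isEmpty()) {
                inTitle = false; // the blank line ends the title region and is dropped
                continue;
            }
            (inTitle ? title : body).append(line).append('\n');
        }
        List<String> results = new ArrayList<>();
        results.add(titleParser.apply(title.toString()));
        results.add(bodyParser.apply(body.toString()));
        return results;
    }

    public static void main(String[] args) {
        SectionDispatcher d = new SectionDispatcher(t -> "title:" + t, b -> "body:" + b);
        System.out.println(d.parse("TITLE\nsome title text\n\nSECTION1\n a=b*c\nEND\n"));
    }
}
```

Each region reaches its sub-parser untouched, so whitespace that is significant in the title is not affected by whatever rules the body grammar uses.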

________________________________
From: "antlr-interest-request at antlr.org" <antlr-interest-request at antlr.org>
To: antlr-interest at antlr.org
Sent: Saturday, June 18, 2011 12:00 PM
Subject: antlr-interest Digest, Vol 79, Issue 19

Today's Topics:

   1. Re: Context-sensitive lexer (Jonas)
   2. Re: Context-sensitive lexer (Bart Kiers)
   3. Re: Context-sensitive lexer (John B. Brodie)
   4. Re: Question:  ANTLR and LLVM  ...  + Clang (Ruslan Zasukhin)
   5. Re: Question: ANTLR and LLVM ... + Clang (Douglas Godfrey)
   6. Re: Question: ANTLR and LLVM ... + Clang (Ruslan Zasukhin)

----------------------------------------------------------------------

Message: 1
Date: Fri, 17 Jun 2011 23:09:18 +0200
From: Jonas <jonas.hagmar at gmail.com>
Subject: Re: [antlr-interest] Context-sensitive lexer
To: antlr-interest at antlr.org
Message-ID: <BANLkTimnXHNgDRwYPaSQ8A3i6s9HDMuujA at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Hi Bart,

Thank you for the excellent input on the problem. I hope your approach
can be adapted to overcome all the difficulties coming from the
context sensitivity of the file format I have to deal with. For
example, the title text can be any character sequence, leading to a
definition of your WORD token that I fear might clash with patterns
needed to pick out identifiers in, e.g., algebraic expressions later
in the file. Moreover, the whitespace in the title text is actually
significant. If the title text is "foo$3        bar__!" (without the
quotes), that is exactly what the user expects to see when using the
program reading the file. In other places, whitespace acts like a list
separator, and in some places it should just be ignored. With your
approach, wouldn't that mean that I have to include the whitespace in
all relevant parser rules, even when it should be ignored?

As an alternative, I am considering using a JFlex lexer, which can
easily handle lexer state, coupled with an ANTLR parser and tree
parser. I have almost figured out how to do that, but to really get it
flying, it would be great to be able to run the ANTLRWorks debugger on
the resulting lexer-parser combination. I have seen some posts saying
that this is possible, but not how to do it. If I don't figure it out
myself, I might post a separate question regarding that.
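
For readers unfamiliar with lexer states, the following hand-written Java sketch shows the general idea. It is neither JFlex nor ANTLR output; ModalLexer, the Mode values and the token spellings are made up for illustration, and it assumes the title text is the single line following TITLE. In the title state the line is kept verbatim, whitespace included; in the section state whitespace only separates words.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written sketch of a lexer with explicit states; all names are hypothetical. */
class ModalLexer {
    enum Mode { START, IN_TITLE, IN_SECTION }

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Mode mode = Mode.START;
        for (String line : source.split("\\r?\\n")) {
            String trimmed = line.trim();
            switch (mode) {
                case START:
                    if (trimmed.equals("TITLE")) {
                        tokens.add("BEGIN_TITLE");
                        mode = Mode.IN_TITLE;
                    } else if (trimmed.startsWith("SECTION")) {
                        tokens.add("BEGIN_SECTION(" + trimmed + ")");
                        mode = Mode.IN_SECTION;
                    }
                    // blank or unrecognized lines between sections are skipped
                    break;
                case IN_TITLE:
                    // whitespace is significant here: keep the line exactly as written
                    tokens.add("TITLE_TEXT(" + line + ")");
                    mode = Mode.START;
                    break;
                case IN_SECTION:
                    if (trimmed.equals("END")) {
                        tokens.add("END_SECTION");
                        mode = Mode.START;
                    } else {
                        // whitespace acts only as a separator here
                        for (String word : trimmed.split("\\s+")) {
                            if (!word.isEmpty()) tokens.add("WORD(" + word + ")");
                        }
                    }
                    break;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("TITLE\nfoo$3        bar__!\n\nSECTION1\n a  b\nEND\n"));
    }
}
```

The same character sequence therefore produces different tokens depending on the current state, which is the behavior the file format seems to require.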

Best Regards,
Jonas

On Fri, Jun 17, 2011 at 8:37 PM, Bart Kiers <bkiers at gmail.com> wrote:
> Hi Jonas,
> I would not put so much responsibility inside the lexer. This is really the
> task of the parser.
> How about something like this:
>
> grammar test;
> options {
>   output=AST;
> }
> tokens {
>   FILE;
>   SECTIONS;
>   LINE;
> }
> parse
>   :  title (section NL)+ EOF -> ^(FILE title ^(SECTIONS section+))
>   ;
> title
>   :  TITLE NL (anyWord+ NL)+ NL -> ^(TITLE anyWord+)
>   ;
> section
>   :  SECTION NL (anyWordExceptEnd+ NL)+ END NL -> ^(SECTION anyWordExceptEnd+)
>   ;
>
> anyWordExceptEnd
>   :  WORD
>   |  SECTION
>   |  TITLE
>   ;
> anyWord
>   :  anyWordExceptEnd
>   |  END
>   ;
>
> SECTION
>   :  'SECTION' '0'..'9'+
>   ;
> END
>   :  'END'
>   ;
> TITLE
>   :  'TITLE'
>   ;
> WORD
>   :  ('a'..'z' | 'A'..'Z')+
>   ;
>
> NL
>   :  '\r'? '\n'
>   |  '\r'
>   ;
>
> SPACE
>   :  (' ' | '\t') {$channel=HIDDEN;}
>   ;
>
> A small test class:
>
> import org.antlr.runtime.*;
> import org.antlr.runtime.tree.*;
> import org.antlr.stringtemplate.*;
> public class Main {
>   public static void main(String[] args) throws Exception {
>     String source =
>         "TITLE            \n" +
>         "some             \n" +
>         "title            \n" +
>         "text             \n" +
>         "                 \n" +
>         "SECTION1         \n" +
>         " a b             \n" +
>         " c               \n" +
>         "END              \n" +
>         "                 \n" +
>         "SECTION2         \n" +
>         "  SECTION2 text  \n" +
>         "END              \n" +
>         "                 \n" +
>         "SECTION3         \n" +
>         "  more text      \n" +
>         "END              \n" +
>         "\n";
>     testLexer lexer = new testLexer(new ANTLRStringStream(source));
>     testParser parser = new testParser(new CommonTokenStream(lexer));
>     CommonTree tree = (CommonTree)parser.parse().getTree();
>     DOTTreeGenerator gen = new DOTTreeGenerator();
>     StringTemplate st = gen.toDOT(tree);
>     System.out.println(st);
>   }
> }
>
> will produce the AST attached to this message.
> Regards,
> Bart.
>
>
> On Fri, Jun 17, 2011 at 2:15 PM, Jonas <jonas.hagmar at gmail.com> wrote:
>>
>> Hi,
>>
>> I'm developing a parser for a file format where context is very
>> important. I'm looking to
>> 1) understand why my ANTLR parser gets into infinite loops
>> 2) find out if there is any better way to implement context
>> sensitivity than what I am doing with semantic predicates.
>>
>> A typical beginning of a file looks like this:
>> TITLE
>> some title text
>>
>> SECTION1
>>  a=b*c
>> END
>>
>> SECTION2
>> ...
>>
>> SECTION3
>> ...
>>
>> The syntax differs from section to section; the 'TITLE' section is
>> terminated by the newline after the title text line, while other
>> sections can e.g. use single quote string literals and be terminated
>> by a keyword like 'END'. Here is a sample grammar, that gets into an
>> infinite loop:
>>
>> grammar test;
>>
>> options {
>>   output=AST;
>> }
>>
>> @lexer::members {
>>   static final int STATE_AT_BEGINNING = 0;
>>   static final int STATE_IN_TITLE = 1;
>>   static final int STATE_AFTER_TITLE = 2;
>>   int lexerState = STATE_AT_BEGINNING;
>> }
>>
>> file  :       title;
>>
>> title :       BEGIN_TITLE TITLE_TEXT END_TITLE;
>>
>> BEGIN_TITLE
>>       : {(lexerState == STATE_AT_BEGINNING)}? 'TITLE' WS_NL
>> {lexerState=STATE_IN_TITLE;}
>>       ;
>>
>> TITLE_TEXT
>>       : {lexerState == STATE_IN_TITLE}? TEXT
>>       ;
>>
>> END_TITLE
>>       : {lexerState == STATE_IN_TITLE}? NL
>> {lexerState=STATE_AFTER_TITLE;}
>>       ;
>>
>> BLANK_ROW
>>       : {!(lexerState == STATE_IN_TITLE)}? WS_NL
>>       ;
>>
>> REMARK        : {!(lexerState == STATE_IN_TITLE)}? 'REMA' .* NL
>>       ;
>>
>> fragment
>> WS_NL :       (' ' | '\t')* NL;
>>
>> fragment
>> NL    :       '\r'? '\n';
>>
>> fragment
>> TEXT  :       (~('\r' | '\n'))*;
>>
>> Best Regards,
>> Jonas
>>

------------------------------

Message: 2
Date: Fri, 17 Jun 2011 23:20:34 +0200
From: Bart Kiers <bkiers at gmail.com>
Subject: Re: [antlr-interest] Context-sensitive lexer
To: Jonas <jonas.hagmar at gmail.com>
Cc: antlr-interest at antlr.org
Message-ID: <BANLkTikpj46J9bcM8dT6j-R9snaLd8foRA at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Hi Jonas,

On Fri, Jun 17, 2011 at 11:09 PM, Jonas <jonas.hagmar at gmail.com> wrote:

> Hi Bart,
>
> Thank you for the excellent input on the problem. I hope your approach
> can be adapted to overcome all the difficulties coming from the
> context sensitivity of the file format I have to deal with. For
> example, the title text can be any character sequence, leading to a
> definition of your WORD token that I fear might clash with patterns
> needed to pick out identifiers in, e.g., algebraic expressions later
> in the file. Moreover, the whitespace in the title text is actually
> significant. If the title text is "foo$3        bar__!" (without the
> quotes), that is exactly what the user expects to see when using the
> program reading the file. In other places, whitespace acts like a list
> separator, and in some places it should just be ignored. With your
> approach, wouldn't that mean that I have to include the whitespace in
> all relevant parser rules, even when it should be ignored?

I'm not sure what you mean by all that, sorry. My post was meant more to
emphasize my point of _not_ doing so much inside the lexer.

Perhaps you'd like to post a more detailed explanation of the language
you're trying to parse?

Regards,

Bart.

------------------------------

Message: 3
Date: Fri, 17 Jun 2011 17:28:00 -0400
From: "John B. Brodie" <jbb at acm.org>
Subject: Re: [antlr-interest] Context-sensitive lexer
To: Jonas <jonas.hagmar at gmail.com>
Cc: antlr-interest at antlr.org
Message-ID: <1308346080.20408.4.camel at gecko>
Content-Type: text/plain; charset="UTF-8"

On Fri, 2011-06-17 at 15:29 +0200, Jonas wrote:
> Hi John!
> 
> I believed that using the semantic predicate would hinder ANTLR from
> trying to match TITLE_TEXT in other situations than when lexerState
> indicates that we are in the title section. Anyway, changing the TEXT
> fragment to (~('\r' | '\n'))+ does not prevent the infinite loop. Keep
> the good ideas coming!

When I run your example from the command line I get this message printed
to the console continuously...

line 4:0 rule TITLE_TEXT failed predicate: {lexerState==STATE_IN_TITLE}?

Perhaps predicates in the lexer do not actually behave as you are
expecting? (Look at the generated lexer code...)
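
One possible reading of the repeated message, offered as an assumption rather than a description of the code ANTLR actually generates: if the lexer picks a rule from lookahead alone and only checks the predicate afterwards, a failed predicate that consumes no input leaves the lexer stuck at the same position. The plain-Java sketch below models that situation. PredicateLoopDemo, run and the gated flag are hypothetical names, and the step cap exists only so the demonstration terminates.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a lexer loop; not ANTLR code, all names are hypothetical. */
class PredicateLoopDemo {
    static final int IN_TITLE = 1;
    static final int AFTER_TITLE = 2;

    /**
     * Tries to match a TITLE_TEXT-like rule (any run of non-newline characters)
     * guarded by the predicate state == IN_TITLE.
     * gated == false: the rule is chosen from lookahead, the predicate is checked afterwards.
     * gated == true:  the predicate takes part in choosing the rule.
     */
    static List<String> run(String input, int state, boolean gated, int maxSteps) {
        List<String> log = new ArrayList<>();
        int pos = 0;
        for (int step = 0; step < maxSteps && pos < input.length(); step++) {
            char c = input.charAt(pos);
            boolean looksLikeText = c != '\r' && c != '\n';
            boolean predicate = state == IN_TITLE;
            if (looksLikeText && (!gated || predicate)) {
                if (!predicate) {
                    // error reported, but no character consumed: pos stays the same
                    log.add("failed predicate at " + pos);
                    continue;
                }
                int end = pos;
                while (end < input.length()
                        && input.charAt(end) != '\r' && input.charAt(end) != '\n') {
                    end++;
                }
                log.add("TITLE_TEXT[" + input.substring(pos, end) + "]");
                pos = end;
            } else {
                // no rule applies: skip one character so the loop makes progress
                log.add("no viable rule at " + pos);
                pos++;
            }
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run("SECTION1", AFTER_TITLE, false, 5));
        System.out.println(run("SECTION1", AFTER_TITLE, true, 5));
    }
}
```

With gated set to false and the state AFTER_TITLE, every step logs a failed predicate at position 0. With gated set to true the rule is never chosen and the loop advances one character at a time instead.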

> 
> Best Regards,
> Jonas
> 
> On Fri, Jun 17, 2011 at 3:06 PM, John B. Brodie <jbb at acm.org> wrote:
> > Greetings!
> >
> > Your TEXT fragment (and therefore your TITLE_TEXT token) can be empty!
> >
> > Thus, I think your lexer is trying to recognize infinitely many
> > TITLE_TEXT tokens.
> >
> > Hope this helps...
> >   -jbb

------------------------------

Message: 4
Date: Sat, 18 Jun 2011 12:19:50 +0300
From: Ruslan Zasukhin <ruslan_zasukhin at valentina-db.com>
Subject: Re: [antlr-interest] Question:  ANTLR and LLVM  ...  + Clang
To: "antlr-interest at antlr.org" <antlr-interest at antlr.org>
Message-ID: <CA224866.ECD41%ruslan_zasukhin at valentina-db.com>
Content-Type: text/plain;    charset="US-ASCII"

On 6/17/11 8:22 PM, "Kevin J. Cummings" <cummings at kjchome.homeip.net> wrote:

Hi Kevin,

Well, I don't know why you think they cannot be compared.

ANTLR is Parser -> AST -> TreeParser.

Clang
    also contains a parser (its own, apparently hand-written),
    followed by further logical phases.

This page explains very well how much more a C++ front end is
than a parser:

   http://www.semanticdesigns.com/Products/FrontEnds/CppFrontEnd.html

So again, if our task is to process C++ sources, we can choose between:

1)  ANTLR, developing or reusing some C++ grammar,
     then spending time on (all/some of) the features described on the page above

2) taking a complete C++ front end and ...DONE?
    For now I see two sufficiently strong front ends of this kind:
    Clang and Semantic Designs (which it seems I cannot try as a demo).

=============
> ANTLR is a tool which can help you build compiler front-ends.  If you
> were industrious enough, you could re-write CLang using ANTLR.
> 
> ANTLR is primarily a JAVA tool (you at least need JAVA to run the tool
> to compile your grammar), but can be used to produce other targeted
> languages (C/C++, Python, etc) for your actual front-end.  While the C++
> support is minimal in version 3 (better in version 2.7, but lacking in
> some of the ST support) resulting in much use of C code which can be
> compiled using C++, you could use it to interface directly to the LLVM
> IR API if you wanted to.  But, I think Ter's example is probably the way
> to go, at least until Version 4 starts to grow and we see what kind of
> C++ runtime support will exist for ANTLR v4.
> 
>> When one should prefer Clang vs ANTLR or reverse?
>> Your opinions?
> 
> I think you are asking the wrong question here.  Please compare apples
> to apples, and not to cucumbers.

-- 
Best regards,

Ruslan Zasukhin
VP Engineering and New Technology
Paradigma Software, Inc

Valentina - Joining Worlds of Information
http://www.paradigmasoft.com

[I feel the need: the need for speed]

------------------------------

Message: 5
Date: Sat, 18 Jun 2011 12:26:00 -0400
From: Douglas Godfrey <douglasgodfrey at gmail.com>
Subject: Re: [antlr-interest] Question: ANTLR and LLVM ... + Clang
To: Ruslan Zasukhin <ruslan_zasukhin at valentina-db.com>
Cc: "antlr-interest at antlr.org" <antlr-interest at antlr.org>
Message-ID: <BANLkTi=vw03HhqS6fd_cSXEbjnmj208cNA at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

The Semantic Designs C++ front end, like all of their front ends, is intended
for code analysis and transformation, not compiling.

Semantic Designs' tools are based on the old Reasoning Systems Inc. tools:
Refine and Intervista.

Semantic Designs' tools parse a source language into a Symbol Table and AST
with more features than the ANTLR AST. The tools then take the Symbol Table
and AST and either do code analysis or reverse compile the AST into new source
code in the same or a different language. The tools do not interface with a
compiler backend or machine code generator.
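
As a toy illustration of that pipeline (build an AST, derive a symbol table from it, then emit source again in the same or another notation), here is a small Java sketch. It says nothing about how Semantic Designs' tools are actually implemented; Node, Var, BinOp, Assign and AstDemo are hypothetical names, and the prefix notation merely stands in for "a different language".

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy AST for statements like a = b * c; all names are illustrative. */
abstract class Node {
    abstract String toSource();                         // emit in the original infix notation
    abstract String toPrefix();                         // emit in a different (prefix) notation
    abstract void collectSymbols(Set<String> symbols);  // gather identifiers
}

class Var extends Node {
    final String name;
    Var(String name) { this.name = name; }
    String toSource() { return name; }
    String toPrefix() { return name; }
    void collectSymbols(Set<String> symbols) { symbols.add(name); }
}

class BinOp extends Node {
    final char op;
    final Node left, right;
    BinOp(char op, Node left, Node right) { this.op = op; this.left = left; this.right = right; }
    // No parenthesization: sufficient for the single-operator example only.
    String toSource() { return left.toSource() + " " + op + " " + right.toSource(); }
    String toPrefix() { return "(" + op + " " + left.toPrefix() + " " + right.toPrefix() + ")"; }
    void collectSymbols(Set<String> symbols) { left.collectSymbols(symbols); right.collectSymbols(symbols); }
}

class Assign extends Node {
    final Var target;
    final Node value;
    Assign(Var target, Node value) { this.target = target; this.value = value; }
    String toSource() { return target.toSource() + " = " + value.toSource(); }
    String toPrefix() { return "(set " + target.toPrefix() + " " + value.toPrefix() + ")"; }
    void collectSymbols(Set<String> symbols) { target.collectSymbols(symbols); value.collectSymbols(symbols); }
}

class AstDemo {
    /** A very small "symbol table": the identifiers in order of first appearance. */
    static Set<String> symbolTable(Node root) {
        Set<String> symbols = new LinkedHashSet<>();
        root.collectSymbols(symbols);
        return symbols;
    }

    public static void main(String[] args) {
        Node ast = new Assign(new Var("a"), new BinOp('*', new Var("b"), new Var("c")));
        System.out.println(symbolTable(ast));
        System.out.println(ast.toSource());
        System.out.println(ast.toPrefix());
    }
}
```

Nothing in this flow touches a compiler backend: the AST is analyzed and then turned back into text.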

------------------------------

Message: 6
Date: Sat, 18 Jun 2011 21:37:10 +0300
From: Ruslan Zasukhin <ruslan_zasukhin at valentina-db.com>
Subject: Re: [antlr-interest] Question: ANTLR and LLVM ... + Clang
To: "antlr-interest at antlr.org" <antlr-interest at antlr.org>
Message-ID: <CA22CB06.ECD8A%ruslan_zasukhin at valentina-db.com>
Content-Type: text/plain;    charset="US-ASCII"

On 6/18/11 7:26 PM, "Douglas Godfrey" <douglasgodfrey at gmail.com> wrote:

Hi Douglas, 

> The SemanticDesigns C++ frontend, like all of their frontend(s) is intended
> for code analysis and transformation, not compiling.

> Semantic Designs' tools are based on the old Reasoning Systems Inc.
> tools: Refine and Intervista.
> 
> Semantic Designs' tools parse a source language into a Symbol Table and AST
> with more features than the Antlr AST.

> The tools then take the Symbol Table and AST and either do
> code analysis or reverse compile
> the AST into new source code in the same or a different language.
> The tools do not interface with a compiler backend or machine code generator.

I see. 

Thanks.

-- 
Best regards,

Ruslan Zasukhin
VP Engineering and New Technology
Paradigma Software, Inc

Valentina - Joining Worlds of Information
http://www.paradigmasoft.com

[I feel the need: the need for speed]

------------------------------

_______________________________________________
antlr-interest mailing list
antlr-interest at antlr.org
http://www.antlr.org/mailman/listinfo/antlr-interest

End of antlr-interest Digest, Vol 79, Issue 19
**********************************************