[antlr-interest] Extract C Function Definitions Using Parser

Mon Mar 19 16:10:46 PDT 2012

Thanks for the suggestion, Eric!

I've actually been using the text attribute of the function_definition rule
to get the text I need. However, the grammar does not seem to be complete
enough, which is why I have been expanding it a bit. Would creating an AST
be better than this? If so, could you explain why?
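
For comparison, here is a minimal sketch of the start/stop-token idea from
Eric's message quoted below. It is written in Python rather than as
ANTLR-generated Java, and it does not use ANTLR's API: the Token class,
tokenize, and span_text are hypothetical stand-ins, and the start and stop
tokens are picked by hand where a real parser would take them from the
function_definition rule. The point it illustrates is that slicing the
original source between the two tokens also keeps comments and whitespace
the tokenizer skipped.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    start: int  # index of the token's first character in the source
    stop: int   # index of the token's last character (inclusive)
    line: int   # 1-based line number


def tokenize(source):
    """Rough C-like tokenizer (an assumption, not the C.g lexer).

    Whitespace and comments are matched but not stored as tokens.
    """
    pattern = re.compile(r"/\*.*?\*/|//[^\n]*|\s+|[A-Za-z_]\w*|\d+|.", re.S)
    tokens = []
    for m in pattern.finditer(source):
        text = m.group()
        if text.isspace() or text.startswith(("/*", "//")):
            continue
        line = source.count("\n", 0, m.start()) + 1
        tokens.append(Token(text, m.start(), m.end() - 1, line))
    return tokens


def span_text(source, first, last):
    """Return the original text from the first token through the last."""
    return source[first.start:last.stop + 1]


source = (
    "extern int x;\n"
    "int add(int a, int b)\n"
    "{\n"
    "    /* sum */\n"
    "    return a + b;\n"
    "}\n"
)
tokens = tokenize(source)

# Chosen by hand for this sketch: the 'int' that opens the definition and
# the closing '}'. A parser would supply these as the rule's start and stop.
first, last = tokens[4], tokens[-1]

print(span_text(source, first, last))
print("lines", first.line, "to", last.line)
```

The line numbers stored on the two tokens give the quick check Eric
mentions: they can be compared by eye against the original file before
doing anything more elaborate with an AST.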

Also, do you or anyone else have any suggestions about dealing with the
issue I'm experiencing due to pre-processed functions being turned into
extern declarations? I need to pre-process the code from a version of bash
in order for the C grammar to process it. However, the pre-processor
transforms certain functions into extern declarations, which removes the
text I need.

I'm thinking I'll have to use a C pre-processor grammar. I've tried this
one by Youngki KU, which is listed in the grammar list on the ANTLR site:

http://www.antlr.org/grammar/1166665121622/Cpp.tar

Unfortunately, I've only gotten as far as renaming the CppTreeTreeParser
identifiers in the generated code to CppTree. At that point, certain objects
such as Token, RuleReturnScope, arg, etc. that are instantiated in some
functions of CppParser.java cause errors with ANTLR 3.4. It's even worse
when I try ANTLR 2.7.7, which I tried because the pre-processor grammar is
from 2006. In case it helps, I'm using Oracle's Java version 1.6.0_26 on
Ubuntu 11.10.

Does anyone know what I have to do to get this old grammar to work, if
there's no better way to do what I'm trying to do?

Thanks,
Josh

On Mon, Mar 19, 2012 at 5:02 AM, Eric <researcher0x00 at gmail.com> wrote:

> Hi Josh,
>
> Here is what I would try.
>
> The grammar should be creating an AST and the grammar has a
> function_definition rule. I would use the function_definition rule to find
> the start and end tokens making up the function and then if the tokens have
> the start and end line and positions set, I would use those as a quick test
> to see if I am correctly getting the functions as needed. If so, then do a
> more advanced version pruning out the parts of the AST that aren't needed
> in functions and reconstruct the functions from the tokens in the AST.
>
> Some good general advice is:
>
> "A common problem with novices attempting to implement language analysis
> is to believe that their task is simplified by moving sophisticated tasks
> to conceptually simple tasks. They will try to simplify semantic analysis
> by creating a more detailed syntactic analysis and syntactic analysis by
> creating a more detailed lexical analysis. Almost invariably they discover
> that this attempt is fruitless and has to be undone, because it results in
> poor error reporting, runs into conflicts as the implementation becomes
> more complete, duplicates functionality in the later portions of the
> analysis, and is hard to maintain." By William Clodius
>
> In this case, let the parser do what it is best at: making sure the input
> is valid and creating an AST. Don't create a pruned AST with the parser;
> let the full AST pass on to another phase for AST analysis and
> transformations. Let the AST transformations do the work; don't put an
> additional burden on the parser of filtering out the functions.
>
> Hope that helps, Eric
>
> On Sun, Mar 18, 2012 at 11:55 PM, Joshua Garcia <joshuaga at usc.edu> wrote:
>
>> Hi Everyone,
>>
>> I've been working on modifying an ANTLR C grammar so that it produces a
>> parser that simply outputs function definitions it recognizes to different
>> files. I need to do this in order to apply some information retrieval
>> techniques to C source code.
>>
>> Is there a way to get the generated parser to recognize only the function
>> definitions (including the function body) and comments while ignoring
>> everything else? I've found it too troublesome to deal with comments so
>> I've been ignoring them for now.
>>
>> If not, is there a way to get the generated parser to recognize only the
>> function definitions (including the function body) and ignore everything
>> else? I've been able to modify the grammar so that it can recognize a
>> large majority of the functions in pre-processed files of a version of
>> bash. However, the pre-processed files tend to transform some function
>> definition text to extern declarations. Therefore, I lose function
>> definition text that I need. Furthermore, the parser does not ignore
>> everything else that's not part of a function definition, but instead,
>> I've added rules to the grammar in order to recognize as much of the bash
>> version I'm parsing as possible.
>>
>> In particular, I've been trying to use this grammar:
>>
>> http://www.antlr.org/grammar/1153358328744/C.g
>>
>> Thanks,
>> Josh
>>