[antlr-interest] Extract C Function Definitions Using Parser

Eric researcher0x00 at gmail.com
Mon Mar 19 17:34:41 PDT 2012


On Mon, Mar 19, 2012 at 7:10 PM, Joshua Garcia <joshuaga at usc.edu> wrote:

> Thanks for the suggestion, Eric!
>
> I've actually been using the text attribute of the function_definition
> rule in order to get the text I need.
>
>
>
Great, you started in the right direction.
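
As a rough sketch of what that amounts to (this is not code from the C
grammar): in ANTLR 3 the value returned by a parser rule exposes its start
and stop tokens, and CommonToken carries character offsets through
getStartIndex() and getStopIndex(). The example below stands in for those
offsets with a hypothetical Span class so that it runs without ANTLR; the
class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class FunctionSlicer {
    /** Hypothetical stand-in for the character offsets of one matched function_definition. */
    static final class Span {
        final int startIndex; // offset of the first character of the first token
        final int stopIndex;  // offset of the last character of the last token (inclusive)

        Span(int startIndex, int stopIndex) {
            this.startIndex = startIndex;
            this.stopIndex = stopIndex;
        }
    }

    /** Returns the original source text covered by one span, whitespace and comments included. */
    static String slice(String source, Span span) {
        return source.substring(span.startIndex, span.stopIndex + 1);
    }

    /** Slices every span in order, giving one string per function definition. */
    static List<String> sliceAll(String source, List<Span> spans) {
        List<String> functions = new ArrayList<String>();
        for (Span span : spans) {
            functions.add(slice(source, span));
        }
        return functions;
    }

    public static void main(String[] args) {
        String source = "extern int g;\nint add(int a, int b) { return a + b; }\n";
        // In a real parser these offsets would come from the rule's start and stop tokens.
        Span span = new Span(source.indexOf("int add"), source.lastIndexOf('}'));
        System.out.println(slice(source, span));
    }
}
```

Slicing the original character stream, rather than gluing token texts back
together, keeps the spacing and any comments inside the function body.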


> However, the grammar does not seem to be complete enough. That is why I
> have been expanding it a bit.


That's how I would do it. Keep breaking down certain grammar rules into
smaller rules until you get the detail you need in the AST.


> Will creating an AST be better than this?
>
>
>
No, I wouldn't use the AST to break down text into new tree nodes.


> If so, I'm not sure why that is, can you explain?
>
If I can get the grammar to break apart or combine terms (strings and
characters) in preparation for the AST, then I do it by modifying the
grammar. I never try to manipulate the text in the tokens once they hit the
AST. I then use the AST transformations to rearrange, prune, or insert
imaginary nodes into the tree.

>
> Also, do you or anyone else have any suggestions about dealing with the
> issue I'm experiencing due to pre-processed functions being turned into
> extern declarations? I need to pre-process the code from a version of bash
> in order for the C grammar to process it. However, the pre-processor
> transforms certain functions into extern declarations, which removes the
> text I need.
>
I haven't done C in ages, so what you say sounds right, but the pieces just
aren't falling into place for me. The only thought I have is to process the
file both before and after pre-processing and then merge the results.
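
One way to picture that merge, as a minimal sketch: assume each pass yields
a map from function name to the extracted definition text. Both maps, the
class name, and the preference order are assumptions made for illustration,
not something the grammar produces on its own. Entries from the preferred
pass are kept, and functions that show up only in the other pass (for
example, because the pre-processor turned them into extern declarations)
are filled in.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FunctionMerger {
    /**
     * Merges two name -> definition maps. Entries in 'preferred' win;
     * entries found only in 'fallback' are appended in their original order.
     */
    static Map<String, String> merge(Map<String, String> preferred,
                                     Map<String, String> fallback) {
        Map<String, String> merged = new LinkedHashMap<String, String>(preferred);
        for (Map.Entry<String, String> entry : fallback.entrySet()) {
            if (!merged.containsKey(entry.getKey())) {
                merged.put(entry.getKey(), entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical results of the two passes.
        Map<String, String> afterCpp = new LinkedHashMap<String, String>();
        afterCpp.put("add", "int add(int a, int b) { return a + b; }");

        Map<String, String> beforeCpp = new LinkedHashMap<String, String>();
        beforeCpp.put("add", "int add(int a, int b) { return ADD(a, b); }");
        beforeCpp.put("sub", "int sub(int a, int b) { return a - b; }");

        for (Map.Entry<String, String> entry : merge(afterCpp, beforeCpp).entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

Which pass should win on a conflict depends on whether the macro-expanded
text or the original text is more useful downstream.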


>
> I'm thinking I'll have to use a C pre-processor grammar. I've tried this
> one by Youngki KU, which is listed in the grammar list on the ANTLR site:
>
> http://www.antlr.org/grammar/1166665121622/Cpp.tar
>
> So far, I've gotten as far as renaming the CppTreeTreeParser identifiers
> in the generated code to CppTree. At that point, however, certain objects
> like Token, RuleReturnScope, arg, etc. that are instantiated in certain
> functions of CppParser.java produce errors with ANTLR 3.4. It's even worse
> when I try ANTLR 2.7.7, which I tried because the pre-processor grammar is
> from 2006. Also, if it helps, I'm using Oracle's Java version 1.6.0_26 on
> Ubuntu 11.10.
>
The only thing I can say about that is that I would avoid ANTLR 2.x, as I
don't think anybody answers questions about it here anymore. Quite a few of
the advanced people are moving on to ANTLR 4.x, with 3.x becoming a memory
and 2.x a distant one. I won't even look at anything 2.x so that it doesn't
mess with my brain.

If this is a funded project, you can always get paid support:
http://www.antlr.org/support.html


>
> Does anyone know what I have to do to get this old grammar to work, if
> there's no better way to do what I'm trying to do?
>
> Thanks,
> Josh
>
>
> On Mon, Mar 19, 2012 at 5:02 AM, Eric <researcher0x00 at gmail.com> wrote:
>
>> Hi Josh,
>>
>> Here is what I would try.
>>
>> The grammar should be creating an AST, and it has a function_definition
>> rule. I would use that rule to find the start and end tokens that make up
>> each function. If the tokens have their start and end line and position
>> set, I would use those as a quick test to see whether I am getting the
>> functions correctly. If so, I would then do a more advanced version that
>> prunes the parts of the AST that aren't needed in functions and
>> reconstructs the functions from the tokens in the AST.
>>
>> Some good general advice is:
>>
>> "A common problem with novices attempting to implement language analysis
>> is to believe that their task is simplified by moving sophisticated tasks
>> to conceptually simple tasks. They will try to simplify semantic analysis
>> by creating a more detailed syntactic analysis and syntactic analysis by
>> creating a more detailed lexical analysis. Almost invariably they discover
>> that this attempt is fruitless and has to be undone, because it results in
>> poor error reporting, runs into conflicts as the implementation becomes
>> more complete, duplicates functionality in the later portions of the
>> analysis, and is hard to maintain." By William Clodius
>>
>> In this case, let the parser do what it is best at: making sure the input
>> is valid and creating an AST. Don't create a pruned AST with the parser;
>> let the full AST pass on to another phase for AST analysis and
>> transformations. Let the AST transformations do the work rather than
>> putting the additional burden of filtering out the functions on the parser.
>>
>> Hope that helps, Eric
>>
>> On Sun, Mar 18, 2012 at 11:55 PM, Joshua Garcia <joshuaga at usc.edu> wrote:
>>
>>> Hi Everyone,
>>>
>>> I've been working on modifying an ANTLR C grammar so that it produces a
>>> parser that simply outputs function definitions it recognizes to
>>> different files. I need to do this in order to apply some information
>>> retrieval techniques to C source code.
>>>
>>> Is there a way to get the generated parser to recognize only the function
>>> definitions (including the function body) and comments while ignoring
>>> everything else? I've found it too troublesome to deal with comments so
>>> I've been ignoring them for now.
>>>
>>> If not, is there a way to get the generated parser to recognize only the
>>> function definitions (including the function body) and ignore everything
>>> else? I've been able to modify the grammar so that it can recognize a
>>> large majority of the functions in pre-processed files of a version of
>>> bash. However, pre-processing tends to turn some function definition
>>> text into extern declarations, so I lose function definition text that I
>>> need. Furthermore, the parser does not ignore everything else that's not
>>> part of a function definition; instead, I've added rules to the grammar
>>> in order to recognize as much of the bash version I'm parsing as
>>> possible.
>>>
>>> In particular, I've been trying to use this grammar:
>>>
>>> http://www.antlr.org/grammar/1153358328744/C.g
>>>
>>> Thanks,
>>> Josh
>>>

