[antlr-interest] Slightly different C-Parser

Stuart Watt SWatt at infobal.com
Wed Feb 6 08:49:57 PST 2008


I had this problem, and in the end walked away from all existing C grammars.
What I developed in the end was a slightly enhanced tokeniser, which matched
most kinds of brackets, and required fairly significant fixing-up which I
did using an XML-based AST, XPath, and some rather nasty Perl code. 
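
As an illustration of what a bracket-matching tokeniser can look like, here is a minimal sketch in Python. It is not the ANTLR tokeniser described above; the token pattern and the names `tokenize` and `nest` are assumptions made for this example. It drops comments, splits the rest into crude tokens, and groups each matched bracket pair into a nested list.

```python
import re

# Hypothetical, deliberately crude token pattern (an assumption for this
# sketch): comments, string/char literals, identifiers, numbers, and then
# any other single non-space character.
TOKEN_RE = re.compile(r"""
    /\*.*?\*/            # block comment
  | //[^\n]*             # line comment
  | "(?:\\.|[^"\\])*"    # string literal
  | '(?:\\.|[^'\\])*'    # character literal
  | [A-Za-z_]\w*         # identifier or keyword
  | \d[\w.]*             # number (loosely)
  | \S                   # any other single character
""", re.VERBOSE | re.DOTALL)

PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def tokenize(source):
    """Split source into tokens, discarding comments."""
    return [t for t in TOKEN_RE.findall(source)
            if not t.startswith(("/*", "//"))]


def nest(tokens):
    """Group tokens into nested lists, one list per matched bracket pair."""
    root = []
    stack = [(root, None)]  # (current group, closer that would end it)
    for tok in tokens:
        if tok in PAIRS:
            group = [tok]
            stack[-1][0].append(group)
            stack.append((group, PAIRS[tok]))
        elif tok in CLOSERS and len(stack) > 1 and stack[-1][1] == tok:
            stack[-1][0].append(tok)
            stack.pop()
        else:
            # Ordinary token, or a stray closer kept as a plain token.
            stack[-1][0].append(tok)
    return root
```

For example, `nest(tokenize("f(a[1]) { g(); }"))` yields one nested list for the argument list and one for the body, which later fix-up passes can inspect without a full C grammar.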
 
The problem is that C type definitions can be required to determine what
kind of declaration path to follow, and C type definitions are missing if
the include files are missing. This is over and above the macro issue. In
the end I came to the conclusion that a C parser is not what was needed, and
a skeletal parser for C-like languages (also developed using ANTLR) was a
better way forward. In the end, with a few good rules of thumb, I could find
all I needed from both C and C++ files, about as well as a commercial
product, although I have only tested it on a couple of relatively
straightforward packages. I suspect it would fail badly on code which pushes
C macros and type definitions to their limits, such as that used in
inter-language linking systems. However, since that isn't what I'm trying to
parse, I just decided not to care about it.
 
Macros can be ignored most of the time. I was working on the
assumption that a macro would usually behave (a) like a variable, or (b)
like a function, depending on context of use. Yes, sometimes they can be
used in a way that is impossible to lex properly, but in the end I stripped
macro definitions from the source at a preprocessing stage, and just
recorded their existence later. 
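
A minimal sketch of such a preprocessing stage, in Python, might look like the following. The function name `strip_macros` and the "object"/"function" labels are assumptions for this example, not the original implementation; the labels are taken from the shape of the definition, whereas the approach described above judged macros by context of use. Definitions are replaced with blank lines so that line numbers in the remaining source stay valid.

```python
import re

# Matches "#define NAME" and notes whether "(" follows the name directly,
# which is what makes a macro function-like in C.
DEFINE_RE = re.compile(r"^[ \t]*#[ \t]*define[ \t]+([A-Za-z_]\w*)(\()?")


def strip_macros(source):
    """Return (source with #define lines blanked, {name: kind})."""
    kept, macros = [], {}
    lines = source.split("\n")
    i = 0
    while i < len(lines):
        m = DEFINE_RE.match(lines[i])
        if m:
            macros[m.group(1)] = "function" if m.group(2) else "object"
            # Swallow backslash-continued lines belonging to the definition.
            while lines[i].rstrip().endswith("\\") and i + 1 < len(lines):
                kept.append("")
                i += 1
            kept.append("")  # blank line keeps line numbers stable
        else:
            kept.append(lines[i])
        i += 1
    return "\n".join(kept), macros
```

The returned dictionary is the record of which macros exist; the stripped text is what would be handed to the skeletal parser.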
 
It all probably depends on the depth of the parse you need. I didn't need
much, and I was working with code that was known to compile (Jim's useful
error hint wasn't needed), so a shallow skeletal parser worked fine. If I
needed to get into data flow, I'd probably have pushed harder at the C
grammar, and used backtracking a whole lot more, to see how good I could get
the AST. 
 
All the best
Stuart

-----Original Message-----
From: Jim Idle [mailto:jimi at temporal-wave.com]
Sent: Tuesday, February 05, 2008 11:26 AM
To: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Slightly different C-Parser



Starting with the GNU C grammars isn’t a bad idea if you don’t feel up to
writing one from scratch. There is the basis of a C parser for ANTLR3 in the
downloadable examples tar (see the downloads page). However some of the
situations you pose here (such as multiple ambiguous macro definitions)
might make the job impossible. You need a pre-processor for C or would need
to just parse the macro references without expanding them and take them as
read. A lot of semantic checking would just be impossible. 

  

However it should be perfectly feasible to write a parser that will skip
includes if they are not there and just parse the structure of C, assuming
that the macros were written correctly and expect a ‘;’ after them (some
people write macros whose bodies already contain the ‘;’). Then ANTLR3 will
recover automatically from a lot of common errors such as a missing ‘;’ and
so on, so you should get a long way without too much refinement. 
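
To illustrate the "skip includes if they are not there" idea, here is a small Python sketch. It is not ANTLR code; the function name `split_includes` and its `exists` parameter are assumptions for this example. It scans for `#include` directives, resolves each against a list of search directories, and reports the missing ones instead of failing.

```python
import os
import re

# Matches both #include "name" and #include <name>.
INCLUDE_RE = re.compile(
    r'^[ \t]*#[ \t]*include[ \t]*[<"]([^>"]+)[>"]', re.MULTILINE)


def split_includes(source, search_dirs, exists=os.path.isfile):
    """Return (paths of includes that resolve, names of includes that do not).

    `exists` is injectable so the lookup can be replaced or tested
    without touching the file system.
    """
    found, missing = [], []
    for name in INCLUDE_RE.findall(source):
        for directory in search_dirs:
            path = os.path.join(directory, name)
            if exists(path):
                found.append(path)
                break
        else:
            # No directory had the header: note it and carry on.
            missing.append(name)
    return found, missing
```

A driver could parse the headers in `found` as usual and simply log the names in `missing`, leaving any identifiers they would have declared unresolved.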



It can be a useful technique to program the structure of syntax errors that
are common, where you can do so without creating lots of ambiguities. This
will allow very specific messages to be produced. Generally you want your
parser to accept just about any old junk that seems like it might be
syntactically valid, then your semantic phase can be quite specific about
the problems it finds. So, when you see a language spec that says something
like ‘generics can’t be used here’, just use a common rule that allows them
anyway and avoids ambiguity, then your semantic check will pick up the
illegal use and flag it. When you are trying to do a partial parse like
this, the semantics will be more difficult and you may be given code that
would not be in error if a macro were defined. 
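
The "accept loosely, check later" approach can be sketched in a few lines of Python. This is an illustration only, using a rule chosen for the example: C99 allows at most one storage-class specifier per declaration. The function names and the message text are assumptions. The loose parsing step accepts any run of storage-class keywords, and a separate check reports the specific problem.

```python
# C99 storage-class specifiers (C11's _Thread_local is left out of this sketch).
STORAGE = {"typedef", "extern", "static", "auto", "register"}


def parse_specifiers(tokens):
    """Loose 'grammar' step: accept any number of storage-class keywords."""
    specs = []
    i = 0
    while i < len(tokens) and tokens[i] in STORAGE:
        specs.append(tokens[i])
        i += 1
    return specs, tokens[i:]


def check_specifiers(specs):
    """Semantic step: report the specific rule that was broken, if any."""
    problems = []
    if len(specs) > 1:
        problems.append(
            "more than one storage-class specifier: " + " ".join(specs))
    return problems
```

Because the parsing step never rejects `static extern int x;`, there is no syntax error to recover from; the check produces one targeted message instead.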

  

If possible, you should see whether you can just get the full source, but
I assume you have valid reasons for this partial parse. Assuming that there
is some specific purpose for this tool, you might consider writing a
lexer and grammar that are much more specific to that purpose, perhaps only
pulling out the things you are inspecting from the lexer (if possible), or
perhaps writing a pre-processor that fills in missing elements (or
takes partial definitions away or something). Whether the existing grammar
is a good place to start for something like this depends, I think, on what
you are trying to do. 

  

  

Jim 

  

From: Swapna R (GS-EC/EDG5 - RBIN/EMT2) [mailto:R.Swapna at de.bosch.com] 
Sent: Tuesday, February 05, 2008 7:53 AM
To: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Slightly different C-Parser 

  

Dear All, 

  

 I think my question was not clear. 

  

 I downloaded the GNU C grammars, but they are not supported in ANTLRWorks
1.1.6. Are these grammars available for this version? 

  

 The parser we need has to work in an environment where some of the includes
are missing and where a single macro could have multiple definitions. 

  

 To satisfy these requirements we would need to deviate in certain ways from
the GNU C parser. Is this a good idea? Or do you suggest that we write the
entire parser according to our needs? 

  

 Could any of you please respond? 

  

regards, 

Swapna 

  


From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Swapna R (GS-EC/EDG5
- RBIN/EMT2)
Sent: Tuesday, February 05, 2008 8:33 AM
To: antlr-interest at antlr.org
Subject: [antlr-interest] Slightly different C-Parser 

Hello, 

 For my project work I need to parse C files. Sometimes not all header
files are present. I have found the GNU C grammars. 

 Is it a good idea to change these grammars for specific use cases? How do
you suggest I go ahead? 

 If code does not conform to C syntax, then instead of raising an exception
I want the parser to continue and just create a problem statement for the
problematic code. 

Can any of you help me? This is very important as we are under
pressure to take a decision about the parser. 

Regards, 
Swapna 
