[antlr-interest] "An Introduction to ANTLR" presentation slides

Fri Feb 29 19:08:34 PST 2008

At 07:25 1/03/2008, Andy Tripp wrote:
>Syntax, as defined by dictionary.com, wikipedia, and (my 
>understanding of) common usage, means not the structure of just 
>any symbols, but the structure of written language symbols (i.e. 
>"words" or "tokens"). Thus, "syntax checking" is something a 
>parser does, but not something that a lexer or treewalker does.

Syntax is the structure of any kind of distinct symbols.  In this 
case the "word" is the fundamental unit and the "sentence" is the 
grouping of those units:

  - in a lexer, a "word" is a character and a "sentence" is a 
token
  - in a parser, a "word" is a token and a "sentence" is a set of 
tokens or an AST
  - in a tree walker, a "word" is an AST and a "sentence" is a set 
of ASTs

>The "meaning" or "semantics" for a lexer is the sequence of 
>output tokens.
>The "meaning" or "semantics" for a parser is the output AST.
>The "meaning" or "semantics" for a treewalker is whatever it 
>outputs (some modified AST or whatever).

No.  Those are the output syntax forms of each (what I referred to 
as "sentences" above) -- they do *not* represent semantics or 
meaning.

If you take an ANTLR grammar and remove all action code from it, 
then it will still take in input syntax and generate output 
syntax, but no inherent meaning is associated with it.  Thus left 
to itself ANTLR is a pure syntax recogniser/generator.  In 
addition to this is also permits semantic validation and 
constructs to be included, but this is convenience and is not 
essential to operation (except possibly for some syntactically 
ambiguous languages).

>We NEVER see an AST being referred to as a "syntax diagram" (or 
>"syntax" anything) - we call it an AST.

Yes, and what does AST actually stand for?  Abstract Syntax 
Tree.  Oh look, it *is* referred to as "syntax".

Perhaps another more concrete example is in order here.  The input 
is:

   int x = doCalculation(5);

This is a character stream which the lexer might convert into the 
token stream:

   KEYWORD[int] IDENTIFIER[x] ASSIGN[=] IDENTIFIER[doCalculation] 
OPAREN[(] NUMBER[5] CPAREN[)] SEMI[;]

The parser takes that token stream and converts it into the 
following AST:

   ( DECLARATION KEYWORD[int] IDENTIFIER[x] )
   ( ASSIGN[=]
     IDENTIFIER[x]
     ( FUNCTIONCALL IDENTIFIER[doCalculation] ( NUMBER[5] ) )
   )

Everything we have done up to this point is still all just 
syntax.  This is a perfectly valid AST and thus the input is valid 
syntactically.

But what happens when we start to verify the semantics?  What 
happens if it turns out that "doCalculation" isn't actually a 
function, or doesn't take a single numeric parameter, or doesn't 
return a type that can be compatibly assigned to an integer 
variable?  What happens if "x" had already been declared as a 
string variable?  Those are all semantic tests and they are 
independent of the AST itself.  So what was perfectly valid syntax 
may be semantically incorrect.

ANTLR lets you choose whether you want to do the semantic checks 
right at the end (in your own driver code, or in a tree walker), 
or whether you want to do them inline at the lexing or parsing 
stages (either to fail quickly or to resolve syntactical 
ambiguities).  But being able to insert them inline doesn't mean 
that they're directly linked the way you seem to have been 
saying.  They remain separate and distinct things.

I wonder if this is a similar confusion to that caused by having a 
combined grammar with literals in the parser rules -- it's 
permitted for convenience but it doesn't change the fact that 
they're treated separately (by modifying the lexer).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20080301/d98a1049/attachment-0001.html