[antlr-interest] nested parsing (BSDL)

Sun Dec 30 01:08:27 PST 2007

I am trying to write a parser BSDL, which is a clunky language for
defining JTAG boundary scan information for an integrated circuit.
It is clunky because even though its designers apparently had better
ideas, design by committe dictated that it be "based on an existing
language" which turned out to be VHDL.   Instead of extending
the needed subset of VHDL, they stuffed a complex grammar into
strings.

A yacc/lex parser for outdated version 0.0 of the language is
here:   http://www.vhdl.org/vug_bbs/bsdl.parser
Part 5 of that message, also contains a sample file (search for the
third occurence of "8374".

Here is another sample file, that is cleaner than some:
http://www.standardics.nxp.com/support/documents/microcontrollers/txt/lpc2468fbd208.bsd
But the important subsets of the examples is excerpted below.

The problem is that the bulk of the structure of the language is
hidden inside strings.   Worse, the strings are broken at random
locations and concatenated with "&".

I have excerpted four of the important strings from the first so
you can see the problem:

constant DW_PACKAGE:PIN_MAP_STRING:="CLK:1, Q:(2,3,4,5,7,8,9,10), " &
           "D:(23,22,21,20,19,17,16,15),"  &
           "GND:6, VCC:18, OC_NEG:24, TDO:11, TMS:12, TCK:13, TDI:14";

attribute REGISTER_ACCESS of ttl74bct8374 : entity is
        "BOUNDARY (READBN, READBT, CELLTST),"   &
        "BYPASS (TOPHIP, SETBYP, RUNT, TRIBYP),"   &
        "BCR[2] (SCANCN, SCANCT)";   -- 2-bit Boundary Control Register

attribute INSTRUCTION_OPCODE of ttl74bct8374 : entity is
        "BYPASS (11111111, 10001000, 00000101, 10000100, 00000001),"   &
        "EXTEST (00000000, 10000000),"  &
        "SAMPLE (00000010, 10000010),"  &
        "INTEST (00000011, 10000011),"  &
        "TRIBYP (00000110, 10000110),"  &     -- Boundary Hi-Z
        "SETBYP (00000111, 10000111),"  &     -- Boundary 1/0
        "RUNT   (00001001, 10001001),"  &     -- Boundary run test
        "READBN (00001010, 10001010),"  &     -- Boundary read normal
        "READBT (00001011, 10001011),"  &     -- Boundary read test
        "CELLTST(00001100, 10001100),"  &     -- Boundary selftest normal
        "TOPHIP (00001101, 10001101),"  &     -- Boundary toggle out test
        "SCANCN (00001110, 10001110),"  &     -- BCR Scan normal
        "SCANCT (00001111, 10001111)";        -- BCR Scan test

attribute BOUNDARY_REGISTER of ttl74bct8374 : entity is
       -- num cell  port   function safe [ccell disval rslt]
         "17 (BC_1, CLK,    input,   X),"  &
         "16 (BC_1, OC_NEG, input,   X),"  &  -- Merged Input/Control
         "16 (BC_1, *,      control, 1),"  &  -- Merged Input/Control
         "15 (BC_1, D(1),   input,   X),"  &
         "14 (BC_1, D(2),   input,   X),"  &
         "13 (BC_1, D(3),   input,   X),"  &
         "12 (BC_1, D(4),   input,   X),"  &
         "11 (BC_1, D(5),   input,   X),"  &
         "10 (BC_1, D(6),   input,   X),"  &
         "9  (BC_1, D(7),   input,   X),"  &
         "8  (BC_1, D(8),   input,   X),"  &
         "7  (BC_1, Q(1),   output3, X,  16, 1, Z)," &  -- cell 16 @ 1 -> 
Hi-Z.
         "6  (BC_1, Q(2),   output3, X,  16, 1, Z)," &
         "5  (BC_1, Q(3),   output3, X,  16, 1, Z)," &
         "4  (BC_1, Q(4),   output3, X,  16, 1, Z)," &
         "3  (BC_1, Q(5),   output3, X,  16, 1, Z)," &
         "2  (BC_1, Q(6),   output3, X,  16, 1, Z)," &
         "1  (BC_1, Q(7),   output3, X,  16, 1, Z)," &
         "0  (BC_1, Q(8),   output3, X,  16, 1, Z)";

The yacc/lex implementation only calls yyparse() once.   It utilizes
a trick where it changes the getc() function used by the lexar
in midstream to splice the concatenated strings into the input.

Now, there are several possible ways of handling this:
   - Call a separate parser on the contents of each complex string.
     This will require at least 5 separate parsers and would
     generate 5 seperate ASTs, so it doesn't seem like a very
     elegant solution.
   - perform some sort of getc() substitution as the yacc/lex
     implementation did
   - Use some sort of replace operator.
     Antlr has, for example:
       ( blah )?
     which means ( blah ) is optional.   Is there something similar
     like
       ( blah )magic
     which means:
       - discard ( blah )
       - insert the modified version of blah into the input stream
         as if it occurred in the original text here.
     Or, more likely, something of the form
        ( blah ) {
           newstring=concatenate(...);
           discard();
           insert_here(newstring);
        }
It appears that the inner and outer grammars are not incompatable.

It also isn't clear to me from the documenation what the best
way to concatenate strings is.  I could not see that that was
handled in the ansi C

My current definition of a string (normal or structured) is:
STRING_LITERAL: ('"' ~('"')* '"' ) ('&' ('"' ~('"')* '"' ) )* ;
While "foo""bar" is a valid string in VHDL, it isn't in BSDL
so that isn't a problem.

generic_parameter: 'generic' '(' 'PHYSICAL_PIN_MAP' ':' 'string' ':=' 
STRING_LITERAL ')' ';';

which would probably get rewritten as something like:
generic_parameter: 'generic' '(' 'PHYSICAL_PIN_MAP' ':' 'string' ':='
(STRING_LITERAL){ fixme() }  ')' ';';
where fixme() is whatever performs the concatenation and reinsertion.

Fortunately, the BSDL subset of VHDL does not allow expressions other
than the concatenation of constant strings.

It appears. from the "How do I strip quotes?" FAQ entry  that in java, 
setText() and getText() might be part of the solution, but these
do not appear to be documented and probably do not cause reparsing.

A VHDL grammar is here:
http://www.antlr.org/grammar/1086696923011/vhdlams/index.html

Antlr supports rewrite rules but those seem to be at the AST level
and don't appear to result in reparsing.

BTW, my target language will be C for portability (this will be
a library).   It does not even handle litteral string concatenation, this
is left to the constant expression handling.

Of course, it would be better if this were possible to do directly in the
antlr grammar itself.   Tricks played from code fragments would be
invisable to tools like antlrworks.

It seems that there may be many languages where something similar is
needed.   For example,
    <date value="1999-03-02">
    </date>
might need to parse the date string.

For the general case, it is also better if this is done in such a way
that it doesn't matter if the subordinate grammar is compatable with
the parrent grammar.   To do this, you would probably specify a rule from
which to start parsing and might need to specify a different lexical
rule scope.   lexrules "main" { ... }, lexrules "subordinate" { ... }.

I am using antlr 2.7.6 (the latest version with debian package) but if
v3 handles this better, it would warrant an upgrade.