[antlr-interest] Match the start and end of a line

Gary R. Van Sickle g.r.vansickle at att.net
Thu Dec 25 08:18:05 PST 2008


> From: Gavin Lambert
> 
> At 22:17 25/12/2008, Gary R. Van Sickle wrote:
>  >translation_unit
>  >    : (BOL statement EOL)+
>  >    ;
>  >
>  >You'd have to be throwing up WS tokens as well though for  
> >that to be buying you anything.
> 
> Actually even in that case it doesn't really buy you 
> anything, unless EOLs can occur in other contexts as well.  
> (And even then it's doubtful -- it'd just make parsing harder.)
> 

It would make parsing harder no doubt, but I'm thinking of cases such as in
SPICE, where (at least for some definitions of the term "SPICE"), the first
token on a line must start in the first column, i.e.:

OK:
"R1 0 1 1k\n"

Not valid:
"    R1 0 1 1k\n"

Some crusty old C preprocessors want the "#" in the first column as well.
Now, the utility of such restrictions may be dubious if you're writing a
recognizer, but maybe you're writing a validator to determine if the given
SPICE deck or C file will get through the crustiest of the crusty old
SPICEes or C preprocessors.
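
For illustration only, the "first token must start in the first column" check
can be sketched outside of ANTLR in a few lines of Python.  The function name
first_column_violations is made up for this example, and skipping blank lines
is an assumption, not something any particular SPICE requires:

```python
def first_column_violations(text):
    """Return 1-based line numbers whose first token does not start in column 1."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines carry no token, so nothing to check
        if line[0] in " \t":
            bad.append(lineno)  # leading whitespace pushes the token off column 1
    return bad


deck = "R1 0 1 1k\n    R2 1 2 2k\n"
print(first_column_violations(deck))  # [2]
```

This is a validator in the narrow sense described above: it says nothing
about whether the line is otherwise well-formed.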

So having slept on it, and given the above rationale, rules like these would
make some sort of sense:

spice_resistor_declaration
    : BOL name=ID WS node0=ID WS node1=ID WS value=ID WS EOL
        // PS: Yeah, a SPICE deck is a virtually-unparseable atrocity of a
        // "language".
        // Without some form of context tracking or feedback from the
        // parser, it's simply not possible for the lexer to tell a
        // component-type-plus-name from a node ID from a literal value.
        // Welcome to my world ;-(.
    ;

cpp_define
    : BOL '#' WS 'define' WS ID WS ( '(' WS define_param_list WS ')' )? WS
      define_body WS EOL
    ;

So, yeah, it buys you a mess, that's for sure.  Or wait, I think I see your
point: are you saying that if you're explicitly handling WS tokens in the
parser, the BOL buys you nothing?  I think you're right.  These should be
equivalent to the above, with no BOL complications:

spice_resistor_declaration
    : name=ID WS node0=ID WS node1=ID WS value=ID WS EOL
    ;

cpp_define
    : '#' WS 'define' WS ID WS ( '(' WS define_param_list WS ')' )? WS
      define_body WS EOL
    ;
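
A rough Python sketch of why the BOL token becomes redundant once WS and EOL
reach the parser: an indented line simply starts with a WS token, so a rule
that begins with ID rejects it.  Everything here is an assumption made for
the example: the names TOKEN_RE, tokenize and is_resistor_line, the crude ID
token ("any run of non-whitespace"), and treating the WS before EOL as
optional so that "R1 0 1 1k\n" is accepted:

```python
import re

# One alternative per token type; WS and EOL are kept rather than skipped.
TOKEN_RE = re.compile(r"(?P<EOL>\r?\n)|(?P<WS>[ \t]+)|(?P<ID>[^ \t\r\n]+)")


def tokenize(text):
    """Return (type, text) pairs, keeping WS and EOL instead of discarding them."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


def is_resistor_line(tokens):
    """Check one line's tokens against: ID WS ID WS ID WS ID WS? EOL."""
    kinds = [kind for kind, _ in tokens]
    if kinds[:7] != ["ID", "WS", "ID", "WS", "ID", "WS", "ID"]:
        return False
    rest = kinds[7:]
    return rest == ["EOL"] or rest == ["WS", "EOL"]


print(is_resistor_line(tokenize("R1 0 1 1k\n")))      # True
print(is_resistor_line(tokenize("    R1 0 1 1k\n")))  # False: first token is WS
```

No BOL token appears anywhere; the leading WS token does the rejecting.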

> 
> But I think the OP would need to explain a bit more about 
> *why* they're interested in line beginnings and endings 
> before we can be of more help.
> 

Indeed.  One thing I do know is that being able to anchor a lexer rule to the
start of a line would make parsing SPICE a bit more tractable.  If you could
do something like this, it would be one small step for man:

RESISTOR_ID : ^R[[:alnum:]_]+ ;

spice_resistor_declaration
    : RESISTOR_ID node0=ID node1=ID value=ID
        // Hey hey, welcome to 1973!  We can at least tell our component
        // types from our node IDs and literal values now!
    ;

Wouldn't need explicit EOLs either.
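
To show what the "^" anchor in that wished-for rule would buy, here is the
same idea as a Python regex sketch.  This is not ANTLR syntax and not a claim
about how ANTLR would implement it.  Python's re module has no POSIX bracket
classes, so [A-Za-z0-9_] stands in for [[:alnum:]_]; the name resistor_ids is
made up for the example:

```python
import re

# With re.MULTILINE, ^ matches at the start of every line, not just the input.
RESISTOR_ID = re.compile(r"^R[A-Za-z0-9_]+", re.MULTILINE)


def resistor_ids(deck):
    """Return every R-name that starts in the first column of some line."""
    return [m.group() for m in RESISTOR_ID.finditer(deck)]


deck = "R1 0 1 1k\nC1 1 0 1u\n  R2 1 0 2k\nR3 1 R4 5k\n"
print(resistor_ids(deck))  # ['R1', 'R3']
```

The indented R2 and the mid-line R4 are not reported: the anchor alone
separates a component name in column 1 from a node ID that happens to start
with R.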

Is there a reason why the ANTLR lexer doesn't/can't support full regexes?

-- 
Gary R. Van Sickle
 



