[antlr-interest] Best practice to handle Lexer backtracking demand

Joachim Schrod jschrod at acm.org
Sat Aug 14 01:17:05 PDT 2010


Gerald Rosenberg wrote:
>   ------ Original Message (Saturday, August 14, 2010 3:41:08 
> AM) From: Joachim Schrod ------
> Subject: Re: [antlr-interest] Best practice to handle Lexer backtracking
> demand
> 
>> And, in case it's not clear, the example was made up by me to have
>> a minimal example to discuss. The real DSL is simple, but not as
>> simple:
> Going to have to work on my mind-reading...
>> I have several marker strings that delimit character
>> sequences w/o any marker string in them, need to determine just
>> that text between the markers, and need to check if the marker
>> strings come in the right order and what is between them. I.e.,
>> lexical filters are not sufficient, I need input validation as well.
> Please post an example -- input and grammar -- that properly matches the 
> complexity of your problem.

Input: aaaaXXXXbbbb wwwXXX ddddYYYzzzz

XXXX is the start marker, YYY is the end marker.
A non LL(1)-prefix of the marker(s) (here `XXX') may appear in between.
There is no separating char between markers and surrounding text.
Markers don't start with the same char.

There are ca. 20 start and 20 end markers, all of them different.
(This is a data file that must be analyzed.)

I need to detect (and validate) space separated words between the
markers.

With lexer backtracking, the grammar would read

----------
input : text
	START1 word (WS word)* END1
	word WS word
	START2 word (WS word)* END2
	text
	START1 word (WS word)* END1
	EOF
	;
text : (WS|NONWS)* ;
word : NONWS+ ;

START1 : 'XXXX' ;
END1 : 'YYY' ;
START2 : 'HHHH' ;
END2 : 'MMMM' ;
fragment WHITE : ' ' | '\n' ;
WS : WHITE ;
NONWS : ~WHITE ;
----------

with more START and END definitions, of course. And input is
longer, as it defines the whole data file structure, e.g., with
some alternatives in marker order or such.

Defining word as a lexer token is not possible as it would match
'aaaXXXXbbbb' and would be the longest match. Some different
lexical grammar? There is no selection of characters that are
specific to the markers. The definition of a word is really `a
sequence of non-whitespace characters that is followed by one of
the 40+ marker strings or a whitespace'. Using lexical filters is
not possible either; I analyze and validate the complete data file.

I hope this makes it more clear why I asked for the best practice
for lexical backtracking. It's at the heart of the grammar above
and was the minimal example that I came by that explains my problem
on technical level.

	Joachim

PS: I don't construct a tree, there are actions triggered by the
markers, the words are the parameters.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Joachim Schrod				Email: jschrod at acm.org
Roedermark, Germany



More information about the antlr-interest mailing list