[antlr-interest] Best practice to handle Lexer backtracking demand

Sat Aug 14 10:57:14 PDT 2010

------ Original Message (Saturday, August 14, 2010 10:17:05 
AM) From: Joachim Schrod ------
Subject: Re: [antlr-interest] Best practice to handle Lexer backtracking
demand
>
> Input: aaaaXXXXbbbb wwwXXX ddddYYYzzzz
>
How is XXXX guaranteed to be unambiguous with any other fragment of 
aaaaXXXXbbbb?  That is, how can you be sure that a fragment like aaaX or 
XXXXb will never match a different start marker.  Is there a case 
distinction, as implied, or something more interesting?  Is the 
distinction the same for the end marker?

> XXXX is the start marker, YYY is the end marker.
> A non LL(1)-prefix of the marker(s) (here `XXX') may appear in between.
> There is no separating char between markers and surrounding text.
> Markers don't start with the same char.
>
> There are ca. 20 start and 20 end markers, all of them different.
> (This is a data file that must be analyzed.)
>
> I need to detect (and validate) space separated words between the
> markers.
>
> With lexer backtracking, the grammar would read
>
> ----------
> input : text
> 	START1 word (WS word)* END1
> 	word WS word
> 	START2 word (WS word)* END2
> 	text
> 	START1 word (WS word)* END1
> 	EOF
> 	;
> text : (WS|NONWS)* ;
> word : NONWS+ ;
>
> START1 : 'XXXX' ;
> END1 : 'YYY' ;
> START2 : 'HHHH' ;
> END2 : 'MMMM' ;
> fragment WHITE : ' ' | '\n' ;
> WS : WHITE ;
> NONWS : ~WHITE ;
> ----------
>
> with more START and END definitions, of course. And input is
> longer, as it defines the whole data file structure, e.g., with
> some alternatives in marker order or such.
>
> Defining word as a lexer token is not possible as it would match
> 'aaaXXXXbbbb' and would be the longest match. Some different
> lexical grammar? There is no selection of characters that are
> specific to the markers. The definition of a word is really `a
> sequence of non-whitespace characters that is followed by one of
> the 40+ marker strings or a whitespace'. Using lexical filters is
> not possible either; I analyze and validate the complete data file.
>
> I hope this makes it more clear why I asked for the best practice
> for lexical backtracking. It's at the heart of the grammar above
> and was the minimal example that I came by that explains my problem
> on technical level.
>
> 	Joachim
>
> PS: I don't construct a tree, there are actions triggered by the
> markers, the words are the parameters.
>

-- 

Gerald B. Rosenberg, Esq.
NewTechLaw
260 Sheridan Ave., Suite 208
Palo Alto, CA 94306-2009
650.325.2100 (office) / 650.703.1724 (cell)
650.325.2107 (facsimile)

www.newtechlaw.com

CONFIDENTIALITY NOTICE: This email message (including any attachments) 
is being sent by an attorney,
is for the sole use of the intended recipient, and may contain 
confidential and privileged information.
Any unauthorized review, use, disclosure or distribution is prohibited. 
If you are not the intended
recipient, please contact the sender immediately by reply email and 
delete all copies of this message
and any attachments without retaining a copy.