[antlr-interest] Lexer ambiguities

"Paul Bouché (NSN)" paul.bouche at nsn.com
Wed Feb 18 05:27:12 PST 2009


Johannes Luber wrote:
> The deeper problem lies in the fact that ANTLR uses an insufficient algorithm, so it cannot correctly sort out, in all cases, input that is unambiguous to humans.
From the book I glean that LL(*) covers all context-free languages.
Are the cases that are unambiguous for humans but ambiguous for computers
only unambiguous to humans because they have context? A blank, or any
other character for that matter, may be interpreted as white space in one
case and must be interpreted differently in another. The difference
between those cases is context, i.e. what came before and what the next
k symbols of lookahead are.
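
To make the point about context concrete, here is a minimal sketch in
Python (not ANTLR, and not a description of how the ANTLR lexer works
internally). It scans one `key = value` line and treats a blank as
skippable white space around the key, but as data once we are inside the
value. The function name `scan_property` and the token names `EQ` and
`Value` are made up for this illustration.

```python
def scan_property(line):
    """Split one 'key = value' line into (kind, text) tokens.

    Blanks around the key and directly after '=' are separators.
    Blanks inside or at the end of the value are kept, because the
    position relative to '=' (the context) says we are in a value.
    """
    key, sep, value = line.partition("=")
    if not sep:
        raise ValueError("expected '=' in %r" % line)
    tokens = [("Name", key.strip(" \t")), ("EQ", "=")]
    # Leading blanks after '=' are separators; later blanks are data.
    tokens.append(("Value", value.lstrip(" \t").rstrip("\n")))
    return tokens
```

With this, the trailing blank in `retries = 0 ` survives as part of the
value text, while the blanks around `retries` are dropped. The same
character is handled differently only because of where it occurs.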

Or could you be more concrete about what you mean by "uses an insufficient
algorithm"? Ah, it just occurred to me that the parser is LL(*), but the
lexer uses a cyclic DFA for prediction, which may not cover all
context-free languages and certainly not context-sensitive ones.

BR,
Paul

> Not sure if changing the algorithm would help here, too, but it would at least simplify the common cases. Unfortunately, it isn't clear when Ter will finally do a rewrite here.
>
> Johannes
>   
>> Johannes Luber wrote:
>>> Paul Bouché (NSN) wrote:
>>>> Hi,
>>>>
>>>> I have a lexer which already recognizes valid tokens of different types,
>>>> e.g. an integer will generate an integer token, a quoted string a string
>>>> token, an IP address an Ipaddress token, etc.
>>>> E.g.:
>>>>
>>>> property : key '=' value;
>>>> key : Name;
>>>> value : Integer | String | Ipaddress;
>>>> Name : ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '-' | ':' | '%')+;
>>>> Integer : ('+'|'-')? ('0'..'9')+;
>>>> // simplified; the actual grammar correctly limits each part to at most three digits
>>>> Ipaddress : ('0'..'9')+ '.' ('0'..'9')+ '.' ('0'..'9')+ '.' ('0'..'9')+;
>>>> String :  ( '\'' ( STRING_ | '`' | '"' | '\\' '\'' )* '\''
>>>>          | '"' ( STRING_ | '`' | '\'' | '\\' '"' )* '"'
>>>>          );
>>>> WHITESPACE
>>>>        :
>>>>        ( ' ' | '\t' | '\n' )+
>>>>        { skip(); }
>>>>        ;
>>>>
>>>> All works fine. Now I need to include unquoted strings with blanks. The
>>>> problem is '0 ' (zero blank, without quotes of course). I cannot get
>>>> the lexer to match this as an Integer as before. Basically I want a rule
>>>> which says: if it is not one of the previous tokens, try whether it is an
>>>> unquoted string. Of course an unquoted string may not contain newlines.
>>>> Any hints on how to achieve this?
>>>> I tried everything and ran several times into "code too large" exceptions
>>>> because the actual grammar is much more complex (there are more unquoted
>>>> values which are recognized by certain prefix characters such as <, 0x,
>>>> ::, etc.).
>>>>
>>>> Thanks a bunch!
>>>> Paul
>>>>
>>> Try to set the appropriate type later like it is done here:
>>>
>>> <http://www.antlr.org/wiki/display/ANTLR3/Lexer+grammar+for+floating+point,+dot,+range,+time+specs>
>>>
>>> Johannes
>
>   
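
The idea of setting the token type later, as Johannes suggests in the
quoted message, can be sketched outside ANTLR roughly like this. It is a
Python illustration under assumptions, not grammar code: take the raw
text after '=', try the specific types in a fixed order, and fall back to
an unquoted string. The function `classify_value`, the type name
`Unquoted`, and the regular expressions are hypothetical; they only
mirror the simplified rules from the quoted grammar, and the quoted-string
pattern is a simplification of the `String` rule.

```python
import re

# Patterns mirror the simplified lexer rules from the quoted grammar.
INTEGER = re.compile(r"[+-]?[0-9]+")
IPADDRESS = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")
# Simplified quoted strings with backslash escapes (an assumption).
SINGLE = r"'(?:[^'\\]|\\.)*'"
DOUBLE = r'"(?:[^"\\]|\\.)*"'
STRING = re.compile(SINGLE + "|" + DOUBLE)


def classify_value(text):
    """Return (type, text) for the raw text found after '='.

    Specific types are tried first. Anything else that contains no
    newline is treated as an unquoted string (blanks allowed).
    """
    candidate = text.strip(" \t")
    for name, pattern in (("Ipaddress", IPADDRESS),
                          ("Integer", INTEGER),
                          ("String", STRING)):
        # fullmatch: the whole candidate must fit the pattern
        if pattern.fullmatch(candidate):
            return (name, candidate)
    if "\n" in text:
        raise ValueError("unquoted string may not contain a newline")
    return ("Unquoted", candidate)
```

Under these assumptions `'0 '` comes out as an Integer, because
surrounding blanks are stripped before the specific patterns are tried,
while `0 1` or `hello world` fall through to the unquoted case.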


-- 
Paul Bouché