[antlr-interest] Lexer ambiguities

Wed Feb 18 02:25:13 PST 2009

Hi,

Thanks for the pointer - very interesting. I found the comment by Lonnie 
VanZandt enlightening: "

Forgive me, I am dimwitted: I see that the point of the example is to 
illustrate insitu error reporting for malformed inputs. However, is the 
example /also/ the canonical recommended way to write a grammar for 
parsing numeric literals--without regard to whether or not one kindly 
detects malformities?

For example, programmatically peeking ahead via input.LA(2) rather than 
using a pattern matcher that implicitly looks at the character two 
characters ahead. Is that the preferred style?

Also, is this compound case/switch rule the recommended approach versus 
a collection of "token-specific" rules? That is, isn't it more 
elegant-when possible-to write a separate rule for OCTAL_LITERAL, 
FLOATING_POINT_LITERAL, HEX_LITERAL, TIME_LITERAL, etc? The approach 
shown seems very procedural versus a declarative approach.

(I realize that I can't have my Elegance Cake and eat it in the presence 
of ambiguous sentences...)

".

The deal is that we already have a lexer in place which does exactly what 
Lonnie suggests at the end of his comment: it has a separate rule for each 
token type. The example given here is, IMO, more or less a combination of 
a hand-written lexer and a generated one. I guess if you want to do the 
more complex things with ANTLR that so-called real-world applications 
need, you have to get down and dirty, or you have to be a Terence Parr ;-)

The example illustrates that it can be done, but it would mean a half 
rewrite of the current lexer.
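
To make the "match first, set the type later" idea concrete outside of 
ANTLR, here is a minimal sketch in Python. It is not the wiki example and 
not our grammar: the function name classify_value, the token name 
UnquotedString, and the two regular expressions (which mirror the 
simplified Integer and Ipaddress rules quoted below) are assumptions made 
for illustration only. The whole value is consumed first, and the token 
type is decided afterwards.

```python
import re

# Assumed patterns mirroring the simplified Integer and Ipaddress rules
# quoted later in this thread; the real grammar is stricter.
INTEGER = re.compile(r"[+-]?[0-9]+")
IPADDRESS = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")


def classify_value(raw):
    """Return (token_type, text) for the raw text between '=' and end of line.

    The text is consumed as one chunk first; the token type is chosen
    afterwards, with the unquoted string as the fallback.
    """
    text = raw.strip(" \t")  # surrounding blanks are not part of the value
    # Simplified quoted-string check: matching outer quotes only,
    # escape sequences are not validated here.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        return ("String", text)
    if INTEGER.fullmatch(text):
        return ("Integer", text)
    if IPADDRESS.fullmatch(text):
        return ("Ipaddress", text)
    # Hypothetical token name for "none of the above".
    return ("UnquotedString", text)
```

With this ordering, '0 ' (zero blank) is still classified as an Integer, 
while something like '0 is fine' falls through to the unquoted string.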

Thanks,
Paul

Johannes Luber schrieb:
> Paul Bouché (NSN) schrieb:
>   
>> Hi,
>>
>> I have a lexer which already recognizes valid tokens of different types, 
>> e.g. an integer will generate an Integer token, a quoted string a String 
>> token, an IP address an Ipaddress token, etc.
>> E.g:
>>
>> property : key '=' value;
>> key : Name;
>> value : Integer | String | Ipaddress;
>> Name : ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '-' | ':' | '%')+;
>> Integer : ('+'|'-')? ('0'..'9')+;
>> Ipaddress : ('0'..'9')+ '.' ('0'..'9')+ '.' ('0'..'9')+ '.' ('0'..'9')+ ;
>> // simplified; the actual grammar correctly allows at most three digits per group
>> String :  ( '\'' ( STRING_ | '`' | '"' | '\\' '\'' )* '\''
>>          | '"' ( STRING_ | '`' | '\'' | '\\' '"' )* '"'
>>          );
>> WHITESPACE
>>        :
>>        ( ' ' | '\t' | '\n' )+
>>        { skip(); }
>>        ;
>>
>> All works fine. Now I need to include unquoted strings with blanks. The 
>> problem is '0 ' (zero followed by a blank, without the quotes of course): I 
>> cannot get the lexer to still match this as an Integer as before. Basically 
>> I want a rule which says: if the input is none of the previous tokens, try 
>> whether it is an unquoted string. Of course an unquoted string may not 
>> contain newlines.
>> Any hints on how to achieve this?
>> I tried everything and several times ran into "code too large" errors 
>> because the actual grammar is much more complex (there are more unquoted 
>> values which are recognized by certain prefix characters such as < 0x 
>> :: etc.).
>>
>> Thanks a bunch!
>> Paul
>>
>>     
> Try to set the appropriate type later like it is done here:
> <http://www.antlr.org/wiki/display/ANTLR3/Lexer+grammar+for+floating+point,+dot,+range,+time+specs>
>
> Johannes
>   