[antlr-interest] Really big generated C lexer?

Thu Apr 21 12:27:07 PDT 2011

Hi Jim,

Would you mind elaborating on this just a bit? You bring up this same 
concept quite frequently on this list, but there is something that I am 
missing. I understand how matching more generically in earlier stages 
will lead to better error message in later stages, but this seems to me 
to be very inefficient. Taking the GUID example, you could have a very 
simple lexer rule:

GUID: '{' (HEX_DIGIT|'-')+ '}';

This would match a valid GUID, as well as something close (including a 
lot of common mistaken forms I'm sure). This assumes that the parsing 
stage will validate this token further in its semantics. That's all well 
and good, but the problem is now you have to implement a GUID parser 
within the semantic actions of the original parser. A GUID is a rather 
simple example, so I could easily whip up some string manipulation code 
to validate the format, but this seems a bit odd to me. The whole reason 
we are using ANTLR in the first place is to avoid writing our own 
recognizers from scratch, so I feel like I shouldn't need to write one 
within the semantics of the ANTLR generated one. Would this be an 
appropriate use case for island grammars? This seems like a lot of added 
complexity (in both written code and execution time), seemingly only for 
the benefit of better error messages. The beauty of Chris's original 
implementation of GUID is that it does not require any validation in 
semantics - the lexer will only match a completely valid GUID. As you 
said, the error message may not seem all that helpful. The example that 
I gave above has nearly the same problem though. It will allow you to 
put the '-' anywhere in the GUID, but will give just as bad of an error 
if a '~' character is in the middle of the input. So perhaps a better 
lexer rule would be:

GUID: '{' ~('}')+ '}'

Which will accept all sorts of screwed up syntax, but allow you to 
produce an error like "A GUID cannot contain the character '~'. A GUID 
should be of the form ...". This makes the GUID recognizer within the 
semantics even more complex, because it has to handle even more invalid 
possibilities.

I realize that there is probably a good balance between the two 
extremes, but it is not clear to me where that would be. I guess my 
question is, is this purely a trade-off between speed/simplicity and 
error handling, or am I missing something that would allow one to get 
the best of both worlds?

Thanks,

- Justin

On 4/21/2011 2:36 PM, Jim Idle wrote:
> You are just trying to do too much in the lexer really so it means you get
> a lot of tables. Left factor and don't try to validate things in the
> lexer. For instance you just need a very generic rule for matching a GUID
> and then verify it semantically.
>
> Use antlr.markmail.org for getting advice on pushing error messages as far
> down the tool chain as you can and why these kinds of things happen.
> Basically though you will get:
>
> Unexpected character 'x'
>
> vs:
>
> Line 12, offset 33: "A GUID should be of the form XXXX-XXXXXXX- ...."
>
> But, the C compiler will do a good job of dealing with the code you
> generated.
>
> Jim
>
>> -----Original Message-----
>> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
>> bounces at antlr.org] On Behalf Of Chris McConnell
>> Sent: Thursday, April 21, 2011 11:23 AM
>> To: antlr-interest at antlr.org
>> Subject: [antlr-interest] Really big generated C lexer?
>>
>> The attached grammar generates a C lexer file of 150,000 lines.  Is
>> this typical or did I do something dumb in the grammar?  I'd attach the
>> C lexer file, but it is 10mb...
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address