[stringtemplate-interest] [ST4] Specify the encoding in the template group file

Udo Borkowski ub at abego-software.de
Sat Jan 29 04:17:24 PST 2011


Hi Ter,

> Hi. don't we need to know that the encoding is before we can load the file?

Actually no, not at the moment we begin loading the file.

The whole approach is explained in detail in the XML reference documentation. Here is the basic idea:

- Read the first 4 bytes (raw, no encoding needed)
- Because we know which characters these bytes should be if there is a prolog ("<st("), we can now distinguish between these encodings:
	- UCS-4
	- UTF-16
	- UTF-8
	(this also covers things like little/big endian, octet order and Byte Order Mark)
- Once we know this we continue reading in the given encoding until we find the ")>". (All characters in the prolog are in ASCII.)
- If there is an encoding="…" we now know the exact encoding (e.g. when in UTF-8 mode we may find "ISO-8859-1").
- The rest of the file is read in the encoding we determined from the prolog.
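
A rough sketch of the steps above in Java, just to make the idea concrete. The class PrologSniffer and its methods are made up for illustration and are not part of ST4, and the "<st(...)>" prolog is only the syntax proposed in this thread. For simplicity it decodes a small buffer taken from the start of the file instead of streaming up to the ")>", and it uses Java's charset name UTF-32 for UCS-4.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ST4.
final class PrologSniffer {
    // Assumed prolog form: <st(version="4.0", encoding="ISO-8859-1")>
    private static final Pattern ENCODING =
        Pattern.compile("encoding\\s*=\\s*\"([^\"]+)\"");

    /** Guess the encoding family from the first four raw bytes; null if unknown. */
    static String guessFamily(byte[] b) {
        if (b.length < 4) return null;
        int b0 = b[0] & 0xFF, b1 = b[1] & 0xFF, b2 = b[2] & 0xFF, b3 = b[3] & 0xFF;
        // Byte Order Marks; the 4-byte marks must be tested before the 2-byte ones.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return "UTF-32BE";
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return "UTF-32LE";
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        // No BOM: match the byte layout of "<st(" in each encoding.
        if (b0 == 0 && b1 == 0 && b2 == 0 && b3 == '<') return "UTF-32BE";
        if (b0 == '<' && b1 == 0 && b2 == 0 && b3 == 0) return "UTF-32LE";
        if (b0 == 0 && b1 == '<' && b2 == 0 && b3 == 's') return "UTF-16BE";
        if (b0 == '<' && b1 == 0 && b2 == 's' && b3 == 0) return "UTF-16LE";
        if (b0 == '<' && b1 == 's' && b2 == 't' && b3 == '(') return "UTF-8";
        return null;
    }

    /**
     * Return the charset name for the whole file. 'head' holds the first bytes
     * of the file; 'fallback' is used when no prolog or BOM is recognized.
     */
    static String detect(byte[] head, String fallback) {
        String family = guessFamily(head);
        if (family == null) return fallback;
        // The prolog is plain ASCII, so decoding in the family encoding is safe
        // even if the rest of the file uses a different (declared) encoding.
        String text = new String(head, Charset.forName(family));
        int start = text.indexOf("<st(");
        int end = text.indexOf(")>");
        if (start < 0 || end < start) return family; // e.g. BOM but no prolog
        Matcher m = ENCODING.matcher(text.substring(start, end));
        return m.find() ? m.group(1) : family;
    }
}
```

A caller would read, say, the first few hundred bytes of the group file, pass them to detect(...) together with the platform default, and then reopen the file with a Reader for the returned charset name.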

If you like I can work out some code for this. Please let me know.

Udo

On 28.01.2011, at 22:20, Terence Parr wrote:

> Hi. don't we need to know that the encoding is before we can load the file?
> Ter
> On Jan 27, 2011, at 8:24 AM, Udo Borkowski wrote:
> 
>> Hi,
>> 
>> I suggest to specify the encoding/charset of a file in the file itself. This follows the same arguments as for the delimiter:
>> 
>>> why are we doing
>>> 
>>> STGroup g = new STGroupFile("t.stg", "ISO-8859-1", '$', '$');
>>> 
>>> when the file itself and not the code should determine what the encoding is. the code should not care. If we change the encoding in the file, the code no longer works. (Ter, bold stuff by Udo)
>> 
>> I suggest to add a new (optional) group statement the group file must start with (if defined). If it is missing the (platform specific) default encoding is used.
>> 
>> Possible syntax
>> 
>> 	'encoding' STRING
>> 
>> For the GroupDir case the "group.config" file could be used again. However we may consider if the individual template files may specify individual charsets, too. (Note: this is different from the "delimiter" case where all template files share the same delimiters)
>> 
>> BUT: as I am writing this I noticed that this is very much like the stuff in the XML preamble:
>> 
>>> <?xml encoding='ISO-8859-1'?>
>> 
>> Maybe we should do something similar. What about making the group file (optionally) start with something like this:
>> 
>> 	<st(version="4.0", encoding="ISO-8859-1")>
>> 
>> Looks familiar?!
>> 
>> This "template call" style also gives us the option to add more information later.
>> 
>> 
>> Udo
>> 
>> P.S.: Using such a well-defined file content prefix "<st(" will probably also allow us to support "encoding autodetection" (http://www.w3.org/TR/REC-xml/#sec-guessing) in the future.
>> 
>> _______________________________________________
>> stringtemplate-interest mailing list
>> stringtemplate-interest at antlr.org
>> http://www.antlr.org/mailman/listinfo/stringtemplate-interest
> 


