[antlr-interest] <feff> ??

Tue May 5 10:11:01 PDT 2009

ian eyberg wrote:
> Hi,
>
>   someone has sent me a file to parse and there are all sorts of
> '<feff>' characters in them in arbitrary spots -- looking it up
> online it appears it's some sort of character to indicate what
> encoding the strings are -- '(bom) byte order mark'
>
>   my question -- what should I do with these? should I accept that
> some files are going to have these and convert them to spaces as a
> sort of pre-processor or should I take the easy way out and say
> "we don't support this" ;)
>
>   the person handing me the file says he never opened it in a text
> editor and it was produced by a piece of software on an OS X box
>
>   maybe if I detect a bom in one of my documents I can convert the
> entire file to the appropriate encoding first??
>
> thanks,
>
>   
You don't say what target language you are going to use, but if you open
the file using the correct encoding, then I believe it will take care
of the BOM for you. The BOM is optional, but it indicates whether the text
that follows is big-endian or little-endian, which matters when
reading UCS-2 and similar encodings that use more than one byte per
code unit. You can't ignore it if the machine you are parsing on has a
different byte order than the one where the file was created.
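
As a rough illustration of "open it with the correct encoding", here is a
minimal Python sketch (Python is only an example; the original question
does not name a target language). The helper name decode_with_bom and the
UTF-8 fallback are assumptions made for this sketch. It picks a codec from
the leading BOM, lets that codec consume the BOM, and then drops any stray
U+FEFF characters left in the middle of the text, which is one possible
way to handle the "arbitrary spots" case.

```python
import codecs

# BOM byte sequences paired with codecs that consume the BOM while decoding.
# UTF-32 is checked first because the UTF-32 LE BOM begins with the same
# two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode bytes using the encoding implied by a leading BOM, if any.

    Hypothetical helper: falls back to `default` (an assumption) when the
    data does not start with a recognised BOM.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # These codecs read the BOM, pick the byte order, and strip it.
            text = data.decode(encoding)
            break
    else:
        text = data.decode(default)
    # Remove U+FEFF characters that appear later in the text, for example
    # where several BOM-prefixed files were concatenated.
    return text.replace("\ufeff", "")
```

The text handed to the lexer then contains no BOM characters at all,
regardless of the byte order of the machine that wrote the file.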

See: http://unicode.org/faq/utf_bom.html for more information than you 
could possibly want to know about the BOM.

Jim