[antlr-interest] <feff> ??

ian eyberg ian at telematter.com
Tue May 5 11:47:30 PDT 2009


Hi Jim,

  Thanks again for some help.

  So are you suggesting that when I read the file I need to decode it
as UTF-16 if I detect the BOM? (The bytes FE FF = UTF-16, big-endian.) Then
I should just throw in a lexer rule that skips it, and make sure
that any line in my file can accept it first?

BOM     : '\uFEFF' { skip(); };
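
For the pre-processing idea, here is a rough sketch of what I have in mind,
written in Python only for illustration (the target language is still open).
The helper name decode_with_bom and the UTF-8 fallback are my own assumptions,
not anything ANTLR provides. It picks the encoding from a leading BOM, and
also drops any U+FEFF that turns up later in the text:

```python
import codecs

# BOM signatures, UTF-32 first so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode raw bytes, using a leading BOM (if any) to pick the encoding.

    `default` is an assumed fallback for files that carry no BOM.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            text = data[len(bom):].decode(encoding)
            break
    else:
        text = data.decode(default)
    # Assumed cleanup step: remove U+FEFF characters found mid-file,
    # e.g. when the file was built by concatenating BOM-prefixed pieces.
    return text.replace("\ufeff", "")
```

The decoded string could then be handed to the lexer, in which case the BOM
rule above might not be needed at all.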

On Tue, May 05, 2009 at 10:11:01AM -0700, Jim Idle wrote:
> ian eyberg wrote:
> > Hi,
> >
> >   someone has sent me a file to parse and there are all sorts of
> > '<feff>' characters in them in arbitrary spots -- looking it up
> > online it appears it's some sort of character to indicate what
> > encoding the strings are -- '(bom) byte order mark'
> >
> >   my question -- what should I do with these? should I accept that
> > some files are going to have these and convert them to spaces as a
> > sort of pre-processor or should I take the easy way out and say
> > "we don't support this" ;)
> >
> >   the person handing me the file says he never opened it in a text
> > editor and it was generated by a piece of software on an OS X box
> >
> >   maybe if I detect a bom in one of my documents I can convert the
> > entire file to the appropriate encoding first??
> >
> > thanks,
> >
> >   
> You don't say what target language you are going to use, but if you open 
> the file using the correct encoding, then I believe it will take care 
> of the BOM for you. The BOM is optional but it indicates if the string 
> that follows is Big Endian or Little Endian, which is important when 
> reading UCS2 and similar character encodings (of more than one byte). 
> You can't ignore it if the machine you are parsing on has a different 
> ordering than the one where the file was created.
> 
> See: http://unicode.org/faq/utf_bom.html for more information than you 
> could possibly want to know about the BOM.
> 
> Jim
> 

-- 
ian eyberg

