Define encoding: Text Files

 

- Info:

- Unlike HTML files, plain text files have no special tags that can be used to declare the encoding.

This is why such files can carry information about their encoding in their first few bytes, which are called the Byte Order Mark (BOM).

- For example, if UTF-16 is used, each Unicode character in the Basic Multilingual Plane (U+0000 to U+FFFF) is represented with exactly 2 bytes; characters above U+FFFF take 4 bytes (a surrogate pair).

In big-endian byte order, where the most significant byte comes first, the letter with code point U+0160 (Š) is stored as 01 60.

In little-endian byte order, where the least significant byte comes first, the same letter is stored as 60 01.
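The two byte orders can be checked with a short Python sketch. It relies only on the standard `utf-16-be` and `utf-16-le` codecs, which write no Byte Order Mark; the helper name `hex_bytes` is made up for this illustration.

```python
def hex_bytes(data):
    """Format bytes as space-separated upper-case hex pairs (illustrative helper)."""
    return " ".join(f"{b:02X}" for b in data)

# U+0160 is the letter used in the text above.
letter = "\u0160"

# These two codecs fix the byte order explicitly and do not emit a BOM.
big_endian = letter.encode("utf-16-be")
little_endian = letter.encode("utf-16-le")

print(hex_bytes(big_endian))     # 01 60
print(hex_bytes(little_endian))  # 60 01
```

The same code point produces the same two bytes in both cases; only their order in the file differs, which is exactly what a reader has to know before decoding.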

- Byte Order Mark  Encoding               File with single letter (U+0160)
  EF BB BF         UTF-8                  EF BB BF C5 A0
  FF FE            UTF-16, little endian  FF FE 60 01
  FE FF            UTF-16, big endian     FE FF 01 60
  FF FE 00 00      UTF-32, little endian  FF FE 00 00 60 01 00 00
  00 00 FE FF      UTF-32, big endian     00 00 FE FF 00 00 01 60
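As a rough sketch, the Byte Order Marks listed above can be used to guess the encoding of raw file content. The function names `detect_bom` and `decode_with_bom`, and the UTF-8 fallback for data without a BOM, are assumptions made for this example; the BOM constants come from Python's standard `codecs` module. Note that FF FE 00 00 is ambiguous in principle (it could also be UTF-16 little endian followed by a NUL character); this sketch assumes UTF-32 in that case.

```python
import codecs

# Longer marks are checked first: the UTF-32 little-endian BOM begins with
# the same two bytes (FF FE) as the UTF-16 little-endian BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no known BOM is found."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0

def decode_with_bom(data, fallback="utf-8"):
    """Strip a leading BOM, if any, and decode the remaining bytes.

    The fallback encoding for BOM-less data is an assumption of this sketch.
    """
    encoding, skip = detect_bom(data)
    return data[skip:].decode(encoding or fallback)

# The UTF-16 little-endian row from the table above.
sample = bytes.fromhex("FF FE 60 01")
print(detect_bom(sample))                   # ('utf-16-le', 2)
print(decode_with_bom(sample) == "\u0160")  # True
```

Checking the marks in order of decreasing length is a design choice that avoids misreading a UTF-32 little-endian file as UTF-16.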

- Additional information can be found at:

http://www.icu-project.org/docs/papers/forms_of_unicode/

http://msdn2.microsoft.com/en-us/library/ms776429.aspx

http://www.noveltheory.com/TechPapers/endian.asp