·  Encoding – Text – UTF-8

 

- Info:

  - UTF-8 is one way to represent a Unicode character as a sequence of bits (whole bytes) so that it can be stored in memory.

  - UTF-8 stores the integer code point of each Unicode character using 1, 2, 3 or 4 bytes.

  - UTF-8 uses 1 byte per character for the ASCII characters (U+0000 to U+007F),

  - UTF-8 uses 2 bytes per character for the next 1,920 code points (U+0080 to U+07FF), which include Latin letters with diacritics, Greek and Cyrillic,

  - UTF-8 uses 3 bytes per character for the rest of the BMP (U+0800 to U+FFFF),

  - UTF-8 uses 4 bytes per character to represent the characters in planes 1 to 16 (U+10000 to U+10FFFF).
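The four size classes above can be sketched as a small C helper. The names `utf8_length` and `utf8_length_demo` are hypothetical and not part of any library. The sketch also assumes that surrogates (U+D800 to U+DFFF) and values above U+10FFFF should be reported as not encodable.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bytes UTF-8 needs for code point cp, or 0 if cp cannot be
   encoded (surrogate or above U+10FFFF). Hypothetical helper name. */
int utf8_length(uint32_t cp)
{
    if (cp < 0x80)                    return 1;  /* ASCII */
    if (cp < 0x800)                   return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates */
    if (cp < 0x10000)                 return 3;  /* rest of the BMP */
    if (cp < 0x110000)                return 4;  /* planes 1 to 16 */
    return 0;
}

/* Spot checks, one per size class. */
void utf8_length_demo(void)
{
    assert(utf8_length(0x41) == 1);     /* A */
    assert(utf8_length(0x160) == 2);    /* Š */
    assert(utf8_length(0x20AC) == 3);   /* euro sign */
    assert(utf8_length(0x1F600) == 4);  /* a plane 1 character */
}
```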

 

- Code points representation:

  - UTF-8 represents the code points of Unicode characters as defined in the following ranges and table. All interval bounds are hexadecimal.

 

  - Code points from the interval [0x0, 0x80) are represented with 1 byte, exactly as in the ASCII character set.

  - Code points from the interval [0x80, 0x800) are represented with 2 bytes.

  - Code points from the interval [0x800, 0x10000) are represented with 3 bytes.

  - Code points from the interval [0x10000, 0x110000) are represented with 4 bytes. The 4-byte pattern has room for values up to 0x1FFFFF, but Unicode ends at U+10FFFF.

 

  - The bits of the code point (marked v) are split across the bytes using the following rules:

    bytes | bits | representation

      1   |    7 | 0vvvvvvv

      2   |   11 | 110vvvvv 10vvvvvv

      3   |   16 | 1110vvvv 10vvvvvv 10vvvvvv

      4   |   21 | 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv

 

  - Here is a little C function that does this:

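    #include <stdio.h>

    /* Write code point c to stdout as UTF-8.
       Returns 0 on success, -1 if c is a surrogate or lies above U+10FFFF.
       Named put_utf8 because putwchar is already a standard library
       function declared in <wchar.h>. */
    int put_utf8(unsigned long c)
    {
      if (c < 0x80) {
        putchar((int)c);
      }
      else if (c < 0x800) {
        putchar((int)(0xC0 | (c >> 6)));
        putchar((int)(0x80 | (c & 0x3F)));
      }
      else if (c < 0x10000) {
        if (c >= 0xD800 && c <= 0xDFFF)
          return -1;                        /* surrogates are not encodable */
        putchar((int)(0xE0 | (c >> 12)));
        putchar((int)(0x80 | ((c >> 6) & 0x3F)));
        putchar((int)(0x80 | (c & 0x3F)));
      }
      else if (c < 0x110000) {              /* Unicode ends at U+10FFFF */
        putchar((int)(0xF0 | (c >> 18)));
        putchar((int)(0x80 | ((c >> 12) & 0x3F)));
        putchar((int)(0x80 | ((c >> 6) & 0x3F)));
        putchar((int)(0x80 | (c & 0x3F)));
      }
      else {
        return -1;
      }
      return 0;
    }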
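The opposite direction, turning bytes back into a code point, follows from the same bit patterns in the table. The following is a minimal sketch, not a definitive implementation. The names `utf8_decode` and `utf8_decode_demo` are hypothetical. The sketch assumes that overlong forms, surrogates and values above U+10FFFF should be rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s (n bytes available) into *cp.
   Returns the number of bytes consumed, or 0 on malformed input. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point allowed for each sequence length. */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t len, i;
    uint32_t c;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { len = 1; c = s[0]; }         /* 0vvvvvvv */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }  /* 110vvvvv */
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }  /* 1110vvvv */
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }  /* 11110vvv */
    else return 0;               /* continuation byte or invalid lead byte */

    if (n < len) return 0;       /* sequence is cut short */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* must look like 10vvvvvv */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (c < min_cp[len]) return 0;             /* overlong form */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;  /* surrogate */
    if (c > 0x10FFFF) return 0;                /* beyond Unicode */
    *cp = c;
    return len;
}

/* Usage: the bytes C5 A0 decode back to U+0160. */
void utf8_decode_demo(void)
{
    const unsigned char bytes[] = {0xC5, 0xA0};
    uint32_t cp = 0;
    assert(utf8_decode(bytes, sizeof bytes, &cp) == 2);
    assert(cp == 0x160);
}
```

The minimum-value check matters because, for example, C0 80 matches the 2-byte bit pattern but encodes 0, which belongs in a single byte.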

 

- Example - Encode Letter Š:

  - In the Unicode character set, the letter Š is defined with code point U+0160, where 0160 is a hexadecimal number.

  - In UTF-8 encoding, every Unicode character whose code point is in the interval [0x80, 0x800) is represented with 2 bytes.

    The 6 least significant bits go into the second byte, also as its 6 least significant bits, and we put 10 in front to complete the byte.

    The remaining 5 most significant bits go into the first byte and we put 110 in front to complete the byte.

    Š = 160 hex = 00101 100000 bin = 110 00101    10 100000 UTF-8 bin = C5 A0 UTF-8 hex

    This way the letter Š is represented with the bytes C5 A0.

  - You can test this with UltraEdit like this:

    - Start UltraEdit – Paste the following line: <meta http-equiv="content-type" content="text/html; charset=UTF-8">

    - Edit – Hex Functions – Hex Edit – Type: C5A0

    - Save the file as C:\inetpub\wwwroot\test.html

    - Start Internet Explorer – http://localhost/test.html – This is displayed: Š
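If no hex editor is at hand, the same test file can probably be produced with a few lines of C. This is only a sketch. The function name `write_test_page` is hypothetical. It assumes the file should hold the meta line, a newline, and then the two bytes C5 A0.

```c
#include <assert.h>
#include <stdio.h>

/* Write the meta line followed by the UTF-8 bytes of Š (C5 A0) to path.
   Returns 0 on success, -1 on failure. Hypothetical helper name. */
int write_test_page(const char *path)
{
    static const unsigned char s_caron[] = {0xC5, 0xA0};
    FILE *f;
    int ok;

    assert(path != NULL);
    f = fopen(path, "wb");   /* binary mode, so the bytes are written as-is */
    if (f == NULL)
        return -1;
    ok = fputs("<meta http-equiv=\"content-type\" "
               "content=\"text/html; charset=UTF-8\">\n", f) >= 0
         && fwrite(s_caron, 1, sizeof s_caron, f) == sizeof s_caron;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling `write_test_page("C:\\inetpub\\wwwroot\\test.html")` would stand in for the UltraEdit steps. The browser step stays the same.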

 

- Example - Encode Letters Š š Č č Ć ć Đ đ:

  - Using the above procedure, the following Unicode characters are represented in memory like this:

    Š = 160   hex =  00101 100000 bin = 110 00101  10 100000 UTF-8 bin = C5 A0  UTF-8 hex

    š = 161   hex =  00101 100001 bin = 110 00101  10 100001 UTF-8 bin = C5 A1  UTF-8 hex

    Č = 10C   hex =  00100 001100 bin = 110 00100  10 001100 UTF-8 bin = C4 8C  UTF-8 hex

    č = 10D   hex =  00100 001101 bin = 110 00100  10 001101 UTF-8 bin = C4 8D  UTF-8 hex

    Ć = 106   hex =  00100 000110 bin = 110 00100  10 000110 UTF-8 bin = C4 86  UTF-8 hex

    ć = 107   hex =  00100 000111 bin = 110 00100  10 000111 UTF-8 bin = C4 87  UTF-8 hex

    Đ = 110   hex =  00100 010000 bin = 110 00100  10 010000 UTF-8 bin = C4 90  UTF-8 hex

    đ = 111   hex =  00100 010001 bin = 110 00100  10 010001 UTF-8 bin = C4 91  UTF-8 hex
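The rows above can be regenerated mechanically, which is a convenient way to check the hand calculation. The sketch below assumes the output goes to a terminal that interprets UTF-8. The names `encode2`, `to_bits` and `print_table` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Encode a code point from the 2-byte range [0x80, 0x800). */
void encode2(unsigned cp, unsigned char out[2])
{
    assert(cp >= 0x80 && cp < 0x800);
    out[0] = (unsigned char)(0xC0 | (cp >> 6));    /* 110vvvvv */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));  /* 10vvvvvv */
}

/* Write the n low bits of v as '0'/'1' text; out needs n + 1 chars. */
static void to_bits(unsigned v, int n, char *out)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = ((v >> (n - 1 - i)) & 1u) ? '1' : '0';
    out[n] = '\0';
}

/* Print one row per letter in the same layout as the table. */
void print_table(void)
{
    static const unsigned cps[] = {0x160, 0x161, 0x10C, 0x10D,
                                   0x106, 0x107, 0x110, 0x111};
    char hi[6], lo[7];
    unsigned char b[2];
    size_t i;

    for (i = 0; i < sizeof cps / sizeof cps[0]; i++) {
        encode2(cps[i], b);
        to_bits(cps[i] >> 6, 5, hi);     /* upper 5 bits, first byte */
        to_bits(cps[i] & 0x3F, 6, lo);   /* lower 6 bits, second byte */
        /* The first two %c emit the encoded bytes themselves, which a
           UTF-8 terminal displays as the letter. */
        printf("%c%c = %03X hex = %s %s bin = 110 %s  10 %s UTF-8 bin = "
               "%02X %02X UTF-8 hex\n",
               b[0], b[1], cps[i], hi, lo, hi, lo,
               (unsigned)b[0], (unsigned)b[1]);
    }
}
```

Calling `print_table()` from `main` prints the eight rows. `encode2` handles only the 2-byte range on purpose, since all eight letters fall inside it.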