u32u8 transcoder

u32u8 Transcoder Using Parallel Bit Streams

The overall structure of UTF-32 to UTF-8 conversion has 5 phases.

Determination of the u8fieldmask and the extraction mask for the file to be converted. These masks identify the number and positions of the UTF-8 code units (single-byte ASCII or 2-byte, 3-byte and 4-byte sequences) to be generated for each input UTF-32 code unit.
Compression of the u8fieldmask under the extraction mask to produce u8final, the stream set that marks the final code unit of each UTF-8 sequence and so fixes the exact length of each transcoded sequence.
Deposit masks for UTF-8 code unit sequences are calculated by bitwise operations on u8final. Three deposit mask streams, namely u8initial, u8mask12_17 and u8mask6_11, are generated. u8initial marks the initial positions of the UTF-8 code unit sequences. u8mask12_17 identifies the positions at which bits 12 through 17 are to be deposited for each character. Similarly, u8mask6_11 identifies the positions at which bits 6 through 11 are to be deposited for each Unicode character.
The u32basis parallel bit streams are spread in accordance with the deposit masks to calculate the partial UTF-8 code unit sequences.
The partial UTF-8 code unit sequences are assembled to produce the complete transcoded UTF-8 code unit sequences.
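
For orientation, the following scalar sketch encodes one codepoint at a time using the same bit groups that the deposit masks refer to (bits 0 through 5, 6 through 11, 12 through 17 and 18 through 20). It is only a reference against which the parallel output can be checked, not part of the Parabix pipeline; the names encodeUtf8 and unit are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Narrow an assembled value to one UTF-8 code unit.
static uint8_t unit(unsigned v) { return static_cast<uint8_t>(v); }

// Scalar reference encoder for a single codepoint (illustrative only).
// Surrogate codepoints are not rejected in this sketch.
std::vector<uint8_t> encodeUtf8(uint32_t cp) {
    assert(cp <= 0x10FFFF);
    const unsigned b0_5   = cp & 0x3F;          // bits 0..5
    const unsigned b6_11  = (cp >> 6) & 0x3F;   // bits 6..11
    const unsigned b12_17 = (cp >> 12) & 0x3F;  // bits 12..17
    const unsigned b18_20 = (cp >> 18) & 0x07;  // bits 18..20
    if (cp < 0x80) {
        return {unit(cp)};                                   // 0xxxxxxx
    }
    if (cp < 0x800) {
        return {unit(0xC0 | b6_11), unit(0x80 | b0_5)};      // 110xxxxx 10xxxxxx
    }
    if (cp < 0x10000) {
        return {unit(0xE0 | b12_17), unit(0x80 | b6_11),     // 1110xxxx 10xxxxxx
                unit(0x80 | b0_5)};                          // 10xxxxxx
    }
    return {unit(0xF0 | b18_20), unit(0x80 | b12_17),        // 11110xxx 10xxxxxx
            unit(0x80 | b6_11), unit(0x80 | b0_5)};          // 10xxxxxx 10xxxxxx
}
```

For example, encodeUtf8(0x10023) yields the four code units F0 90 80 A3.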

Example

As a simple running example, we use the following UTF-32 codepoints.

𐀣 ॐ ā c

The corresponding UTF-32 data, shown as a hex dump in 16-bit words (low-order word first), is as follows:

0000000 0023 0001 0020 0000 0950 0000 0020 0000
0000010 0101 0000 0020 0000 0063 0000
000001c

The data stream is parsed to mark the number of byte fields to be extracted from each UTF-32 codepoint for transcoding into UTF-8 code units. u8fieldmask is calculated using an if-hierarchy that determines the UTF-8 sequence length by testing the bit values at particular positions. The hierarchy is as follows:

If any of bits 16 through 20 are 1, a four-byte UTF-8 sequence is required. Hence the u8fieldmask of 0001 is used.
Otherwise, if any of bits 11 through 15 are 1, a three-byte UTF-8 sequence is required, resulting in a u8fieldmask of 001.
Otherwise, if any of bits 7 through 10 are 1, a two-byte UTF-8 sequence is required, and the u8fieldmask is 01.
Otherwise the codepoint is ASCII and a u8fieldmask of 1 is used.
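
The hierarchy above can be sketched per codepoint in scalar C++. This is only an illustration of the bit tests, not the Pablo kernel, which applies them to whole bit streams at once; the function names utf8Length and fieldMask are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Number of UTF-8 code units needed for one codepoint, using the same
// bit tests as the if-hierarchy in the text.
unsigned utf8Length(uint32_t cp) {
    assert(cp <= 0x10FFFF);          // only bits 0..20 may be set
    if (cp & 0x1F0000) return 4;     // any of bits 16..20
    if (cp & 0x00F800) return 3;     // any of bits 11..15
    if (cp & 0x000780) return 2;     // any of bits 7..10
    return 1;                        // ASCII
}

// Field mask as written in the text: zeros followed by a single 1
// that marks the final code unit of the sequence.
std::string fieldMask(uint32_t cp) {
    std::string mask(utf8Length(cp) - 1, '0');
    mask.push_back('1');
    return mask;
}
```

For the running example, fieldMask(0x10023) gives 0001, fieldMask(0x0950) gives 001, fieldMask(0x0101) gives 01 and fieldMask(0x63) gives 1.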

The first codepoint 𐀣 needs four UTF-8 bytes to be represented. Hence its extraction mask is 1111. As it is a 4-byte codepoint, the u8fieldmask, which marks the final (4th) byte position of the transcoded codepoint, is 0001. The same technique is applied to 1-byte, 2-byte and 3-byte sequences to generate their extraction masks and field masks.

The second phase compresses the u8fieldmask under the extraction mask to produce the stream set u8final, which marks the final positions of the UTF-8 sequences to be produced.

Data stream:     𐀣 ॐ ā c
Hex:             00 01 00 23 00 00 00 20 00 00 09 50 00 00 00 20 00 00 01 01 00 00 00 20 00 00 00 63
u8fieldmask:     0  0  0  1           1     0  0  1           1        0  1           1           1   
extraction mask: 1  1  1  1  .  .  .  1  .  1  1  1  .  .  .  1  .  .  1  1  .  .  .  1  .  .  .  1                     
                 \ \   / /   \  \  /  /  \  \  /  /  \  \  /  /  \  \  /  /  \  \  /  /  \  \  /  /                          
u8final:          0 0 0 1        1           0 0 1        1            0 1        1           1
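
The compression step in the diagram can be sketched as a filter that keeps only the field mask positions selected by the extraction mask. The sketch below works on character strings with one character per input byte, and assumes that unextracted positions of the field mask hold 0; the Parabix kernel performs the equivalent operation on bit streams, and the name compressByMask is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Keep stream[i] wherever mask[i] is '1' and drop every other position.
std::string compressByMask(const std::string& stream, const std::string& mask) {
    assert(stream.size() == mask.size());
    std::string out;
    for (std::size_t i = 0; i < mask.size(); ++i) {
        if (mask[i] == '1') out.push_back(stream[i]);
    }
    return out;
}
```

With the field mask 0001 repeated for the seven codepoints of the example and the extraction mask 1111 0001 0111 0001 0011 0001 0001 (spaces removed), the result is 0001100110111, which is the u8final row above.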

In the third phase, following Parabix methods, bitwise operations on u8final define the deposit mask bit streams, which mark the positions at which bits of the input data are inserted to form partially transcoded UTF-8 sequences.

u8final:        ...11..11.111...
u8initial:      1...11..11.11...
u8mask12_17:    .1..11..11.11...
u8mask6_11:     ..1.1.1.11.11...
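
One way to derive the three deposit masks from u8final with shifts and logical operations is sketched below. It stores stream position i in bit i of a 64-bit word, so moving a marker to the next position is a left shift. The formulas are an assumption that reproduces the streams shown above; they are not necessarily the exact operations used by the Parabix kernel, and the names fromDiagram, toDiagram and depositMasks are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Parse a '.'/'1' diagram into a word; stream position i is stored in bit i.
uint64_t fromDiagram(const std::string& s) {
    assert(s.size() <= 64);
    uint64_t w = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (s[i] == '1') w |= uint64_t{1} << i;
    return w;
}

// Render the first n positions of a word as a '.'/'1' diagram.
std::string toDiagram(uint64_t w, std::size_t n) {
    std::string s(n, '.');
    for (std::size_t i = 0; i < n; ++i)
        if ((w >> i) & 1) s[i] = '1';
    return s;
}

struct DepositMasks { uint64_t u8initial, u8mask12_17, u8mask6_11; };

// n is the number of UTF-8 code unit positions covered by u8final.
DepositMasks depositMasks(uint64_t u8final, std::size_t n) {
    assert(n >= 1 && n <= 63);
    const uint64_t valid = (uint64_t{1} << n) - 1;
    // A sequence starts at position 0 and right after every final position.
    const uint64_t initial = ((u8final << 1) | 1) & valid;
    // Bits 6..11 go one position before the final unit, or into the
    // single unit itself when the sequence is ASCII (initial == final).
    const uint64_t m6_11 = (u8final & initial) | ((u8final & ~initial) >> 1);
    // Bits 12..17 go one position earlier again, unless the 6..11
    // position is already the initial unit of the sequence.
    const uint64_t m12_17 = (m6_11 & initial) | ((m6_11 & ~initial) >> 1);
    return {initial, m12_17, m6_11};
}
```

Applied to the 13 code unit positions of the example, depositMasks(fromDiagram("...11..11.111"), 13) gives the u8initial, u8mask12_17 and u8mask6_11 rows shown above, without the trailing padding.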

The fourth phase is spreading the input data stream with respect to the deposit masks calculated. While spreading the input data stream, an offset value is considered so that the bits being spread are in the correct position of the UTF-8 code units.

spread by mask: u8final, u32basis, deposit0_5, offset=12
u8final =              ...11..11.111...
u32basis[0] =          ...1......1.1...
u32basis[1] =          ...11...1..11...
u32basis[2] =          ................
u32basis[3] =          ....1...1..1....
u32basis[4] =          .......1........
u32basis[5] =          ...1........1...

The spreading of all the other basis bit streams is performed with respect to the deposit mask to create the partial UTF-8 code units.
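
The spread operation itself can be sketched as a software parallel deposit: bit k of a stream that holds one bit per codepoint is moved to the position of the k-th set bit of the mask. The sketch assumes 64-bit words with stream position i in bit i and does not model the offset parameter; the names basisBit and spreadByMask are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Collect one bit of every codepoint: bit k of the result is the
// requested bit of codepoints[k].
uint64_t basisBit(const std::vector<uint32_t>& codepoints, unsigned bit) {
    assert(codepoints.size() <= 64 && bit < 32);
    uint64_t w = 0;
    for (std::size_t k = 0; k < codepoints.size(); ++k)
        if ((codepoints[k] >> bit) & 1) w |= uint64_t{1} << k;
    return w;
}

// Deposit bit k of source at the position of the k-th set bit of mask,
// counting from the lowest position.
uint64_t spreadByMask(uint64_t source, uint64_t mask) {
    uint64_t result = 0;
    for (unsigned k = 0; mask != 0; ++k) {
        const uint64_t lowest = mask & (~mask + 1);  // isolate lowest set bit
        if ((source >> k) & 1) result |= lowest;
        mask &= mask - 1;                            // clear lowest set bit
    }
    return result;
}
```

For the example codepoints, bit 0 is set for 𐀣, ā and c. Spreading that stream by u8final (positions 3, 4, 7, 8, 10, 11 and 12) sets positions 3, 10 and 12, which matches the u32basis[0] row above.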

The final part is to assemble the deposit bits to form complete sequences of UTF-8 code units. A few basic properties of UTF-8 code units are utilized while assembling the deposit bits.

With the help of bitwise operations on the deposit bits and deposit masks, the UTF-8 assembly kernel fits the deposit bits into the right places of the multibyte UTF-8 code unit sequences.

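First, identify the ASCII positions, where deposit bit 6 is used directly as bit 6 of the code unit. A position is ASCII when it is both the initial and the final code unit of its sequence.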

PabloAST * ASCII = pb.createAnd(u8initial, u8final);
ASCII =     ....1...1..11
PabloAST * nonASCII = pb.createNot(ASCII, "nonASCII");
nonASCII =  1111.111.11..

Extract the positions where deposit bit 6 is used for an ASCII code unit. Then update the deposit bit 6 stream so that it retains only the bits belonging to non-ASCII codepoints.

PabloAST * ASCIIbit6 = pb.createAnd(dep6_11[0], ASCII);
ASCIIbit6 =     ............1...
dep6_11[0] = pb.createAnd(dep6_11[0], nonASCII);
dep6_11[0] =    ......1.........

Combine the deposit bit streams with bitwise OR to form the low six bits of every UTF-8 code unit; for ASCII code points these are already the final values of bits 0 through 5.

    for (unsigned i = 0; i < 6; i++) {
        u8basis[i] = pb.createOr(dep0_5[i], dep6_11[i]);
        u8basis[i] = pb.createOr(u8basis[i], dep12_17[i], "basis" + std::to_string(i));
        if (i < 3) u8basis[i] = pb.createOr(u8basis[i], dep18_20[i]);
    }

u8basis[0] =  ...1..1...1.1
u8basis[1] =  ...1........1
u8basis[2] =  ......1..1...
u8basis[3] =  ............. 
u8basis[4] =  .1.....1.....
u8basis[5] =  ...11.1.1..11

Set the high bit (bit 7) of every prefix and suffix byte of a non-ASCII UTF-8 sequence to 1.

u8basis[7] = nonASCII;
u8basis[7] = 1111.111.11..

The second-highest bit (bit 6) is 1 for the prefix byte of every non-ASCII sequence; for an ASCII code unit it takes the value of deposit bit 6.

u8basis[6] = pb.createOr(pb.createAnd(u8initial, nonASCII), ASCIIbit6, "basis6");
u8basis[6] = 1....1...1..1

For any 3 or 4-byte sequence, the 3rd highest bit of the prefix is set to 1.

u8basis[5] = pb.createOr(u8basis[5], pb.createAnd(u8initial, pb.createNot(u8mask6_11)), "basis5");
u8basis[5] = 1..1111.1..11

For any 4-byte sequence, the 4th highest bit of the prefix is set to 1. In a 3-byte prefix this bit carries codepoint data and is left as deposited.

u8basis[4] = pb.createOr(u8basis[4], pb.createAnd(u8initial, pb.createNot(u8mask12_17)), "basis4");
u8basis[4] = 11.....1.....

Transposing these u8basis bit streams provides us with the result sequence of UTF-8 code units.
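
The transposition can be checked against the example with a small scalar sketch that reads the eight u8basis diagrams and rebuilds one byte per position. It is only a verification aid operating on the '.'/'1' strings shown above, not the Parabix transposition kernel; the name transposeBasis is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Rebuild code units from eight basis diagrams: u8basis[b][pos] == '1'
// means that bit b of the code unit at position pos is set.
std::vector<uint8_t> transposeBasis(const std::array<std::string, 8>& u8basis) {
    const std::size_t n = u8basis[0].size();
    std::vector<uint8_t> bytes(n, 0);
    for (unsigned bit = 0; bit < 8; ++bit) {
        assert(u8basis[bit].size() == n);
        for (std::size_t pos = 0; pos < n; ++pos) {
            if (u8basis[bit][pos] == '1')
                bytes[pos] |= static_cast<uint8_t>(1u << bit);
        }
    }
    return bytes;
}
```

Feeding in the final u8basis[0] through u8basis[7] streams of the example produces F0 90 80 A3 20 E0 A5 90 20 C4 81 20 63, the UTF-8 encoding of the input text.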