ParabixTransform

The Parabix Transform

The Parabix Transform transposes ordinary character stream data into sets of bit-parallel data streams, each stream containing one bit per original character code unit position. Most often, byte-oriented streams of data in representations such as ASCII, ISO Latin 1 or UTF-8 are transposed into sets of 8 parallel bit streams, one stream for each bit of the corresponding code units. However, the concept also applies for character data streams using larger code units, such as those using the 16-bit code units of UTF-16 or the 32-bit code units of UTF-32. In each case, the transposed representation consists of one bit stream each for each bit position in the code units. In the case of UTF-32, however, implementations typically produce only 21 bit streams, as the high 11 bits of each 32-bit code unit are always zero.

Depending on the available processor architecture, there are many ways that the Parabix Transform may be implemented. Although the transformation can be performed in a serial fashion extracting one bit at a time from each byte, the overhead of the serial approach greatly limits its usefulness. In this section, we concentrate instead on parallel methods that impose a relatively small overhead on the processing of character data streams.

Ideal Three-Stage Parallel Transposition for Byte Streams

The Parabix Transform to transform byte-oriented character stream data to bit-parallel data streams can be implemented in a parallel fashion using the following three-stage transposition strategy.

Transform the byte stream data into two parallel nybble streams, one for the high nybble of each byte, and one for the low nybble of each byte. (A nybble is one-half of a byte, i.e., 4 bits).
Transform each of the nybble streams into two parallel bit-pair streams. Each bit-pair stream consists of a stream of 2-bit units in one-to-one correspondence with the input byte stream. The bit-pairs in the first such stream consist of bits 0 and 1 of each byte, the bit-pairs in the second stream consist of bits 2 and 3 of each byte, the bit-pairs in the third stream consist of bits 4 and 5 of each byte and the bit-pairs in the fourth and final bit-pair stream consist of the bits 6 and 7 of each byte.
Transform each of the bit-pair streams into two individual bit streams.

Ideal Implementation Using SIMD Packing

Given a processor architecture that supports binary SIMD operations on 2^k^-bit registers, an ideal implementation of the Parabix Transposition can be based on the availability of a family of horizontal packing operations. The operations required each have the IDISA pattern hsimd<{2,4,8}>::pack{h,l}(e1, e2), in which e1 and e2 are 2^k^-bit input registers, the field width of the fields processed is either 2, 4, or 8, and the packing operation selects the bits comprising either the high (h) or low (l) half of each field. The result in each case is a single 2^k^-bit value comprising the packed bits that are selected. For example, hsimd<2>::packh(e1, e2) selects the high bit of each 2-bit field in the concatenation of e1 and e2, returning the packed set of 2^k^ bits as a single 2^k^-bit value.
The following example illustrates this operation working with 16-bit (k=4) registers.

Expression	Field 1	Field 2
`e1`	`AaBbCcDd`	`EeFfGgHh`
`e2`	`JjKkLlMm`	`NnPpQqRr`
`hsimd<2>::packh(e1, e2)`	`ABCDEFGH`	`JKLMNPQR`

Similarly, hsimd<8>::packl(e1, e2) selects the low 4-bits of each 8-bit field in the concatenation of e1 and e2, again returning the result as a single 2^k^-bit value, as illustrated by the following example.

Expression	Field 1	Field 2
`e1`	`AaBbCcDd`	`EeFfGgHh`
`e2`	`JjKkLlMm`	`NnPpQqRr`
`hsimd<8>::packl(e1, e2)`	`CdDdGgHh`	`LlMmQqRr`

Using these operations it is possible to implement the ideal transposition strategy in a straightforward fashion. Given a 2^k^-byte sequence held consecutively in 8 registers s0, s1, … s7, the following 3-step transformation process performs transposition to parallel bit streams.

// Transpose to 2 nybble streams
lo_nybble0 = hsimd::packl(s0, s1);
lo_nybble1 = hsimd::packl(s2, s3);
lo_nybble3 = hsimd::packl(s4, s5);
lo_nybble4 = hsimd::packl(s6, s7);
hi_nybble0 = hsimd::packh(s0, s1);
hi_nybble1 = hsimd::packh(s2, s3);
hi_nybble3 = hsimd::packh(s4, s5);
hi_nybble4 = hsimd::packh(s6, s7);
// Transpose 2 nybble streams to 4 bit-pair streams.
bit01pair_0 = hsimd::packl(lo_nybble0, lo_nybble1);
bit01pair_1 = hsimd::packl(lo_nybble2, lo_nybble3);
bit23pair_0 = hsimd::packh(lo_nybble0, lo_nybble1);
bit23pair_1 = hsimd::packh(lo_nybble2, lo_nybble3);
bit45pair_0 = hsimd::packl(hi_nybble0, hi_nybble1);
bit45pair_1 = hsimd::packl(hi_nybble2, hi_nybble3);
bit67pair_0 = hsimd::packh(hi_nybble0, hi_nybble1);
bit67pair_1 = hsimd::packh(hi_nybble2, hi_nybble3);
// Transpose 4 bit-pairs streams to 8 bit streams.
bit0 = hsimd::packl(bit01pair_0, bit01pair_1);
bit1 = hsimd::packh(bit01pair_0, bit01pair_1);
bit2 = hsimd::packl(bit23pair_0, bit23pair_1);
bit3 = hsimd::packh(bit23pair_0, bit23pair_1);
bit4 = hsimd::packl(bit45pair_0, bit45pair_1);
bit5 = hsimd::packh(bit45pair_0, bit45pair_1);
bit6 = hsimd::packl(bit67pair_0, bit67pair_1);
bit7 = hsimd::packh(bit67pair_0, bit67pair_1);

Overall, transposition requires 8 pack operations for each of the three transposition steps, for a total of 24 operations for the entire process.

Implementation of Ideal Packing Using Bit Shuffles

Unfortunately, the hsimd<{2,4,8}>::pack{h,l}(e1, e2) family of operations for ideal transposition are not to be found within the SIMD instruction set of current commodity processors. However, if ''bit shuffle'' operations are to be found instead, then the pack operations can be expressed directly in terms of equivalent bit shuffles. Bit shuffles can be found both at the intermediate representation level with the LLVM shufflevector operation and at the processor ISA level in terms of the Haswell new instruction pext.

The LLVM shufflevector operation allows a result vector to be populated by directly selecting elements from a concatenated pair of input vectors. A constant vector of i32 selectors lets each vector element be selected from any of the positions within either of the two input vectors. For example, working with 8-bit input vectors (for simplicity of the example), the hsimd<2>::pack{h,l}(e1, e2) operations may be translated directly into shufflevector operations.

define <8 x i1> @hsimd_packh_2(<8 x i1> %x, <8 x i1> %y) {
   %result = shufflevector <8 x i1> %x, <8 x i1> %y, 
                <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
   return <8 x i1> result
}
define <8 x i1> @hsimd_packl_2(<8 x i1> %x, <8 x i1> %y) {
   %result = shufflevector <8 x i1> %x, <8 x i1> %y, 
                <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
   return <8 x i1> result
}

Similarly, it is straightforward to define the additional hsimd<{4,8}>::pack{h,l}(e1, e2) operations on 8-bit registers as follows.

define <8 x i1> @hsimd_packh_4(<8 x i1> %x, <8 x i1> %y) {
   %result = shufflevector <8 x i1> %x, <8 x i1> %y, 
                <8 x i32> <i32 2, i32 3, i32 6, i32 7, i32 10, i32 11, i32 14, i32 15>
   return <8 x i1> result
}
define <8 x i1> @hsimd_packl_4(<8 x i1> %x, <8 x i1> %y) {
   %result = shufflevector <8 x i1> %x, <8 x i1> %y, 
                <8 x i32> <i32 0, i32 1, i32 4, i32 5, i32 8, i32 9, i32 12, i32 13>
   return <8 x i1> result
}

define <8 x i1> @hsimd_packh_8(<8 x i1> %x, <8 x i1> %y) {
   %result = shufflevector <8 x i1> %x, <8 x i1> %y, 
                <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 12, i32 13, i32 14, i32 15>
   return <8 x i1> result
}
define <8 x i1> @hsimd_packl_8(<8 x i1> %x, <8 x i1> %y) {
   %result = shufflevector <8 x i1> %x, <8 x i1> %y, 
                <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
   return <8 x i1> result
}

Alternatively, we may implement hsimd<{4,8}>::pack{h,l}(e1, e2) using shufflevector with i2 and i4 vectors.

define <4 x i2> @hsimd_packh_4(<4 x i2> %x, <4 x i2> %y) {
   %result = shufflevector <4 x i2> %x, <4 x i2> %y, 
                <4 x i32> <i32 1, i32 3, i32 5, i32 7>
   return <4 x i2> result
}
define <4 x i2> @hsimd_packl_4(<4 x i2> %x, <4 x i2> %y) {
   %result = shufflevector <4 x i2> %x, <4 x i2> %y, 
                <4 x i32> <i32 0, i32 2, i32 4, i32 6>
   return <4 x i2> result
}

define <2 x i4> @hsimd_packh_8(<2 x i4> %x, <2 x i4> %y) {
   %result = shufflevector <2 x i4> %x, <2 x i4> %y, 
                <2 x i32> <i32 1, i32 3>
   return <2 x i4> result
}
define <2 x i4> @hsimd_packl_8(<2 x i4> %x, <2 x i4> %y) {
   %result = shufflevector <2 x i4> %x, <2 x i4> %y, 
                <2 x i32> <i32 0, i32 2>
   return <2 x i4> result
}

Byte Pack Implementation of the Parabix Transform

Although SIMD units typically do not provide direct bit packing implementations, ''byte packing'' operations that extract bytes from 16-bit fields are common. The simd<16>::packh and simd<16>::packl operations on 64-bit vectors arranged into 16-bit fields are illustrated by the following examples.

Expression	Field 1	Field 2	Field 3	Field 4
s0	0x0123	0x4567	0x89AB	0xCDEF
s1	0xaabb	0xccdd	0xeeff	0x0011
simd<16>::packh(s0, s1)	0x0145	0x89CD	0xaacc	0xee00
simd<16>::packl(s0, s1)	0x2367	0xABEF	0xbbdd	0xff11

Three-Stage Transposition Using Byte Packing

The following library code of the Parabix system implements three-stage transposition using byte packing.

#define s2p_step(s0, s1, hi_mask, shift, p0, p1)  \
  do {\
	BitBlock t0,t1;\
	t0 = hsimd<16>::packh(s0, s1);\
	t1 = hsimd<16>::packl(s0, s1);\
	p0 = simd<1>::ifh(hi_mask, t0, simd<16>::srli<shift>(t1));\
	p1 = simd<1>::ifh(hi_mask, simd<16>::slli<shift>(t0), t1);\
  } while(0)

#define s2p_bytepack(s0, s1, s2, s3, s4, s5, s6, s7, p0, p1, p2, p3, p4, p5, p6, p7) \
  do {\
	BitBlock bit00224466_0,bit00224466_1,bit00224466_2,bit00224466_3;\
	BitBlock bit11335577_0,bit11335577_1,bit11335577_2,bit11335577_3;\
	BitBlock bit00004444_0,bit22226666_0,bit00004444_1,bit22226666_1;\
	BitBlock bit11115555_0,bit33337777_0,bit11115555_1,bit33337777_1;\
	s2p_step(s0,s1,simd<2>::himask(),1,bit00224466_0,bit11335577_0);\
	s2p_step(s2,s3,simd<2>::himask(),1,bit00224466_1,bit11335577_1);\
	s2p_step(s4,s5,simd<2>::himask(),1,bit00224466_2,bit11335577_2);\
	s2p_step(s6,s7,simd<2>::himask(),1,bit00224466_3,bit11335577_3);\
	s2p_step(bit00224466_0,bit00224466_1,simd<4>::himask(),2,bit00004444_0,bit22226666_0);\
	s2p_step(bit00224466_2,bit00224466_3,simd<4>::himask(),2,bit00004444_1,bit22226666_1);\
	s2p_step(bit11335577_0,bit11335577_1,simd<4>::himask(),2,bit11115555_0,bit33337777_0);\
	s2p_step(bit11335577_2,bit11335577_3,simd<4>::himask(),2,bit11115555_1,bit33337777_1);\
	s2p_step(bit00004444_0,bit00004444_1,simd<8>::himask(),4,p0,p4);\
	s2p_step(bit11115555_0,bit11115555_1,simd<8>::himask(),4,p1,p5);\
	s2p_step(bit22226666_0,bit22226666_1,simd<8>::himask(),4,p2,p6);\
	s2p_step(bit33337777_0,bit33337777_1,simd<8>::himask(),4,p3,p7);\
  } while(0)

Byte Pack Using SSE2 `packuswb`

For example the SSE2 instruction packuswb packs 16-byte integers into bytes using unsigned saturation. By clearing the high 8 bits of the 16-bit field, this operation can be efficiently used to implement both the hsimd<16>::packh and hsimd<16>::packl operations.

template <>  bitblock_t hsimd<16>::packh(bitblock_t arg1, bitblock_t arg2)
{
  return _mm_packus_epi16(_mm_srli_epi16(arg2, 8), _mm_srli_epi16(arg1, 8));
}

template <> bitblock_t hsimd<16>::packl(bitblock_t arg1, bitblock_t arg2)
{
  const bitblock_t lomask = simd<16>::constant(0x00FF);       
  return _mm_packus_epi16(simd_and(arg2, lomask), simd_and(arg1, lomask));
}

Byte Pack Using Byte Shuffling

Byte pack operations are also conveniently modeled as byte shuffle operations. Thus, LLVM versions of pack are straightforward.

define <8 x i16> @hsimd_packh_16(<8 x i16> %x, <8 x i16> %y) {
   %result = shufflevector <8 x i16> %x, <8 x i16> %y, 
                <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
   return <8 x i16> result
}
define <8 x i16> @hsimd_packl_16(<8 x i16> %x, <8 x i16> %y) {
   %result = shufflevector <8 x i16> %x, <8 x i16> %y, 
                <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
   return <8 x i16> result
}

Given shufflevector operations satisfying the byte pack model, an LLVM code generator can conceivably produce the SSE2-based implementations shown previously, while an SSE3-based implementation might directly use the byte shuffle operation pshufb.