cameron · fa2bfbd2
Hide whitespace changes
Inline Side-by-side

Showing with 268 additions and 0 deletions

ParabixTransform.md ParabixTransform.md +268 -0

No files found.
--- a/ParabixTransform.md
+++ b/ParabixTransform.md
+# The Parabix Transform 
+
+The Parabix Transform transposes ordinary character stream data into sets of bit-parallel data streams, each stream containing one bit per original character code unit position.
+Most often, byte-oriented streams of data in representations such as ASCII, ISO Latin 1 or UTF-8 are transposed into sets of 8 parallel bit streams, one stream for each bit of the corresponding code units.   However, the concept also applies for character data streams
+using larger code units, such as those using the 16-bit code units of UTF-16 or the 32-bit code units of UTF-32.  In each case, the transposed representation consists of one bit stream each for each bit position in the code units.   In the case of UTF-32, however, implementations typically produce only 21 bit streams, as the high 11 bits of each 32-bit code unit are always zero.
+
+Depending on the available processor architecture, there are many ways that the Parabix
+Transform may be implemented.  Although the transformation can be performed in a serial
+fashion extracting one bit at a time from each byte, the overhead of the serial approach greatly limits its usefulness.   In this section, we concentrate instead on parallel methods that impose a relatively small overhead on the processing of character data streams.
+
+## Ideal Three-Stage Parallel Transposition for Byte Streams 
+
+The Parabix Transform to transform byte-oriented character stream data to bit-parallel data streams
+can be implemented in a parallel fashion using the following
+three-stage transposition strategy.
+
+  1.  Transform the byte stream data into two parallel nybble streams, one for the high nybble of each byte, and one for the low nybble of each byte.   (A nybble is one-half of a byte, i.e., 4 bits).
+  2.  Transform each of the nybble streams into two parallel bit-pair streams.  Each bit-pair stream consists of a stream of 2-bit units in one-to-one correspondence with the input byte stream.   The bit-pairs in the first such stream consist of bits 0 and 1 of each byte, the bit-pairs in the second stream consist of bits 2 and 3 of each byte, the bit-pairs in the third stream consist of bits 4 and 5 of each byte and the bit-pairs in the fourth and final bit-pair stream consist of the bits 6 and 7 of each byte.
+  3.  Transform each of the bit-pair streams into two individual bit streams. 
+
+
+### Ideal Implementation Using SIMD Packing
+
+Given a processor architecture that supports binary SIMD operations on 2^k^-bit
+registers, an ideal implementation of the Parabix Transposition can be based
+on the availability of a family of horizontal packing operations.  The operations
+required each have the IDISA pattern `hsimd<{2,4,8}>::pack{h,l}(e1, e2)`,
+in which `e1` and `e2` are 2^k^-bit input registers,
+the field width of the fields processed is either 2, 4, or 8, and
+the packing operation selects the bits comprising either the high (`h`) or low (`l`) half of each field.   The result in each case is a single 2^k^-bit value
+comprising the packed bits that are selected.    For example,
+`hsimd<2>::packh(e1, e2)` selects the high bit of each 2-bit field
+in the concatenation of `e1` and `e2`, returning the packed set of 
+2^k^ bits as a single 2^k^-bit value.   
+The following example illustrates this operation working with 16-bit (k=4) registers.
+
+| Expression | Field 1 | Field 2 |
+| ---------- | ------- | ------- |
+|`e1`|`AaBbCcDd`|`EeFfGgHh`|
+|`e2`|`JjKkLlMm`|`NnPpQqRr`|
+|`hsimd<2>::packh(e1, e2)`|`ABCDEFGH`|`JKLMNPQR`|
+
+Similarly, `hsimd<8>::packl(e1, e2)`
+selects the low 4-bits of each 8-bit field in the concatenation of `e1` and `e2`,
+again returning the result as a single 2^k^-bit value, as illustrated by the following example.
+
+| Expression | Field 1 | Field 2 |
+| ---------- | ------- | ------- |
+|`e1`|`AaBbCcDd`|`EeFfGgHh`|
+|`e2`|`JjKkLlMm`|`NnPpQqRr`|
+|`hsimd<8>::packl(e1, e2)`|`CdDdGgHh`|`LlMmQqRr`|
+
+Using these operations it is possible to implement the ideal transposition
+strategy in a 
+straightforward fashion.   Given a 2^k^-byte sequence held consecutively
+in 8 registers `s0`, `s1`, … `s7`, the following 
+3-step transformation process performs transposition to parallel bit streams.
+
+```
+// Transpose to 2 nybble streams
+lo_nybble0 = hsimd::packl(s0, s1);
+lo_nybble1 = hsimd::packl(s2, s3);
+lo_nybble3 = hsimd::packl(s4, s5);
+lo_nybble4 = hsimd::packl(s6, s7);
+hi_nybble0 = hsimd::packh(s0, s1);
+hi_nybble1 = hsimd::packh(s2, s3);
+hi_nybble3 = hsimd::packh(s4, s5);
+hi_nybble4 = hsimd::packh(s6, s7);
+// Transpose 2 nybble streams to 4 bit-pair streams.
+bit01pair_0 = hsimd::packl(lo_nybble0, lo_nybble1);
+bit01pair_1 = hsimd::packl(lo_nybble2, lo_nybble3);
+bit23pair_0 = hsimd::packh(lo_nybble0, lo_nybble1);
+bit23pair_1 = hsimd::packh(lo_nybble2, lo_nybble3);
+bit45pair_0 = hsimd::packl(hi_nybble0, hi_nybble1);
+bit45pair_1 = hsimd::packl(hi_nybble2, hi_nybble3);
+bit67pair_0 = hsimd::packh(hi_nybble0, hi_nybble1);
+bit67pair_1 = hsimd::packh(hi_nybble2, hi_nybble3);
+// Transpose 4 bit-pairs streams to 8 bit streams.
+bit0 = hsimd::packl(bit01pair_0, bit01pair_1);
+bit1 = hsimd::packh(bit01pair_0, bit01pair_1);
+bit2 = hsimd::packl(bit23pair_0, bit23pair_1);
+bit3 = hsimd::packh(bit23pair_0, bit23pair_1);
+bit4 = hsimd::packl(bit45pair_0, bit45pair_1);
+bit5 = hsimd::packh(bit45pair_0, bit45pair_1);
+bit6 = hsimd::packl(bit67pair_0, bit67pair_1);
+bit7 = hsimd::packh(bit67pair_0, bit67pair_1);
+```
+
+Overall, transposition requires 8 pack operations for each of the three transposition steps, for a total
+of 24 operations for the entire process.   
+
+### Implementation of Ideal Packing Using Bit Shuffles
+
+Unfortunately, the `hsimd<{2,4,8}>::pack{h,l}(e1, e2)` family of operations for
+ideal transposition are not to be found within the SIMD instruction set of current
+commodity processors.  However, if ''bit shuffle'' operations are to be found instead,
+then the pack operations can be expressed directly in terms of equivalent 
+bit shuffles.   Bit shuffles can be found both at the intermediate representation
+level with the LLVM `shufflevector` operation and at the processor ISA level
+in terms of the Haswell new instruction `pext`.
+
+The LLVM `shufflevector` operation allows a result vector to be populated
+by directly selecting elements from a concatenated pair of input vectors.
+A constant vector of `i32` selectors lets each vector element be selected
+from any of the positions within either of the two input vectors.   For example,
+working with 8-bit input vectors (for simplicity of the example), the 
+`hsimd<2>::pack{h,l}(e1, e2)` operations may be translated directly into
+`shufflevector` operations.
+```
+define <8 x i1> @hsimd_packh_2(<8 x i1> %x, <8 x i1> %y) {
+   %result = shufflevector <8 x i1> %x, <8 x i1> %y, 
+                <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
+   return <8 x i1> result
+}
+define <8 x i1> @hsimd_packl_2(<8 x i1> %x, <8 x i1> %y) {
+   %result = shufflevector <8 x i1> %x, <8 x i1> %y, 
+                <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
+   return <8 x i1> result
+}
+```
+
+Similarly, it is straightforward to define the additional `hsimd<{4,8}>::pack{h,l}(e1, e2)`
+operations on 8-bit registers as follows.
+```
+define <8 x i1> @hsimd_packh_4(<8 x i1> %x, <8 x i1> %y) {
+   %result = shufflevector <8 x i1> %x, <8 x i1> %y, 
+                <8 x i32> <i32 2, i32 3, i32 6, i32 7, i32 10, i32 11, i32 14, i32 15>
+   return <8 x i1> result
+}
+define <8 x i1> @hsimd_packl_4(<8 x i1> %x, <8 x i1> %y) {
+   %result = shufflevector <8 x i1> %x, <8 x i1> %y, 
+                <8 x i32> <i32 0, i32 1, i32 4, i32 5, i32 8, i32 9, i32 12, i32 13>
+   return <8 x i1> result
+}
+```
+```
+define <8 x i1> @hsimd_packh_8(<8 x i1> %x, <8 x i1> %y) {
+   %result = shufflevector <8 x i1> %x, <8 x i1> %y, 
+                <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 12, i32 13, i32 14, i32 15>
+   return <8 x i1> result
+}
+define <8 x i1> @hsimd_packl_8(<8 x i1> %x, <8 x i1> %y) {
+   %result = shufflevector <8 x i1> %x, <8 x i1> %y, 
+                <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
+   return <8 x i1> result
+}
+```
+
+Alternatively, we may implement `hsimd<{4,8}>::pack{h,l}(e1, e2)` using
+`shufflevector` with `i2` and `i4` vectors.
+
+```
+define <4 x i2> @hsimd_packh_4(<4 x i2> %x, <4 x i2> %y) {
+   %result = shufflevector <4 x i2> %x, <4 x i2> %y, 
+                <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+   return <4 x i2> result
+}
+define <4 x i2> @hsimd_packl_4(<4 x i2> %x, <4 x i2> %y) {
+   %result = shufflevector <4 x i2> %x, <4 x i2> %y, 
+                <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+   return <4 x i2> result
+}
+
+define <2 x i4> @hsimd_packh_8(<2 x i4> %x, <2 x i4> %y) {
+   %result = shufflevector <2 x i4> %x, <2 x i4> %y, 
+                <2 x i32> <i32 1, i32 3>
+   return <2 x i4> result
+}
+define <2 x i4> @hsimd_packl_8(<2 x i4> %x, <2 x i4> %y) {
+   %result = shufflevector <2 x i4> %x, <2 x i4> %y, 
+                <2 x i32> <i32 0, i32 2>
+   return <2 x i4> result
+}
+```
+
+## Byte Pack Implementation of the Parabix Transform
+
+Although SIMD units typically do not provide direct bit packing implementations,
+''byte packing'' operations that extract bytes from 16-bit fields are common.
+The `simd<16>::packh` and `simd<16>::packl` operations on 64-bit vectors
+arranged into 16-bit fields are illustrated by the following examples.
+
+| Expression | Field 1 | Field 2 | Field 3 | Field 4 |
+| ---------- | ------- | ------- | ------- | ------- |
+| s0 | 0x0123 | 0x4567 | 0x89AB | 0xCDEF |
+| s1 | 0xaabb | 0xccdd | 0xeeff | 0x0011 |
+| simd<16>::packh(s0, s1) | 0x0145 | 0x89CD | 0xaacc | 0xee00 ||
+| simd<16>::packl(s0, s1) | 0x2367 | 0xABEF| 0xbbdd | 0xff11 ||
+ 
+### Three-Stage Transposition Using Byte Packing ###
+
+The following library code of the Parabix system implements three-stage
+transposition using byte packing.
+
+```
+#define s2p_step(s0, s1, hi_mask, shift, p0, p1)  \
+  do {\
+	BitBlock t0,t1;\
+	t0 = hsimd<16>::packh(s0, s1);\
+	t1 = hsimd<16>::packl(s0, s1);\
+	p0 = simd<1>::ifh(hi_mask, t0, simd<16>::srli<shift>(t1));\
+	p1 = simd<1>::ifh(hi_mask, simd<16>::slli<shift>(t0), t1);\
+  } while(0)
+
+#define s2p_bytepack(s0, s1, s2, s3, s4, s5, s6, s7, p0, p1, p2, p3, p4, p5, p6, p7) \
+  do {\
+	BitBlock bit00224466_0,bit00224466_1,bit00224466_2,bit00224466_3;\
+	BitBlock bit11335577_0,bit11335577_1,bit11335577_2,bit11335577_3;\
+	BitBlock bit00004444_0,bit22226666_0,bit00004444_1,bit22226666_1;\
+	BitBlock bit11115555_0,bit33337777_0,bit11115555_1,bit33337777_1;\
+	s2p_step(s0,s1,simd<2>::himask(),1,bit00224466_0,bit11335577_0);\
+	s2p_step(s2,s3,simd<2>::himask(),1,bit00224466_1,bit11335577_1);\
+	s2p_step(s4,s5,simd<2>::himask(),1,bit00224466_2,bit11335577_2);\
+	s2p_step(s6,s7,simd<2>::himask(),1,bit00224466_3,bit11335577_3);\
+	s2p_step(bit00224466_0,bit00224466_1,simd<4>::himask(),2,bit00004444_0,bit22226666_0);\
+	s2p_step(bit00224466_2,bit00224466_3,simd<4>::himask(),2,bit00004444_1,bit22226666_1);\
+	s2p_step(bit11335577_0,bit11335577_1,simd<4>::himask(),2,bit11115555_0,bit33337777_0);\
+	s2p_step(bit11335577_2,bit11335577_3,simd<4>::himask(),2,bit11115555_1,bit33337777_1);\
+	s2p_step(bit00004444_0,bit00004444_1,simd<8>::himask(),4,p0,p4);\
+	s2p_step(bit11115555_0,bit11115555_1,simd<8>::himask(),4,p1,p5);\
+	s2p_step(bit22226666_0,bit22226666_1,simd<8>::himask(),4,p2,p6);\
+	s2p_step(bit33337777_0,bit33337777_1,simd<8>::himask(),4,p3,p7);\
+  } while(0)
+```
+
+
+### Byte Pack Using SSE2 `packuswb`
+
+For example the SSE2 instruction `packuswb` packs 16-byte integers into
+bytes using unsigned saturation.   By clearing the high 8 bits of the 16-bit
+field, this operation can be efficiently used to implement both the
+`hsimd<16>::packh` and `hsimd<16>::packl` operations.
+
+```
+template <>  bitblock_t hsimd<16>::packh(bitblock_t arg1, bitblock_t arg2)
+{
+  return _mm_packus_epi16(_mm_srli_epi16(arg2, 8), _mm_srli_epi16(arg1, 8));
+}
+
+template <> bitblock_t hsimd<16>::packl(bitblock_t arg1, bitblock_t arg2)
+{
+  const bitblock_t lomask = simd<16>::constant(0x00FF);       
+  return _mm_packus_epi16(simd_and(arg2, lomask), simd_and(arg1, lomask));
+}
+```
+
+### Byte Pack Using Byte Shuffling ###
+
+Byte pack operations are also conveniently modeled as byte shuffle operations.
+Thus, LLVM versions of pack are straightforward.  
+```
+define <8 x i16> @hsimd_packh_16(<8 x i16> %x, <8 x i16> %y) {
+   %result = shufflevector <8 x i16> %x, <8 x i16> %y, 
+                <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
+   return <8 x i16> result
+}
+define <8 x i16> @hsimd_packl_16(<8 x i16> %x, <8 x i16> %y) {
+   %result = shufflevector <8 x i16> %x, <8 x i16> %y, 
+                <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
+   return <8 x i16> result
+}
+```
+
+Given `shufflevector` operations satisfying the byte pack model,
+an LLVM code generator can conceivably produce the SSE2-based implementations
+shown previously, while an SSE3-based implementation might directly use
+the byte shuffle operation `pshufb`.
+