Skip to content

GitLab

  • Projects
  • Groups
  • Snippets
  • Help
    • Loading...
  • Help
    • Help
    • Support
    • Community forum
    • Submit feedback
    • Contribute to GitLab
  • Sign in
P parabix-devel
  • Project overview
    • Project overview
    • Details
    • Activity
    • Releases
  • Repository
    • Repository
    • Files
    • Commits
    • Branches
    • Tags
    • Contributors
    • Graph
    • Compare
  • Issues 9
    • Issues 9
    • List
    • Boards
    • Labels
    • Service Desk
    • Milestones
  • Merge requests 2
    • Merge requests 2
  • CI/CD
    • CI/CD
    • Pipelines
    • Jobs
    • Schedules
  • Operations
    • Operations
    • Incidents
    • Environments
  • Packages & Registries
    • Packages & Registries
    • Container Registry
  • Analytics
    • Analytics
    • CI/CD
    • Repository
    • Value Stream
  • Wiki
    • Wiki
  • Snippets
    • Snippets
  • Members
    • Members
  • Activity
  • Graph
  • Create a new issue
  • Jobs
  • Commits
  • Issue Boards
Collapse sidebar
  • cameron
  • parabix-devel
  • Wiki
  • ParabixTransform

ParabixTransform · Changes

Page history
Create ParabixTransform authored Mar 15, 2020 by cameron's avatar cameron
Hide whitespace changes
Inline Side-by-side
Showing with 268 additions and 0 deletions
+268 -0
  • ParabixTransform.md ParabixTransform.md +268 -0
  • No files found.
ParabixTransform.md 0 → 100644
View page @ fa2bfbd2
# The Parabix Transform
The Parabix Transform transposes ordinary character stream data into sets of bit-parallel data streams, each stream containing one bit per original character code unit position.
Most often, byte-oriented streams of data in representations such as ASCII, ISO Latin 1 or UTF-8 are transposed into sets of 8 parallel bit streams, one stream for each bit of the corresponding code units. However, the concept also applies for character data streams
using larger code units, such as those using the 16-bit code units of UTF-16 or the 32-bit code units of UTF-32. In each case, the transposed representation consists of one bit stream each for each bit position in the code units. In the case of UTF-32, however, implementations typically produce only 21 bit streams, as the high 11 bits of each 32-bit code unit are always zero.
Depending on the available processor architecture, there are many ways that the Parabix
Transform may be implemented. Although the transformation can be performed in a serial
fashion extracting one bit at a time from each byte, the overhead of the serial approach greatly limits its usefulness. In this section, we concentrate instead on parallel methods that impose a relatively small overhead on the processing of character data streams.
## Ideal Three-Stage Parallel Transposition for Byte Streams
The Parabix Transform to transform byte-oriented character stream data to bit-parallel data streams
can be implemented in a parallel fashion using the following
three-stage transposition strategy.
1. Transform the byte stream data into two parallel nybble streams, one for the high nybble of each byte, and one for the low nybble of each byte. (A nybble is one-half of a byte, i.e., 4 bits).
2. Transform each of the nybble streams into two parallel bit-pair streams. Each bit-pair stream consists of a stream of 2-bit units in one-to-one correspondence with the input byte stream. The bit-pairs in the first such stream consist of bits 0 and 1 of each byte, the bit-pairs in the second stream consist of bits 2 and 3 of each byte, the bit-pairs in the third stream consist of bits 4 and 5 of each byte and the bit-pairs in the fourth and final bit-pair stream consist of the bits 6 and 7 of each byte.
3. Transform each of the bit-pair streams into two individual bit streams.
### Ideal Implementation Using SIMD Packing
Given a processor architecture that supports binary SIMD operations on 2^k^-bit
registers, an ideal implementation of the Parabix Transposition can be based
on the availability of a family of horizontal packing operations. The operations
required each have the IDISA pattern `hsimd<{2,4,8}>::pack{h,l}(e1, e2)`,
in which `e1` and `e2` are 2^k^-bit input registers,
the field width of the fields processed is either 2, 4, or 8, and
the packing operation selects the bits comprising either the high (`h`) or low (`l`) half of each field. The result in each case is a single 2^k^-bit value
comprising the packed bits that are selected. For example,
`hsimd<2>::packh(e1, e2)` selects the high bit of each 2-bit field
in the concatenation of `e1` and `e2`, returning the packed set of
2^k^ bits as a single 2^k^-bit value.
The following example illustrates this operation working with 16-bit (k=4) registers.
| Expression | Field 1 | Field 2 |
| ---------- | ------- | ------- |
|`e1`|`AaBbCcDd`|`EeFfGgHh`|
|`e2`|`JjKkLlMm`|`NnPpQqRr`|
|`hsimd<2>::packh(e1, e2)`|`ABCDEFGH`|`JKLMNPQR`|
Similarly, `hsimd<8>::packl(e1, e2)`
selects the low 4-bits of each 8-bit field in the concatenation of `e1` and `e2`,
again returning the result as a single 2^k^-bit value, as illustrated by the following example.
| Expression | Field 1 | Field 2 |
| ---------- | ------- | ------- |
|`e1`|`AaBbCcDd`|`EeFfGgHh`|
|`e2`|`JjKkLlMm`|`NnPpQqRr`|
|`hsimd<8>::packl(e1, e2)`|`CdDdGgHh`|`LlMmQqRr`|
Using these operations it is possible to implement the ideal transposition
strategy in a
straightforward fashion. Given a 2^k^-byte sequence held consecutively
in 8 registers `s0`, `s1`, … `s7`, the following
3-step transformation process performs transposition to parallel bit streams.
```
// Transpose to 2 nybble streams
lo_nybble0 = hsimd::packl(s0, s1);
lo_nybble1 = hsimd::packl(s2, s3);
lo_nybble3 = hsimd::packl(s4, s5);
lo_nybble4 = hsimd::packl(s6, s7);
hi_nybble0 = hsimd::packh(s0, s1);
hi_nybble1 = hsimd::packh(s2, s3);
hi_nybble3 = hsimd::packh(s4, s5);
hi_nybble4 = hsimd::packh(s6, s7);
// Transpose 2 nybble streams to 4 bit-pair streams.
bit01pair_0 = hsimd::packl(lo_nybble0, lo_nybble1);
bit01pair_1 = hsimd::packl(lo_nybble2, lo_nybble3);
bit23pair_0 = hsimd::packh(lo_nybble0, lo_nybble1);
bit23pair_1 = hsimd::packh(lo_nybble2, lo_nybble3);
bit45pair_0 = hsimd::packl(hi_nybble0, hi_nybble1);
bit45pair_1 = hsimd::packl(hi_nybble2, hi_nybble3);
bit67pair_0 = hsimd::packh(hi_nybble0, hi_nybble1);
bit67pair_1 = hsimd::packh(hi_nybble2, hi_nybble3);
// Transpose 4 bit-pairs streams to 8 bit streams.
bit0 = hsimd::packl(bit01pair_0, bit01pair_1);
bit1 = hsimd::packh(bit01pair_0, bit01pair_1);
bit2 = hsimd::packl(bit23pair_0, bit23pair_1);
bit3 = hsimd::packh(bit23pair_0, bit23pair_1);
bit4 = hsimd::packl(bit45pair_0, bit45pair_1);
bit5 = hsimd::packh(bit45pair_0, bit45pair_1);
bit6 = hsimd::packl(bit67pair_0, bit67pair_1);
bit7 = hsimd::packh(bit67pair_0, bit67pair_1);
```
Overall, transposition requires 8 pack operations for each of the three transposition steps, for a total
of 24 operations for the entire process.
### Implementation of Ideal Packing Using Bit Shuffles
Unfortunately, the `hsimd<{2,4,8}>::pack{h,l}(e1, e2)` family of operations for
ideal transposition are not to be found within the SIMD instruction set of current
commodity processors. However, if ''bit shuffle'' operations are to be found instead,
then the pack operations can be expressed directly in terms of equivalent
bit shuffles. Bit shuffles can be found both at the intermediate representation
level with the LLVM `shufflevector` operation and at the processor ISA level
in terms of the Haswell new instruction `pext`.
The LLVM `shufflevector` operation allows a result vector to be populated
by directly selecting elements from a concatenated pair of input vectors.
A constant vector of `i32` selectors lets each vector element be selected
from any of the positions within either of the two input vectors. For example,
working with 8-bit input vectors (for simplicity of the example), the
`hsimd<2>::pack{h,l}(e1, e2)` operations may be translated directly into
`shufflevector` operations.
```
define <8 x i1> @hsimd_packh_2(<8 x i1> %x, <8 x i1> %y) {
%result = shufflevector <8 x i1> %x, <8 x i1> %y,
<8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
return <8 x i1> result
}
define <8 x i1> @hsimd_packl_2(<8 x i1> %x, <8 x i1> %y) {
%result = shufflevector <8 x i1> %x, <8 x i1> %y,
<8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
return <8 x i1> result
}
```
Similarly, it is straightforward to define the additional `hsimd<{4,8}>::pack{h,l}(e1, e2)`
operations on 8-bit registers as follows.
```
define <8 x i1> @hsimd_packh_4(<8 x i1> %x, <8 x i1> %y) {
%result = shufflevector <8 x i1> %x, <8 x i1> %y,
<8 x i32> <i32 2, i32 3, i32 6, i32 7, i32 10, i32 11, i32 14, i32 15>
return <8 x i1> result
}
define <8 x i1> @hsimd_packl_4(<8 x i1> %x, <8 x i1> %y) {
%result = shufflevector <8 x i1> %x, <8 x i1> %y,
<8 x i32> <i32 0, i32 1, i32 4, i32 5, i32 8, i32 9, i32 12, i32 13>
return <8 x i1> result
}
```
```
define <8 x i1> @hsimd_packh_8(<8 x i1> %x, <8 x i1> %y) {
%result = shufflevector <8 x i1> %x, <8 x i1> %y,
<8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 12, i32 13, i32 14, i32 15>
return <8 x i1> result
}
define <8 x i1> @hsimd_packl_8(<8 x i1> %x, <8 x i1> %y) {
%result = shufflevector <8 x i1> %x, <8 x i1> %y,
<8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
return <8 x i1> result
}
```
Alternatively, we may implement `hsimd<{4,8}>::pack{h,l}(e1, e2)` using
`shufflevector` with `i2` and `i4` vectors.
```
define <4 x i2> @hsimd_packh_4(<4 x i2> %x, <4 x i2> %y) {
%result = shufflevector <4 x i2> %x, <4 x i2> %y,
<4 x i32> <i32 1, i32 3, i32 5, i32 7>
return <4 x i2> result
}
define <4 x i2> @hsimd_packl_4(<4 x i2> %x, <4 x i2> %y) {
%result = shufflevector <4 x i2> %x, <4 x i2> %y,
<4 x i32> <i32 0, i32 2, i32 4, i32 6>
return <4 x i2> result
}
define <2 x i4> @hsimd_packh_8(<2 x i4> %x, <2 x i4> %y) {
%result = shufflevector <2 x i4> %x, <2 x i4> %y,
<2 x i32> <i32 1, i32 3>
return <2 x i4> result
}
define <2 x i4> @hsimd_packl_8(<2 x i4> %x, <2 x i4> %y) {
%result = shufflevector <2 x i4> %x, <2 x i4> %y,
<2 x i32> <i32 0, i32 2>
return <2 x i4> result
}
```
## Byte Pack Implementation of the Parabix Transform
Although SIMD units typically do not provide direct bit packing implementations,
''byte packing'' operations that extract bytes from 16-bit fields are common.
The `simd<16>::packh` and `simd<16>::packl` operations on 64-bit vectors
arranged into 16-bit fields are illustrated by the following examples.
| Expression | Field 1 | Field 2 | Field 3 | Field 4 |
| ---------- | ------- | ------- | ------- | ------- |
| s0 | 0x0123 | 0x4567 | 0x89AB | 0xCDEF |
| s1 | 0xaabb | 0xccdd | 0xeeff | 0x0011 |
| simd<16>::packh(s0, s1) | 0x0145 | 0x89CD | 0xaacc | 0xee00 ||
| simd<16>::packl(s0, s1) | 0x2367 | 0xABEF| 0xbbdd | 0xff11 ||
### Three-Stage Transposition Using Byte Packing ###
The following library code of the Parabix system implements three-stage
transposition using byte packing.
```
#define s2p_step(s0, s1, hi_mask, shift, p0, p1) \
do {\
BitBlock t0,t1;\
t0 = hsimd<16>::packh(s0, s1);\
t1 = hsimd<16>::packl(s0, s1);\
p0 = simd<1>::ifh(hi_mask, t0, simd<16>::srli<shift>(t1));\
p1 = simd<1>::ifh(hi_mask, simd<16>::slli<shift>(t0), t1);\
} while(0)
#define s2p_bytepack(s0, s1, s2, s3, s4, s5, s6, s7, p0, p1, p2, p3, p4, p5, p6, p7) \
do {\
BitBlock bit00224466_0,bit00224466_1,bit00224466_2,bit00224466_3;\
BitBlock bit11335577_0,bit11335577_1,bit11335577_2,bit11335577_3;\
BitBlock bit00004444_0,bit22226666_0,bit00004444_1,bit22226666_1;\
BitBlock bit11115555_0,bit33337777_0,bit11115555_1,bit33337777_1;\
s2p_step(s0,s1,simd<2>::himask(),1,bit00224466_0,bit11335577_0);\
s2p_step(s2,s3,simd<2>::himask(),1,bit00224466_1,bit11335577_1);\
s2p_step(s4,s5,simd<2>::himask(),1,bit00224466_2,bit11335577_2);\
s2p_step(s6,s7,simd<2>::himask(),1,bit00224466_3,bit11335577_3);\
s2p_step(bit00224466_0,bit00224466_1,simd<4>::himask(),2,bit00004444_0,bit22226666_0);\
s2p_step(bit00224466_2,bit00224466_3,simd<4>::himask(),2,bit00004444_1,bit22226666_1);\
s2p_step(bit11335577_0,bit11335577_1,simd<4>::himask(),2,bit11115555_0,bit33337777_0);\
s2p_step(bit11335577_2,bit11335577_3,simd<4>::himask(),2,bit11115555_1,bit33337777_1);\
s2p_step(bit00004444_0,bit00004444_1,simd<8>::himask(),4,p0,p4);\
s2p_step(bit11115555_0,bit11115555_1,simd<8>::himask(),4,p1,p5);\
s2p_step(bit22226666_0,bit22226666_1,simd<8>::himask(),4,p2,p6);\
s2p_step(bit33337777_0,bit33337777_1,simd<8>::himask(),4,p3,p7);\
} while(0)
```
### Byte Pack Using SSE2 `packuswb`
For example the SSE2 instruction `packuswb` packs 16-byte integers into
bytes using unsigned saturation. By clearing the high 8 bits of the 16-bit
field, this operation can be efficiently used to implement both the
`hsimd<16>::packh` and `hsimd<16>::packl` operations.
```
template <> bitblock_t hsimd<16>::packh(bitblock_t arg1, bitblock_t arg2)
{
return _mm_packus_epi16(_mm_srli_epi16(arg2, 8), _mm_srli_epi16(arg1, 8));
}
template <> bitblock_t hsimd<16>::packl(bitblock_t arg1, bitblock_t arg2)
{
const bitblock_t lomask = simd<16>::constant(0x00FF);
return _mm_packus_epi16(simd_and(arg2, lomask), simd_and(arg1, lomask));
}
```
### Byte Pack Using Byte Shuffling ###
Byte pack operations are also conveniently modeled as byte shuffle operations.
Thus, LLVM versions of pack are straightforward.
```
define <8 x i16> @hsimd_packh_16(<8 x i16> %x, <8 x i16> %y) {
%result = shufflevector <8 x i16> %x, <8 x i16> %y,
<8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
return <8 x i16> result
}
define <8 x i16> @hsimd_packl_16(<8 x i16> %x, <8 x i16> %y) {
%result = shufflevector <8 x i16> %x, <8 x i16> %y,
<8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
return <8 x i16> result
}
```
Given `shufflevector` operations satisfying the byte pack model,
an LLVM code generator can conceivably produce the SSE2-based implementations
shown previously, while an SSE3-based implementation might directly use
the byte shuffle operation `pshufb`.
Clone repository
  • Bracket Matching
  • CSV Validation
  • CSVediting
  • CSVparsing
  • Character Code Compilers
  • KernelLibrary
  • Pablo
  • ParabixTransform
  • Parallel Deletion
  • Parallel Hashing
  • Performance Testing Script
  • Shuffle Pattern Library
  • StaticCCC
  • String Insertion
  • UCD: Unicode Property Database and Compilers
View All Pages