|
|
|
|
|
# Towards a Library of Shufflevector Patterns
|
|
|
|
|
|
The LLVM IR contains a very powerful vector operation called `shufflevector`. Unless the target architecture supports a similarly powerful operation directly, good code generation for shufflevector instructions with arbitrary shuffle patterns is very difficult.
|
|
|
|
|
|
However, there are many `shufflevector` patterns that map to simpler operations, both at the LLVM IR level and at the target architecture level. This document introduces a library of patterns as well as a systematic approach to code generation for `shufflevector` operations.
|
|
|
|
|
|
## Pattern Recognition
|
|
|
|
|
|
With a potentially large library of shuffle vector patterns and corresponding implementations, how do we systematically search for patterns without incurring high pattern matching costs for brute-force sequential search? One answer is to abstract the patterns and to define quick, parallel methods of pattern recognition.
|
|
|
|
|
|
### Shuffle Vector Canonical Form
|
|
|
|
|
|
In order to facilitate recognition of shuffle vector patterns, we first perform some transformations to put shuffle vectors in canonical form.
|
|
|
|
|
|
#### An Initial Example
|
|
|
|
|
|
For example, consider the following three shufflevector patterns are all equivalent.
|
|
|
|
|
|
```
|
|
|
shufflevector <4 x i32> zeroinitializer, <4 x i32> %x, <4 x i32> <i32 4, i32 2, i32 5, i32 1>
|
|
|
|
|
|
shufflevector <4 x i32> %x, <4 x i32> zeroinitializer, <4 x i32> <i32 0, i32 6, i32 1, i32 5>
|
|
|
|
|
|
shufflevector <4 x i32> %x, <4 x i32> zeroinitializer, <4 x i32> <i32 0, i32 4, i32 1, i32 4>
|
|
|
```
|
|
|
|
|
|
This example illustrates two points about equivalent forms. First, it is always possible to produce equivalent results by exchanging the order of the two input vectors and modifying the shuffle selection mask, accordingly. Second, when one of the input vectors is a vector of all zeroes (`zeroinitializer`), then any selection mask index in this vector is equivalent to any other.
|
|
|
|
|
|
This example also illustrates two rules for the conversion to canonical form in this case. As described below, if `zeroinitializer` occurs as an input vector, then it should normally occur as the second input vector. Secondly, the selection mask values for `zeroinitializer` in canonical form always refer to the lowest element. In the example, the selection mask values 5 and 6 are replaced by 4.
|
|
|
Canonical Selection from Constant Splats
|
|
|
|
|
|
In general, in discussing canonical forms, we refer to the three consecutive arguments of `shufflevector` as vector 1 or v1, vector 2 or v2 and mask.
|
|
|
|
|
|
A constant splat is a constant vector having the same value in every field. When used as a vector, `zeroinitializer`` is a constant splat of the value 0.
|
|
|
|
|
|
Whenever one of the input vectors is a constant splat vector, then the canonical selection mask values for selecting a value from this vector must always specify selection from the first field in the vector. That is, if the size of the input vectors is n, and vector 2 is a constant splat, then the index n is the canonical index for selection from this vector. Conversion to constant indexes in this case is illustrated in the second step of the example above. If vector 1 is a constant splat, then the canonical index for selection from this vector is 0.
|
|
|
|
|
|
#### Canonical Order of Vectors
|
|
|
|
|
|
Consider a `shufflevector` invocation of the following form.
|
|
|
|
|
|
```
|
|
|
shufflevector <n x <ty>> v1, <n x <ty>> v2, <m x i32> mask
|
|
|
```
|
|
|
|
|
|
Given such a `shufflevector` is always possible to reverse the order of the vectors with a suitable transformation of the mask.
|
|
|
|
|
|
```
|
|
|
shufflevector <n x <ty>> v2, <n x <ty>> v1, <m x i32> mask'
|
|
|
```
|
|
|
|
|
|
The mask transformation replaces each mask element mask[i] with the value which is either mask[i] + n, if mask[i] < n, or mask[i] - n, if mask[i] >= n. This may be expressed by the following equation.
|
|
|
|
|
|
```
|
|
|
mask'[i] = (mask[i] + n) mod (2n)
|
|
|
```
|
|
|
|
|
|
The following rules define canonical order for the vector arguments.
|
|
|
|
|
|
If one of the vectors is `undef`, then vector 2 must be `undef` in canonical order.
|
|
|
If one of the vectors is `zeroinitializer `(or the equivalent constant), and the other vector is not undef, then vector 2 must be `zeroinitializer` in canonical order.
|
|
|
If one of the vectors is a constant vector and the other vector is neither constant nor undef, then vector 2 must be a constant vector in canonical order.
|
|
|
If neither vector is a constant splat or undef, then the first selector value must be < n, (i.e., selecting from vector 1).
|
|
|
|
|
|
#### Canonical Vector Size
|
|
|
|
|
|
The canonical size for vectors is the nearest power of 2 greater than or equal to the vector size.
|
|
|
|
|
|
Consider a `shufflevector` invocation of the following form where n is not a power of 2.
|
|
|
|
|
|
```
|
|
|
shufflevector <n x <ty>> v1, <n x <ty>> v2, <m x i32> mask
|
|
|
````
|
|
|
|
|
|
We convert to the form
|
|
|
|
|
|
```
|
|
|
shufflevector <n' x <ty>> v1, <n' x <ty>> v2, <m x i32> mask'
|
|
|
```
|
|
|
|
|
|
where n' is the nearest power of 2 at least equal to n, and
|
|
|
|
|
|
```
|
|
|
mask'[i] = (mask[i] mod n + (mask[i] div n)) * n'
|
|
|
```
|
|
|
|
|
|
### Lane Detection in Patterns
|
|
|
|
|
|
Lane detection determines whether the `shufflevector` operation is a natural SIMD operation, i.e., whether it may be viewed as a set of parallel operations in 2n lanes for some value of n.
|
|
|
|
|
|
Lane detection is a recursive binary process. The first step is to determine if the overall operation may be viewed as two separate instances of the same operation applied independently in two lanes. If so, then the operation within the lanes are analyzed to determine whether a further division by 2 is possible. When no further binary division into lanes is possible, the process terminates.
|
|
|
|
|
|
To determine whether a shuffle vector pattern may be divided into lanes, the following algorithm is applied.
|
|
|
|
|
|
Given a shuffle vector mask of 2n elements labelled m0, ... m2n-1, form the two submask vectors m0, m2n-1-1 and m2n-1, ... m2n-1.
|
|
|
Compare the two vectors, element by element. The mask is ruled a laned operation if the following conditions hold in each case.
|
|
|
The value of m2n-1+i - mi = 2n-1, or
|
|
|
The value of m2n-1+i = mi = 2n, and vector 2 is a constant splat.
|
|
|
|
|
|
### Useful Functions
|
|
|
|
|
|
```
|
|
|
def pattern_diff(P, m, k):
|
|
|
return [P[i + k] - P[i] for i in range(0, m-k)]
|
|
|
|
|
|
def pattern_signature(P):
|
|
|
return (P[0], pattern_diff(P, len(P), 1))
|
|
|
```
|
|
|
|
|
|
### Standard Pattern Examples
|
|
|
|
|
|
In the following examples, the patterns are defined assuming that the shufflevector arguments have been first placed in canonical form as described above and lanes have been determined.
|
|
|
|
|
|
#### The Rotate Pattern
|
|
|
|
|
|
Rotations may easily be identified as patterns which have a pattern unit difference of 1 in all cases, except a single instance of a 1-n difference, where n is the vector (or double vector) size.
|
|
|
|
|
|
```
|
|
|
>>> pattern_diff([1,2,3,4,5,6,7,0], 8,1)
|
|
|
[1, 1, 1, 1, 1, 1, -7]
|
|
|
>>> pattern_diff([2,3,4,5,6,7,0,1], 8,1)
|
|
|
[1, 1, 1, 1, 1, -7, 1]
|
|
|
>>> pattern_diff([3,4,5,6,7,0,1,2], 8,1)
|
|
|
[1, 1, 1, 1, -7, 1, 1]
|
|
|
```
|
|
|
|
|
|
A rotation may occur within lanes. For example, the pattern [1,2,3,0,5,6,7,4] is a rotate right 1 within each of 2 lanes. The lane detection is accomplished by noting that the n/2 pattern diff is a constant vector of n/2 values. Then the rotation pattern is exposed by examining a single lane.
|
|
|
|
|
|
```
|
|
|
>>> pattern_diff([1,2,3,0,5,6,7,4], 8, 4)
|
|
|
[4, 4, 4, 4]
|
|
|
>>> pattern_diff([1,2,3,0,5,6,7,4], 4, 1)
|
|
|
[1, 1, -3]
|
|
|
>>>
|
|
|
```
|
|
|
|
|
|
#### The Shift Left Pattern
|
|
|
|
|
|
Shift left patterns require that the second `shufflevector` argument be `zeroinitializer` so that 0s can be inserted on the right. Note the little-endianness issue.
|
|
|
|
|
|
```
|
|
|
>>> pattern_signature([8,8,1,2,3,4,5,6])
|
|
|
(8, [0, -7, 1, 1, 1, 1, 1])
|
|
|
>>> pattern_signature([8,8,8,1,2,3,4,5])
|
|
|
(8, [0, 0, -7, 1, 1, 1, 1])
|
|
|
```
|
|
|
|
|
|
#### The Shift Right Logical Pattern
|
|
|
|
|
|
Shift right logical patterns require that the second `shufflevector` argument be `zeroinitializer` so that 0s can be inserted on the left. Note little-endianness. The shift constant is conveniently the value of the first element times the field width.
|
|
|
|
|
|
```
|
|
|
(2, [1, 1, 1, 1, 1, 1, 0])
|
|
|
>>> pattern_signature([1,2,3,4,5,6,7,8])
|
|
|
(1, [1, 1, 1, 1, 1, 1, 1])
|
|
|
>>> pattern_signature([0,1,2,3,4,5,6,7])
|
|
|
(0, [1, 1, 1, 1, 1, 1, 1])
|
|
|
>>> pattern_signature([5,6,7,8,8,8,8,8])
|
|
|
(5, [1, 1, 1, 0, 0, 0, 0])
|
|
|
```
|
|
|
|
|
|
#### The Merge Pattern
|
|
|
|
|
|
Merge patterns have alternation n and 1-n values.
|
|
|
|
|
|
```
|
|
|
>>> pattern_signature([0,8,1,9,2,10,3,11])
|
|
|
(0, [8, -7, 8, -7, 8, -7, 8])
|
|
|
```
|
|
|
|
|
|
#### The Pack Pattern
|
|
|
|
|
|
Pack patterns are characterized by a constant pattern difference of 2.
|
|
|
|
|
|
```
|
|
|
>>> pattern_signature([0,2,4,6,8,10,12,14])
|
|
|
(0, [2, 2, 2, 2, 2, 2, 2])
|
|
|
>>> pattern_signature([1,3,5,7,9,11,13,15])
|
|
|
(1, [2, 2, 2, 2, 2, 2, 2])
|
|
|
```
|
|
|
|
|
|
#### The Blend Pattern
|
|
|
|
|
|
Blend patterns specify that each selected element comes from the corresponding position of one of the two input vectors. The following example illustrates.
|
|
|
|
|
|
```
|
|
|
%x = shufflevector <4 x i32> <i32 20, i32 21, i32 22, i32 23>,
|
|
|
<4 x i32> <i32 90, i32 91, i32 92, i32 93>,
|
|
|
<4 x i32> <i32 0, i32 5, i32 6, i32 3>
|
|
|
# yields
|
|
|
# %x = <4 x i32> <i32 20, i32 91, i32 92, i32 23>
|
|
|
```
|
|
|
|
|
|
That is, element i of the pattern is either i or n + i, where n is the number of elements in each of the input vectors.
|
|
|
|
|
|
#### The Parallel Extract Pattern
|
|
|
|
|
|
The Parallel Extract Pattern is essentially a vector filter operation, which preserves some elements from a given input vector, deleting others. Argument #2 must be `zeroinitializer`.
|
|
|
|
|
|
The pattern must consist of a set of increasing selectors from the first vector, after which all remaining selectors are zero selectors (i.e., selecting from `zeroinitializer`.
|
|
|
|
|
|
In the case of i1 vectors, this corresponds to the parallel bit extract of the BitShuffle project.
|
|
|
|
|
|
#### The Parallel Deposit Pattern
|
|
|
|
|
|
The Parallel Deposit Pattern is essentially a vector expand operation, which preserves some elements from a given input vector, inserting zeroes inbetween. Argument #2 must be `zeroinitializer`.
|
|
|
|
|
|
The pattern must consist of a set of consecutive selectors from the first vector (starting with 0), interspersed with zero selectors (i.e., selecting from `zeroinitializer`.
|
|
|
|
|
|
In the case of i1 vectors, this corresponds to the parallel bit deposit of the BitShuffle project. |