String Insertion
The string insertion problem is the problem of inserting a fixed string at all selected locations in a text stream.
Consider the string insertion problem to insert the string pre
into a data
stream according to a stream marking insert positions with one bits.
For example, consider the data stream below together with a mask stream marking
insertion positions.
data: He pared the apple and then added selected spices.
mask: ...1..............................1...............
The stream resulting from insertion of the fixed string pre
before the marked positions is:
result: He prepared the apple and then added preselected spices.
Insertion Method Overview
To solve the string insertion problem, using Parabix methods, we use a process consisting of the following steps.
- Transpose the data stream to Parabix form.
- Insert zeroes into the data stream to make room for inserted data.
- Create a repeating stream set with fixed string to be inserted.
- Spread the repeating stream data to align with the inserted zeroes.
- Merge the stream sets together.
- Perform inverse Parabix transposition to produce the final output.
Insertion of Zeroes
For insertion of the string pre
in marked positions in the file, we
first need to create a string insertion bixnum that shows the number of
zeroes to insert at each position. A bixnum is a set of bit streams giving
the binary representation of the number of bits to insert. In the case
of insertion of pre
, the number of bits to insert is 3 and the binary
representation is 11
. So our bixnum B for insertion is:
data: He pared the apple and then added selected spices.
B[1] ...1..............................1...............
B[0] ...1..............................1...............
Note that the number of bits to be inserted at each position is 0 for all positions except those in the mask.
The bixnum B is created as follows:
std::vector<unsigned> insertAmts = {3};
StreamSet * B = E->CreateStreamSet(2, 1);
E->CreateKernelCall<ZeroInsertBixNum>(insertAmts, mask, B);
Using this bixnum, we can now create a spread mask for spreading out the data.
StreamSet * SpreadMask = InsertionSpreadMask(E, InsertBixNum, InsertPosition::Before);
The resulting SpreadMask is
spread: 111...11111111111111111111111111111111...111111111111111
The SpreadByMask operation can then use this mask to spread out the original basis data to produce a new expanded stream set with the zeroes inserted.
StreamSet * ExpandedBasis = E->CreateStreamSet(8);
SpreadByMask(E, SpreadMask, BasisBits, ExpandedBasis);
Repeating Stream Set Creation.
std::vector<uint64_t> insertBytes;
for (auto ch : "pre") {
insertBytes.push_back(static_cast<uint64_t>(ch));
}
StreamSet * RepeatingBasis = CreateRepeatingBixNum(E, 8, insertBytes);
This will create an infinite set of 8 bit streams with the transposed representation of "pre" repeating indefinitely.
The code for CreateRepeatingBixNum
can be found in the csv2json.cpp
file.
Spreading the Repeated Stream and Merging
Once we have the repeated stream, we can use SpreadByMask to spread
it out to positions associated with the inserted zeroes in our ExpandedBasis
.
In this case the spread mask used must be the inverse of the spread mask
used to spread the data stream. That is:
invmsk: ...111................................111...............
After spreading out the repeated pre
bits, we can then use simple
bitwise logic to merge this data with ExpandedBasis
.
The result will have the correct insertions of the string performed in the resulting transposed data stream.
Using the P2S_Kernel
, we can transform this back to a byte stream and generate it to stdout.