CSV Editing
Having successfully parsed a CSV file, let's now consider how to edit it.
Deleting a Column
One of the basic editing operations that we might want to support is deleting a column from all records in a file.
Suppose we want to delete the second column in every row of the following CSV data.
Data stream:       Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Field_separators:  .........1....1.........1...1........1...............1
Record_separators: ........................1............................1
Note that Field_separators marks the line ends as well as the commas, so every field is followed by a marked position.
The Parabix FilterByMask operation can do this for us, if we set up a mask stream that selects all of the data except the second column and its preceding comma.
Data stream: Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
To keep:     111111111.....11111111111111.........11111111111111111
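As a rough illustration of what the filtering step produces, here is a minimal stand-alone C++ sketch that applies the mask above to the example data one byte at a time. It only models the result: filterByMask, exampleData and exampleKeep are hypothetical names used for this sketch, not the Parabix API, which operates on parallel bit streams. The line end ⏎ is written as \n.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical scalar model of filtering: keep data[i] exactly when mask[i] == '1'.
// The mask comes first, mirroring the argument order used in the text.
std::string filterByMask(const std::string& mask, const std::string& data) {
    assert(mask.size() == data.size());
    std::string out;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (mask[i] == '1') out.push_back(data[i]);
    }
    return out;
}

// The example data and the "To keep" mask from the text ('\n' stands in for the line end).
const std::string exampleData = "Henderson,Paul,ph@sfu.ca\nLin,Qingshan,1234@zju.edu.cn\n";
const std::string exampleKeep = "111111111.....11111111111111.........11111111111111111";
```

Calling filterByMask(exampleKeep, exampleData) yields the two records without their second column: Henderson,ph@sfu.ca and Lin,1234@zju.edu.cn, each still followed by its line end.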
How do we calculate this mask? With the following set of operations using a PabloBuilder pb.
PabloAST * F1start = pb.createNot(pb.createAdvance(pb.createNot(Record_separators), 1));
PabloAST * F1follow = pb.createScanTo(F1start, Field_separators);
PabloAST * F2start = pb.createAdvance(F1follow, 1);
PabloAST * F2follow = pb.createScanTo(F2start, Field_separators);
PabloAST * toDelete = pb.createIntrinsicCall(pablo::Intrinsic::SpanUpTo, {F1follow, F2follow});
PabloAST * toKeep = pb.createNot(toDelete);
Data stream: Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
F1start:     1........................1............................
F1follow:    .........1..................1.........................
F2start:     ..........1..................1........................
F2follow:    ..............1......................1................
toDelete:    .........11111..............111111111.................
toKeep:      111111111.....11111111111111.........11111111111111111
A Pablo kernel that computes this mask can be defined as follows.
MaskOutField2::MaskOutField2(BuilderRef b, StreamSet * Record_separators,
StreamSet * Field_separators,
StreamSet * toKeep)
: PabloKernel(b, "MaskOutField2",
{Binding{"Record_separators", Record_separators},
Binding{"Field_separators", Field_separators}},
{Binding{"toKeep", toKeep}}) {}
void MaskOutField2::generatePabloMethod() {
PabloBuilder pb(getEntryScope());
Var * Record_separators = pb.createExtract(getInputStreamVar("Record_separators"), pb.getInteger(0));
Var * Field_separators = pb.createExtract(getInputStreamVar("Field_separators"), pb.getInteger(0));
PabloAST * F1start = pb.createNot(pb.createAdvance(pb.createNot(Record_separators), 1));
PabloAST * F1follow = pb.createScanTo(F1start, Field_separators);
PabloAST * F2start = pb.createAdvance(F1follow, 1);
PabloAST * F2follow = pb.createScanTo(F2start, Field_separators);
PabloAST * toDelete = pb.createIntrinsicCall(pablo::Intrinsic::SpanUpTo, {F1follow, F2follow});
PabloAST * toKeep = pb.createNot(toDelete);
pb.createAssign(pb.createExtract(getOutputStreamVar("toKeep"), pb.getInteger(0)), pb.createInFile(toKeep));
}
Of course, a slightly different kernel is needed to mask out a column other than the second one. This is best written as a more generic kernel that takes a columnNo parameter and performs the necessary number of ScanTo and Advance operations.
Because the generated logic differs for each columnNo, the name of the kernel should also be different for each columnNo.
MaskOutField::MaskOutField(BuilderRef b, StreamSet * Record_separators,
StreamSet * Field_separators,
StreamSet * toKeep,
unsigned columnNo)
: PabloKernel(b, "MaskOutField" + std::to_string(columnNo),
{Binding{"Record_separators", Record_separators},
Binding{"Field_separators", Field_separators}},
{Binding{"toKeep", toKeep}}) {}
Finally, the first column must be handled differently. In this case, there is no preceding comma, so the mask should zero out the following comma rather than the preceding comma.
Matching and Deleting a Row
Suppose that we have a regular expression R to select CSV rows for deletion, where R has no Unicode properties or other features.
The Parabix regular expression engine can be used to perform the matching and FilterByMask can be used for the deletion.
Using the ICGrepKernel to perform the matching may be implemented as follows.
auto options = std::make_unique<GrepKernelOptions>(&cc::UTF8);
options->setSource(BasisBits);
StreamSet * MatchResults = P->CreateStreamSet(1, 1);
options->setResults(MatchResults);
options->setRE(R);
P->CreateKernelCall<ICGrepKernel>(std::move(options));
The resulting MatchResults stream will have 1 bits on any matching CSV row.
To select the row, the next task is to move the matches to the line end position, assuming that the line ends are given by mLineBreakStream.
StreamSet * const MovedMatches = P->CreateStreamSet();
P->CreateKernelCall<MatchedLinesKernel>(MatchResults, mLineBreakStream, MovedMatches);
We can next get a stream that is indexed by line number (1 bit per CSV row).
StreamSet * MatchesByLine = P->CreateStreamSet(1, 1);
FilterByMask(P, mLineBreakStream, MovedMatches, MatchesByLine);
LineStarts can then be identified as the positions immediately after a line break or at the beginning of the file. These are computed by the LineStartsKernel.
StreamSet * LineStarts = P->CreateStreamSet(1, 1);
P->CreateKernelCall<LineStartsKernel>(mLineBreakStream, LineStarts);
The starts of the matched lines are now computed by a SpreadByMask, which distributes the per-line match bits to the line start positions.
StreamSet * MatchedLineStarts = P->CreateStreamSet(1, 1);
SpreadByMask(P, LineStarts, MatchesByLine, MatchedLineStarts);
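Now a mask for each entire matched row can be computed using the LineSpansKernel, which takes the matched line starts together with the matched line ends. The matched line ends are the positions marked in MovedMatches.
StreamSet * MatchedLineSpans = P->CreateStreamSet(1, 1);
P->CreateKernelCall<LineSpansKernel>(MatchedLineStarts, MovedMatches, MatchedLineSpans);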
If FilterByMask were applied with this mask, the result would be the matched rows. To delete the matched rows instead, MatchedLineSpans must first be negated with a Pablo createNot operation, and the negated stream used as the filter mask.
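The whole row-deletion pipeline can be traced with a small stand-alone C++ model in which each stream is a string of '1' and '.' characters. This is only a sketch under several assumptions: the helper functions are hypothetical scalar stand-ins for the Parabix kernels, MatchResults is taken as given (one or more 1 bits somewhere inside each matching row), every row ends with a line break, and the line span is assumed to include the line break so that the whole row disappears.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

using Bits = std::string;  // one char per position: '1' = set, '.' = clear

Bits bitNot(const Bits& a) {
    Bits out(a);
    for (char& c : out) c = (c == '1') ? '.' : '1';
    return out;
}

// Shift every bit forward by n positions.
Bits advance(const Bits& a, std::size_t n) {
    Bits out(a.size(), '.');
    for (std::size_t i = n; i < a.size(); ++i) out[i] = a[i - n];
    return out;
}

// Move each marker forward to the first target at or after it.
Bits scanTo(const Bits& markers, const Bits& targets) {
    assert(markers.size() == targets.size());
    Bits out(markers.size(), '.');
    bool pending = false;
    for (std::size_t i = 0; i < markers.size(); ++i) {
        if (markers[i] == '1') pending = true;
        if (pending && targets[i] == '1') { out[i] = '1'; pending = false; }
    }
    return out;
}

// Set every position from a start marker through the next end marker, inclusive.
Bits inclusiveSpan(const Bits& starts, const Bits& ends) {
    assert(starts.size() == ends.size());
    Bits out(starts.size(), '.');
    bool inside = false;
    for (std::size_t i = 0; i < starts.size(); ++i) {
        if (starts[i] == '1') inside = true;
        if (inside) out[i] = '1';
        if (ends[i] == '1') inside = false;
    }
    return out;
}

// Keep items[i] exactly when mask[i] is set (works for data bytes and for bit strings).
std::string filterByMask(const Bits& mask, const std::string& items) {
    assert(mask.size() == items.size());
    std::string out;
    for (std::size_t i = 0; i < items.size(); ++i)
        if (mask[i] == '1') out.push_back(items[i]);
    return out;
}

// Place the k-th bit of bits at the k-th set position of mask.
Bits spreadByMask(const Bits& mask, const Bits& bits) {
    Bits out(mask.size(), '.');
    std::size_t k = 0;
    for (std::size_t i = 0; i < mask.size(); ++i) {
        if (mask[i] != '1') continue;
        if (k < bits.size() && bits[k] == '1') out[i] = '1';
        ++k;
    }
    return out;
}

// Follows the steps in the text, then negates the spans and filters the data.
std::string deleteMatchedRows(const std::string& data, const Bits& matchResults, const Bits& lineBreaks) {
    Bits movedMatches  = scanTo(matchResults, lineBreaks);          // matches moved to line ends
    Bits matchesByLine = filterByMask(lineBreaks, movedMatches);    // one bit per row
    Bits lineStarts    = bitNot(advance(bitNot(lineBreaks), 1));    // after a break, or at file start
    Bits matchedStarts = spreadByMask(lineStarts, matchesByLine);   // starts of matched rows
    Bits matchedSpans  = inclusiveSpan(matchedStarts, movedMatches);
    return filterByMask(bitNot(matchedSpans), data);                // keep everything else
}
```

For the two-row example data used earlier, a single match bit anywhere in the second row removes that row and leaves Henderson,Paul,ph@sfu.ca with its line end. Several match bits within one row collapse to a single bit at the line end, so the row is still deleted only once.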
Constraining Matches to a Column
Matches can be constrained to a particular column. The general method for matching within a row can be adapted by computing a column mask that has 1 bits only within the column and supplying it as input to the regular expression matching process. This requires a modification to the options of the ICGrep kernel so that the mask can be passed on to the RE compiler. When the ICGrep kernel is called, the mask must then be passed as the second parameter (initialMarkers) to the compileRE method of the RE compiler.
Marker compileRE(RE * re, Marker initialMarkers);
If the marker stream returned by the RE compiler has a 1 bit anywhere within the column, then a match is found.
Combining Masks
If keep masks are computed to edit out both a column and a row, they can be combined with a Pablo createAnd operation, and then a single FilterByMask can be applied to perform both edits.
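The reason a conjunction is the right combination is that both masks mark positions to keep: a position survives only if neither edit deletes it. A minimal stand-alone C++ sketch of this idea follows, with streams modelled as strings of '1' and '.'; bitAnd, filterByMask and applyEdits are hypothetical helpers for illustration, not Parabix functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

using Bits = std::string;  // one char per position: '1' = keep, '.' = delete

// Position-wise AND of two keep masks.
Bits bitAnd(const Bits& a, const Bits& b) {
    assert(a.size() == b.size());
    Bits out(a.size(), '.');
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] == '1' && b[i] == '1') out[i] = '1';
    return out;
}

// Keep data[i] exactly when mask[i] is set.
std::string filterByMask(const Bits& mask, const std::string& data) {
    assert(mask.size() == data.size());
    std::string out;
    for (std::size_t i = 0; i < data.size(); ++i)
        if (mask[i] == '1') out.push_back(data[i]);
    return out;
}

// Apply a column edit and a row edit with one filtering pass.
std::string applyEdits(const std::string& data, const Bits& keepColumns, const Bits& keepRows) {
    return filterByMask(bitAnd(keepColumns, keepRows), data);
}
```

If the masks were instead expressed as positions to delete, the combination would be an OR followed by a negation, which is equivalent by De Morgan's law.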