CSV Editing
Having successfully parsed a CSV file, let us now consider how to edit it.
Deleting a column
One of the basic editing operations that we might want to support is deleting a column from all records in a file.
Suppose we want to delete the second column in every row of the following CSV data.
Data_stream:       Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Field_separators:  .........1....1.........1...1........1...............1
Record_separators: ........................1............................1
The Parabix FilterByMask operation can do this for us, if we set up a mask stream that selects all of the data except the second column and its preceding comma.
Data stream: Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
To keep:     111111111.....11111111111111.........11111111111111111
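To make the effect of the mask concrete, here is a minimal plain C++ model of what FilterByMask does with a single-stream mask. It is a sketch for illustration, not the Parabix implementation: the helper name filterByMask and the representation of a bit stream as a string of '1' and '.' characters are assumptions, and '\n' stands in for ⏎.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative model only: a bit stream is a string of '1' and '.' characters,
// one character per byte of the data stream.
using Bits = std::string;

// Keep exactly those bytes of `data` whose mask position holds a 1 bit.
std::string filterByMask(const Bits & mask, const std::string & data) {
    assert(mask.size() == data.size());
    std::string out;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (mask[i] == '1') out.push_back(data[i]);
    }
    return out;
}
```

Applied to the example data with the mask shown above, this model yields Henderson,ph@sfu.ca⏎Lin,1234@zju.edu.cn⏎.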
How do we calculate this mask? With the following set of operations using a PabloBuilder pb.
PabloAST * F1start = pb.createNot(pb.createAdvance(pb.createNot(Record_separators), 1));
PabloAST * F1follow = pb.createScanTo(F1start, Field_separators);
PabloAST * F2start = pb.createAdvance(F1follow, 1);
PabloAST * F2follow = pb.createScanTo(F2start, Field_separators);
PabloAST * toDelete = pb.createIntrinsicCall(pablo::Intrinsic::SpanUpTo, {F1follow, F2follow});
PabloAST * toKeep = pb.createNot(toDelete);
Data stream: Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
F1start:     1........................1............................
F1follow:    .........1..................1.........................
F2start:     ..........1..................1........................
F2follow:    ..............1......................1................
toDelete:    .........11111..............111111111.................
toKeep:      111111111.....11111111111111.........11111111111111111
A Pablo kernel that computes this mask can be defined as follows.
MaskOutField2::MaskOutField2(BuilderRef b, StreamSet * Record_separators,
                             StreamSet * Field_separators,
                             StreamSet * toKeep)
: PabloKernel(b, "MaskOutField2",
              {Binding{"Record_separators", Record_separators},
               Binding{"Field_separators", Field_separators}},
              {Binding{"toKeep", toKeep}}) {}

void MaskOutField2::generatePabloMethod() {
    PabloBuilder pb(getEntryScope());
    Var * Record_separators = pb.createExtract(getInputStreamVar("Record_separators"), pb.getInteger(0));
    Var * Field_separators = pb.createExtract(getInputStreamVar("Field_separators"), pb.getInteger(0));
    PabloAST * F1start = pb.createNot(pb.createAdvance(pb.createNot(Record_separators), 1));
    PabloAST * F1follow = pb.createScanTo(F1start, Field_separators);
    PabloAST * F2start = pb.createAdvance(F1follow, 1);
    PabloAST * F2follow = pb.createScanTo(F2start, Field_separators);
    PabloAST * toDelete = pb.createIntrinsicCall(pablo::Intrinsic::SpanUpTo, {F1follow, F2follow});
    PabloAST * toKeep = pb.createNot(toDelete);
    pb.createAssign(pb.createExtract(getOutputStreamVar("toKeep"), pb.getInteger(0)), pb.createInFile(toKeep));
}
Of course, a slightly different kernel is needed to mask out a column other than the second one. This should be written as a more generic kernel that takes a columnNo parameter and performs the necessary number of ScanTo and Advance operations.
Because the generated logic differs for each columnNo, the name of the kernel should also be different for each columnNo.
MaskOutField::MaskOutField(BuilderRef b, StreamSet * Record_separators,
                           StreamSet * Field_separators,
                           StreamSet * toKeep,
                           unsigned columnNo)
: PabloKernel(b, "MaskOutField" + std::to_string(columnNo),
              {Binding{"Record_separators", Record_separators},
               Binding{"Field_separators", Field_separators}},
              {Binding{"toKeep", toKeep}}) {}
Finally, the first column must be handled differently. In this case, there is no preceding comma, so the mask should zero out the following comma rather than the preceding comma.
Matching and Deleting a Row
Suppose that we have a regular expression R to select CSV rows for deletion, where R has no Unicode properties or other features.
The Parabix regular expression engine can be used to perform the matching and FilterByMask can be used for the deletion.
The matching may be implemented with the ICGrepKernel as follows.
auto options = std::make_unique<GrepKernelOptions>(&cc::UTF8);
options->setSource(BasisBits);
StreamSet * MatchResults = P->CreateStreamSet(1, 1);
options->setResults(MatchResults);
options->setRE(R);
P->CreateKernelCall<ICGrepKernel>(std::move(options));
The resulting MatchResults stream will have 1 bits within any matching CSV row.
To select the row, the next task is to move the matches to the line end position, assuming that the line ends are given by mLineBreakStream.
StreamSet * const MovedMatches = P->CreateStreamSet();
P->CreateKernelCall<MatchedLinesKernel>(MatchResults, mLineBreakStream, MovedMatches);
Next, we can obtain a stream that is indexed by line number (one bit per CSV row), using the line break stream as the filter mask.
StreamSet * MatchesByLine = P->CreateStreamSet(1, 1);
FilterByMask(P, mLineBreakStream, MovedMatches, MatchesByLine);
LineStarts can then be identified as the positions immediately after a line break or at the beginning of the file. These are computed by the LineStartsKernel.
StreamSet * LineStarts = P->CreateStreamSet(1, 1);
P->CreateKernelCall<LineStartsKernel>(mLineBreakStream, LineStarts);
The starts of the matched lines are now computed by a SpreadByMask.
StreamSet * MatchedLineStarts = P->CreateStreamSet(1, 1);
SpreadByMask(P, LineStarts, MatchesByLine, MatchedLineStarts);
Now a mask for an entire matched row can be computed, using the LineSpansKernel. The ends of the matched lines are the positions marked in MovedMatches.
StreamSet * MatchedLineSpans = P->CreateStreamSet(1, 1);
P->CreateKernelCall<LineSpansKernel>(MatchedLineStarts, MovedMatches, MatchedLineSpans);
If FilterByMask were applied at this point, the result would be the matched rows. To delete the matched rows instead, MatchedLineSpans must first be negated (using a Pablo createNot operation).
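The whole row-deletion pipeline can be traced with a plain C++ simulation. This is a sketch under stated assumptions, not the Parabix kernels themselves: every helper below is a hypothetical model of the named operation, bit streams are strings of '1' and '.', the file is assumed to end with a line break, and the line spans are assumed to include the closing line break (if the real LineSpansKernel excludes it, the line break positions would have to be added to the span before negation).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

using Bits = std::string;  // illustrative model: '1' / '.' per position

// Model of FilterByMask: keep the positions of `data` selected by `mask`.
std::string filterByMask(const Bits & mask, const std::string & data) {
    std::string out;
    for (std::size_t i = 0; i < data.size(); ++i)
        if (mask[i] == '1') out.push_back(data[i]);
    return out;
}

// Model of SpreadByMask: place successive bits of `packed` at the 1 bits of `mask`.
Bits spreadByMask(const Bits & mask, const Bits & packed) {
    Bits out(mask.size(), '.');
    std::size_t k = 0;
    for (std::size_t i = 0; i < mask.size(); ++i)
        if (mask[i] == '1' && k < packed.size()) out[i] = packed[k++];
    return out;
}

// Model of MatchedLinesKernel: mark the line break of every line containing a match.
Bits moveToLineEnds(const Bits & matches, const Bits & lineBreaks) {
    Bits out(matches.size(), '.');
    bool seen = false;
    for (std::size_t i = 0; i < matches.size(); ++i) {
        if (matches[i] == '1') seen = true;
        if (lineBreaks[i] == '1') {
            if (seen) out[i] = '1';
            seen = false;
        }
    }
    return out;
}

// Model of LineStartsKernel: position 0 and every position right after a line break.
Bits lineStarts(const Bits & lineBreaks) {
    Bits out(lineBreaks.size(), '.');
    if (!out.empty()) out[0] = '1';
    for (std::size_t i = 0; i + 1 < lineBreaks.size(); ++i)
        if (lineBreaks[i] == '1') out[i + 1] = '1';
    return out;
}

// Model of LineSpansKernel, ASSUMED inclusive of the end position.
Bits inclusiveSpans(const Bits & starts, const Bits & ends) {
    Bits out(starts.size(), '.');
    bool inside = false;
    for (std::size_t i = 0; i < starts.size(); ++i) {
        if (starts[i] == '1') inside = true;
        if (inside) out[i] = '1';
        if (ends[i] == '1') inside = false;
    }
    return out;
}

Bits bitNot(const Bits & a) {
    Bits out(a.size(), '.');
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != '1') out[i] = '1';
    return out;
}

// The steps described above, ending with the negation and the final filter.
std::string deleteMatchedRows(const std::string & data, const Bits & matches,
                              const Bits & lineBreaks) {
    Bits movedMatches = moveToLineEnds(matches, lineBreaks);
    Bits matchesByLine = filterByMask(lineBreaks, movedMatches);
    Bits starts = lineStarts(lineBreaks);
    Bits matchedLineStarts = spreadByMask(starts, matchesByLine);
    Bits matchedLineSpans = inclusiveSpans(matchedLineStarts, movedMatches);
    return filterByMask(bitNot(matchedLineSpans), data);
}
```

For the two-row example used earlier, a single match bit anywhere inside the second row removes that row and leaves Henderson,Paul,ph@sfu.ca⏎.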
Constraining Matches to a Column
Matches can be constrained to a particular column. The general method for matching within a row can be adapted by computing a column mask that has 1 bits only within the column and supplying it as input to the regular expression matching process. This requires modifying the options of the ICGrep kernel so that the mask can be supplied as an option and passed on to the RE compiler. When the ICGrep kernel is called, the mask must then be passed as the second parameter to the compileRE method of the RE compiler.
Marker compileRE(RE * re, Marker initialMarkers);
If the marker stream returned by the RE compiler has a 1 bit anywhere within a field belonging to the column, then a match has been found.
Matches within the fields of the column can then be moved to the field separator position of the column, using the MatchedLinesKernel as shown above, but substituting the field separator stream in place of mLineBreakStream.
Call this stream FieldMatchMarks.
Given the FieldMatchMarks stream, a mask covering each entire matched field can be produced using logic similar to that for masks selecting rows.
The first step is to compute a stream MatchesByField by using the FilterByMask function with the field separator stream as the mask and FieldMatchMarks as the data stream. The second step is to compute a stream marking field starts using the LineStartsKernel as shown above, but again substituting the field separator stream in place of mLineBreakStream.
The third step is to compute the matched field starts using SpreadByMask, with the field starts stream as the mask applied to MatchesByField. Finally, the mask covering matched fields can be computed using the LineSpansKernel applied to the matched field starts and FieldMatchMarks.
Combining Masks
If masks are computed to edit out both a column and a row, these can be combined with a Pablo createAnd operation and then one FilterByMask can be applied.
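As a small illustration of why one filtering pass suffices, the following plain C++ sketch combines two keep-masks with a bitwise AND before filtering. The helpers bitAnd and filterByMask are hypothetical models of the Pablo createAnd operation and of FilterByMask, with bit streams represented as strings of '1' and '.'.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

using Bits = std::string;  // illustrative model: '1' / '.' per data byte

// Model of createAnd: a position is kept only if both masks keep it.
Bits bitAnd(const Bits & a, const Bits & b) {
    assert(a.size() == b.size());
    Bits out(a.size(), '.');
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] == '1' && b[i] == '1') out[i] = '1';
    return out;
}

// Model of FilterByMask: keep the bytes of `data` selected by `mask`.
std::string filterByMask(const Bits & mask, const std::string & data) {
    std::string out;
    for (std::size_t i = 0; i < data.size(); ++i)
        if (mask[i] == '1') out.push_back(data[i]);
    return out;
}
```

For the example data, combining the mask that drops the second column with a mask that drops the second row and filtering once leaves Henderson,ph@sfu.ca⏎.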