CSV Editing
Having successfully parsed a CSV file, let's now consider how to edit it.
Deleting a column
One of the basic editing operations that we might want to support is deleting a column from all records in a file.
Suppose we want to delete the second column in every row of the following CSV data.
Data_stream:       Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Field_separators:  .........1....1.........1...1........1...............1
Record_separators: ........................1............................1
The Parabix FilterByMask operation can do this for us, if we set up a mask stream that selects all of the data except the second column and its preceding comma.
Data stream: Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
To keep:     111111111.....11111111111111.........11111111111111111
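To make the effect of the mask concrete, here is a minimal scalar sketch of mask-based filtering in plain C++. It is not the Parabix FilterByMask kernel, which works on parallel bit streams. The function name `filterByMaskModel` is hypothetical, masks are written as strings of `1` and `.` as in the diagram, and `\n` stands in for ⏎.

```cpp
#include <cassert>
#include <string>

// Scalar model of mask-based filtering (illustrative only, not Parabix code):
// keep data[i] exactly when mask[i] == '1'.
std::string filterByMaskModel(const std::string & data, const std::string & mask) {
    assert(data.size() == mask.size());  // one mask position per data byte
    std::string out;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (mask[i] == '1') out.push_back(data[i]);
    }
    return out;
}
```

Applied to the sample data and the "To keep" mask, this model yields the two records with their second column and its preceding comma removed.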
How do we calculate this mask? With the following set of operations using a PabloBuilder pb.
PabloAST * F1start = pb.createNot(pb.createAdvance(pb.createNot(Record_separators), 1));
PabloAST * F1follow = pb.createScanTo(F1start, Field_separators);
PabloAST * F2start = pb.createAdvance(F1follow, 1);
PabloAST * F2follow = pb.createScanTo(F2start, Field_separators);
PabloAST * toDelete = pb.createIntrinsicCall(pablo::Intrinsic::SpanUpTo, {F1follow, F2follow});
PabloAST * toKeep = pb.createNot(toDelete);
Data stream: Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
F1start:     1........................1............................
F1follow:    .........1..................1.........................
F2start:     ..........1..................1........................
F2follow:    ..............1......................1................
toDelete:    .........11111..............111111111.................
toKeep:      111111111.....11111111111111.........11111111111111111
A Pablo kernel that computes this mask can be defined as follows.
MaskOutField2::MaskOutField2(BuilderRef b, StreamSet * Record_separators,
                                           StreamSet * Field_separators,
                                           StreamSet * toKeep)
: PabloKernel(b, "MaskOutField2",
              {Binding{"Record_separators", Record_separators},
               Binding{"Field_separators", Field_separators}},
              {Binding{"toKeep", toKeep}}) {}
void MaskOutField2::generatePabloMethod() {
    PabloBuilder pb(getEntryScope());
    Var * Record_separators = pb.createExtract(getInputStreamVar("Record_separators"), pb.getInteger(0));
    Var * Field_separators = pb.createExtract(getInputStreamVar("Field_separators"), pb.getInteger(0));
    PabloAST * F1start = pb.createNot(pb.createAdvance(pb.createNot(Record_separators), 1));
    PabloAST * F1follow = pb.createScanTo(F1start, Field_separators);
    PabloAST * F2start = pb.createAdvance(F1follow, 1);
    PabloAST * F2follow = pb.createScanTo(F2start, Field_separators);
    PabloAST * toDelete = pb.createIntrinsicCall(pablo::Intrinsic::SpanUpTo, {F1follow, F2follow});
    PabloAST * toKeep = pb.createNot(toDelete);
    pb.createAssign(pb.createExtract(getOutputStreamVar("toKeep"), pb.getInteger(0)), pb.createInFile(toKeep));
}
Of course, a slightly different kernel is needed for masking out a column other than the second one. This should be written as a more generic kernel that takes a columnNo parameter and performs the necessary number of ScanTo and Advance operations.
Because a different Pablo program is generated for each columnNo, the name of the kernel should also be different for each columnNo.
MaskOutField::MaskOutField(BuilderRef b, StreamSet * Record_separators,
                                         StreamSet * Field_separators,
                                         StreamSet * toKeep,
                                         unsigned columnNo)
: PabloKernel(b, "MaskOutField" + std::to_string(columnNo),
              {Binding{"Record_separators", Record_separators},
               Binding{"Field_separators", Field_separators}},
              {Binding{"toKeep", toKeep}}) {}
Finally, the first column must be handled differently. In this case, there is no preceding comma, so the mask should zero out the following comma rather than the preceding comma.
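One way the generalization might look is sketched below as a scalar model in plain C++, with one advance and one scan-to per additional column and a separate branch for the first column. This is not the Parabix kernel. The function `maskOutFieldModel` and its local helpers are hypothetical, and the sketch assumes that every record has at least columnNo columns, and at least two columns when columnNo is 1.

```cpp
#include <cassert>
#include <string>

// Scalar model (illustrative only): returns a keep mask ('1' = keep, '.' = delete)
// that removes column `columnNo` (1-based) from every record.
std::string maskOutFieldModel(const std::string & recordSeps,
                              const std::string & fieldSeps,
                              unsigned columnNo) {
    assert(columnNo >= 1 && recordSeps.size() == fieldSeps.size());
    const std::size_t n = recordSeps.size();
    // Local stand-ins for Advance(x, 1) and ScanTo(markers, to).
    auto advance1 = [n](const std::string & a) {
        std::string r(n, '.');
        for (std::size_t i = 1; i < n; ++i) r[i] = a[i - 1];
        return r;
    };
    auto scanTo = [n](const std::string & m, const std::string & to) {
        std::string r(n, '.');
        for (std::size_t i = 0; i < n; ++i) {
            if (m[i] != '1') continue;
            std::size_t j = to.find('1', i);
            if (j != std::string::npos) r[j] = '1';
        }
        return r;
    };
    // Start of field 1: position 0 and every position after a record separator.
    std::string start(n, '.');
    for (std::size_t i = 0; i < n; ++i) {
        if (i == 0 || recordSeps[i - 1] == '1') start[i] = '1';
    }
    std::string follow = scanTo(start, fieldSeps);
    std::string prevFollow = follow;
    // One Advance plus one ScanTo for each column after the first.
    for (unsigned c = 2; c <= columnNo; ++c) {
        prevFollow = follow;
        start = advance1(follow);
        follow = scanTo(start, fieldSeps);
    }
    std::string keep(n, '1');
    bool inside = false;
    for (std::size_t i = 0; i < n; ++i) {
        if (columnNo == 1) {
            // First column: delete the field and the comma that follows it.
            if (start[i] == '1') inside = true;
            if (inside) keep[i] = '.';
            if (follow[i] == '1') inside = false;
        } else {
            // Other columns: delete the preceding comma and the field.
            if (prevFollow[i] == '1') inside = true;
            if (follow[i] == '1') inside = false;
            if (inside) keep[i] = '.';
        }
    }
    return keep;
}
```

With the sample streams, columnNo 2 reproduces the toKeep mask shown earlier, while columnNo 1 removes each first field together with its trailing comma.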
Matching and Deleting a Row
Suppose that we have a regular expression R to select CSV rows for deletion.
The Parabix regular expression engine can be used to perform the matching and FilterByMask can be used for the deletion.
Matching with the ICGrepKernel may be implemented as follows.
auto options = std::make_unique<GrepKernelOptions>(&cc::UTF8);
options->setSource(BasisBits);
StreamSet * MatchResults = P->CreateStreamSet(1, 1);
options->setResults(MatchResults);
options->setRE(R);
P->CreateKernelCall<ICGrepKernel>(std::move(options));
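The source does not show how the match results become a mask covering whole rows, so the following is only a scalar sketch of the intended end result, in plain C++. It uses `std::regex` as a stand-in for the Parabix regular expression engine, and the function name `rowDeletionMaskModel` is hypothetical. It builds a keep mask that clears every position of each row in which the pattern matches, including the row's terminating newline.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Scalar model (illustrative only): '1' = keep, '.' = delete.
// A row is deleted when `pattern` matches anywhere within its text.
std::string rowDeletionMaskModel(const std::string & data, const std::regex & pattern) {
    std::string keep(data.size(), '1');
    std::size_t rowStart = 0;
    while (rowStart < data.size()) {
        const std::size_t nl = data.find('\n', rowStart);
        // contentEnd: one past the row text; rowEnd: one past the separator, if any.
        const std::size_t contentEnd = (nl == std::string::npos) ? data.size() : nl;
        const std::size_t rowEnd = (nl == std::string::npos) ? data.size() : nl + 1;
        if (std::regex_search(data.begin() + rowStart, data.begin() + contentEnd, pattern)) {
            for (std::size_t i = rowStart; i < rowEnd; ++i) keep[i] = '.';
        }
        rowStart = rowEnd;
    }
    return keep;
}
```

A mask of this shape is what a filtering step would then consume, in the same way as the toKeep mask used for column deletion.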