CSV Editing

Having successfully parsed a CSV file, let's now consider how to edit it.

Deleting a column

One of the basic editing operations that we might want to support is deleting a column from all records in a file.

Suppose we want to delete the second column in every row of the following CSV data.

Data stream:         Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Field_separators:    .........1....1.........1...1........1...............1
Record_separators:   ........................1............................1

The Parabix FilterByMask operation can do this for us, if we set up a mask stream that selects all of the data except the second column and its preceding comma.

Data stream:         Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
To keep:             111111111.....11111111111111.........11111111111111111
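To make the effect of the mask concrete, here is a minimal plain C++ sketch of what filtering by a mask does to a single stream. It is not the Parabix FilterByMask implementation, which works on parallel bit streams. The function name filterByMask and the use of '1' and '.' characters for mask bits are assumptions made for illustration, following the notation of the diagrams above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative stand-in for the effect of FilterByMask on one stream:
// keep data[i] exactly when mask[i] == '1'. (Hypothetical helper, not Parabix API.)
std::string filterByMask(const std::string & data, const std::string & mask) {
    assert(data.size() == mask.size());
    std::string out;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (mask[i] == '1') out.push_back(data[i]);
    }
    return out;
}
```

Applied to the data stream and the "To keep" mask above (with a newline character in place of each ⏎), this yields the two records with their second column removed.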

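How do we calculate this mask? The following operations, using a PabloBuilder pb, mark the start of each record (F1start), the separator that ends field 1 (F1follow), the start of field 2 (F2start), and the separator that ends field 2 (F2follow). The span from F1follow up to, but not including, F2follow is the region to delete.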

PabloAST * F1start = pb.createNot(pb.createAdvance(pb.createNot(Record_separators), 1));
PabloAST * F1follow = pb.createScanTo(F1start, Field_separators);
PabloAST * F2start = pb.createAdvance(F1follow, 1);
PabloAST * F2follow = pb.createScanTo(F2start, Field_separators);
PabloAST * toDelete = pb.createIntrinsicCall(pablo::Intrinsic::SpanUpTo, {F1follow, F2follow});
PabloAST * toKeep = pb.createNot(toDelete);

Data stream:         Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
F1start:             1........................1............................
F1follow:            .........1..................1.........................
F2start:             ..........1..................1........................
F2follow:            ..............1......................1................
toDelete:            .........11111..............111111111.................
toKeep:              111111111.....11111111111111.........11111111111111111

A Pablo kernel that computes this mask can be defined as follows.

MaskOutField2::MaskOutField2(BuilderRef b, StreamSet * Record_separators, 
                                           StreamSet * Field_separators, 
                                           StreamSet * toKeep)
: PabloKernel(b, "MaskOutField2",
  {Binding{"Record_separators", Record_separators}, 
   Binding{"Field_separators", Field_separators}},
  {Binding{"toKeep", toKeep}})  {}

void MaskOutField2::generatePabloMethod() {
    PabloBuilder pb(getEntryScope());
    Var * Record_separators = pb.createExtract(getInputStreamVar("Record_separators"), pb.getInteger(0));
    Var * Field_separators = pb.createExtract(getInputStreamVar("Field_separators"), pb.getInteger(0));
    // Mark the first position of every record: the start of the file and each position after a record separator.
    PabloAST * F1start = pb.createNot(pb.createAdvance(pb.createNot(Record_separators), 1));
    // Mark the separator that ends field 1.
    PabloAST * F1follow = pb.createScanTo(F1start, Field_separators);
    // Mark the first position of field 2, then the separator that ends it.
    PabloAST * F2start = pb.createAdvance(F1follow, 1);
    PabloAST * F2follow = pb.createScanTo(F2start, Field_separators);
    // Delete from the comma before field 2 up to, but not including, the separator after it.
    PabloAST * toDelete = pb.createIntrinsicCall(pablo::Intrinsic::SpanUpTo, {F1follow, F2follow});
    PabloAST * toKeep = pb.createNot(toDelete);
    // Restrict the mask to positions inside the file before writing it out.
    pb.createAssign(pb.createExtract(getOutputStreamVar("toKeep"), pb.getInteger(0)), pb.createInFile(toKeep));
}

Of course, masking out a column other than the second one requires a slightly different kernel. Rather than writing one kernel per column, write a more generic kernel that takes a columnNo parameter and performs the necessary number of ScanTo and Advance operations. Because the generated Pablo code depends on columnNo, the name of the kernel should be different for each columnNo.

MaskOutField::MaskOutField(BuilderRef b, StreamSet * Record_separators,
                           StreamSet * Field_separators,
                           StreamSet * toKeep,
                           unsigned columnNo)
: PabloKernel(b, "MaskOutField" + std::to_string(columnNo),
  {Binding{"Record_separators", Record_separators},
   Binding{"Field_separators", Field_separators}},
  {Binding{"toKeep", toKeep}})  {}

Finally, the first column must be handled differently. It has no preceding comma, so the mask should instead zero out the comma that follows it.
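One way to picture the first-column case is to delete the span from the start of each record up to the separator ending field 1, and then also delete that separator. The sketch below simulates this in plain C++ with hypothetical helpers; it is not the Parabix kernel. It assumes every record has at least two fields, so that the separator following field 1 is a comma rather than a record separator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

using Bits = std::string;  // '1' = set bit, '.' = clear bit, position 0 leftmost

Bits bitNot(const Bits & s) {
    Bits out(s);
    for (char & c : out) c = (c == '1') ? '.' : '1';
    return out;
}

Bits bitOr(const Bits & a, const Bits & b) {
    assert(a.size() == b.size());
    Bits out(a);
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (b[i] == '1') out[i] = '1';
    }
    return out;
}

// Shift every bit forward by n positions.
Bits advance(const Bits & s, std::size_t n) {
    Bits out(s.size(), '.');
    for (std::size_t i = n; i < s.size(); ++i) out[i] = s[i - n];
    return out;
}

// Move each marker forward to the first position at or after it where `to` is set.
Bits scanTo(const Bits & from, const Bits & to) {
    assert(from.size() == to.size());
    Bits out(from.size(), '.');
    for (std::size_t i = 0; i < from.size(); ++i) {
        if (from[i] != '1') continue;
        std::size_t p = i;
        while (p < to.size() && to[p] != '1') ++p;
        if (p < to.size()) out[p] = '1';
    }
    return out;
}

// Set positions from each start marker up to, but not including, the next end marker.
Bits spanUpTo(const Bits & starts, const Bits & ends) {
    assert(starts.size() == ends.size());
    Bits out(starts.size(), '.');
    bool inside = false;
    for (std::size_t i = 0; i < starts.size(); ++i) {
        if (starts[i] == '1') inside = true;
        if (ends[i] == '1') inside = false;
        if (inside) out[i] = '1';
    }
    return out;
}

// Keep-mask that removes field 1 and the comma that follows it.
// Assumes every record has at least two fields.
Bits keepMaskForField1(const Bits & recordSeps, const Bits & fieldSeps) {
    Bits F1start = bitNot(advance(bitNot(recordSeps), 1));
    Bits F1follow = scanTo(F1start, fieldSeps);
    Bits toDelete = bitOr(spanUpTo(F1start, F1follow), F1follow);
    return bitNot(toDelete);
}
```

On the example data this deletes "Henderson," and "Lin,", leaving each record to begin with its former second field.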