Last edited by cameron Feb 22, 2023
This is an old version of this page.

CSVediting

CSV Editing

Having successfully parsed a CSV file, let's now consider how to edit it.

Deleting a column

One of the basic editing operations that we might want to support is deleting a column from all records in a file.

Suppose we want to delete the second column in every row of the following CSV data.

Data stream:         Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Field_separators:    .........1....1.........1...1........1...............1
Record_separators:   ........................1............................1

The Parabix FilterByMask operation can do this for us if we set up a mask stream that selects all of the data except the second column and its preceding comma.

Data stream:         Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
To keep:             111111111.....11111111111111.........11111111111111111
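
To make the effect of such a mask concrete, here is a minimal plain C++ model of filtering by a mask. It is only a sketch of the behaviour described above, not the Parabix FilterByMask kernel: the name filterByMaskModel and the representation of a bit stream as a string of '1' and '.' characters are assumptions made for illustration, and '\n' stands in for ⏎.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Model only (hypothetical helper, not a Parabix API):
// keep data[i] exactly where mask[i] == '1', drop it otherwise.
std::string filterByMaskModel(const std::string & data, const std::string & mask) {
    assert(data.size() == mask.size());
    std::string out;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (mask[i] == '1') out.push_back(data[i]);
    }
    return out;
}
```

Applied to the example data and the "To keep" mask above, this model yields the two records with their second column removed.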

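We can calculate this mask with the following sequence of operations on a PabloBuilder pb. F1start marks the first position of each record, F1follow marks the separator that ends the first field, F2start marks the first position of the second field, and F2follow marks the separator that ends it. Note that Field_separators includes the record separators, so ScanTo also stops at the end of a record.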

PabloAST * F1start = pb.createNot(pb.createAdvance(pb.createNot(Record_separators), 1));
PabloAST * F1follow = pb.createScanTo(F1start, Field_separators);
PabloAST * F2start = pb.createAdvance(F1follow, 1);
PabloAST * F2follow = pb.createScanTo(F2start, Field_separators);
PabloAST * toDelete = pb.createIntrinsicCall(pablo::Intrinsic::SpanUpTo, {F1follow, F2follow});
PabloAST * toKeep = pb.createNot(toDelete);

Applied to the example data, these operations produce the following streams.

Data stream:         Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
F1start:             1........................1............................
F1follow:            .........1..................1.........................
F2start:             ..........1..................1........................
F2follow:            ..............1......................1................
toDelete:            .........11111..............111111111.................
toKeep:              111111111.....11111111111111.........11111111111111111
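
The trace above can be checked with a small stand-alone model. The sketch below is plain C++, not Pablo: the helpers notS, advance1, scanTo and spanUpTo are hypothetical names that imitate the operations used in the text on strings of '1' and '.' characters. The model assumes that start and end markers are properly paired and that nothing is shifted beyond the end of the stream.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Model only: a bit stream is a string in which '1' is a set bit and '.' a clear bit.
using Stream = std::string;

// Not: flip every position.
Stream notS(Stream s) {
    for (char & c : s) c = (c == '1') ? '.' : '1';
    return s;
}

// Advance by one: every bit moves one position forward; position 0 becomes clear.
Stream advance1(const Stream & s) {
    if (s.empty()) return s;
    return "." + s.substr(0, s.size() - 1);
}

// ScanTo: move each marker forward to the first position at or after it where `to` is set.
Stream scanTo(const Stream & markers, const Stream & to) {
    assert(markers.size() == to.size());
    Stream r(markers.size(), '.');
    for (std::size_t i = 0; i < markers.size(); ++i) {
        if (markers[i] != '1') continue;
        std::size_t j = i;
        while (j < to.size() && to[j] != '1') ++j;
        if (j < to.size()) r[j] = '1';
    }
    return r;
}

// SpanUpTo: set every position from a start marker up to, but not including, the next end marker.
Stream spanUpTo(const Stream & starts, const Stream & ends) {
    assert(starts.size() == ends.size());
    Stream r(starts.size(), '.');
    bool inside = false;
    for (std::size_t i = 0; i < starts.size(); ++i) {
        if (starts[i] == '1') inside = true;
        if (ends[i] == '1') inside = false;
        if (inside) r[i] = '1';
    }
    return r;
}

// The six steps from the text, applied to the model streams.
Stream toKeepModel(const Stream & recordSeps, const Stream & fieldSeps) {
    Stream F1start = notS(advance1(notS(recordSeps)));
    Stream F1follow = scanTo(F1start, fieldSeps);
    Stream F2start = advance1(F1follow);
    Stream F2follow = scanTo(F2start, fieldSeps);
    Stream toDelete = spanUpTo(F1follow, F2follow);
    return notS(toDelete);
}
```

Calling toKeepModel with the Record_separators and Field_separators rows of the example reproduces the toKeep row of the trace.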

A Pablo kernel that computes this mask can be defined as follows.

MaskOutField2::MaskOutField2(BuilderRef b, StreamSet * Record_separators, 
                                           StreamSet * Field_separators, 
                                           StreamSet * toKeep)
: PabloKernel(b, "MaskOutField2",
  {Binding{"Record_separators", Record_separators}, 
   Binding{"Field_separators", Field_separators}},
  {Binding{"toKeep", toKeep}})  {}

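void MaskOutField2::generatePabloMethod() {
    PabloBuilder pb(getEntryScope());
    // Extract the single stream from each one-stream input set.
    Var * Record_separators = pb.createExtract(getInputStreamVar("Record_separators"), pb.getInteger(0));
    Var * Field_separators = pb.createExtract(getInputStreamVar("Field_separators"), pb.getInteger(0));
    // F1start: the first position of each record.
    PabloAST * F1start = pb.createNot(pb.createAdvance(pb.createNot(Record_separators), 1));
    // F1follow: the separator that ends field 1.
    PabloAST * F1follow = pb.createScanTo(F1start, Field_separators);
    // F2start and F2follow: the first position of field 2 and the separator that ends it.
    PabloAST * F2start = pb.createAdvance(F1follow, 1);
    PabloAST * F2follow = pb.createScanTo(F2start, Field_separators);
    // toDelete: from the comma before field 2 up to, but not including, the separator after it.
    PabloAST * toDelete = pb.createIntrinsicCall(pablo::Intrinsic::SpanUpTo, {F1follow, F2follow});
    PabloAST * toKeep = pb.createNot(toDelete);
    // Restrict the mask to positions within the file before writing the output.
    pb.createAssign(pb.createExtract(getOutputStreamVar("toKeep"), pb.getInteger(0)), pb.createInFile(toKeep));
}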

Of course, masking out a column other than the second requires a slightly different kernel. A more generic kernel should take a columnNo parameter and perform the corresponding number of ScanTo and Advance operations. Because the generated logic depends on columnNo, the kernel name should also be different for each columnNo, as in the following constructor.

MaskOutField::MaskOutField(BuilderRef b, StreamSet * Record_separators, 
                                          StreamSet * Field_separators, 
                                          StreamSet * toKeep,
                                          unsigned columnNo)
: PabloKernel(b, "MaskOutField" + std::to_string(columnNo),
  {Binding{"Record_separators", Record_separators}, 
   Binding{"Field_separators", Field_separators}},
  {Binding{"toKeep", toKeep}})  {}

Finally, the first column must be handled differently. In this case, there is no preceding comma, so the mask should zero out the following comma rather than the preceding comma.
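
The generalization to an arbitrary columnNo, including the special case of the first column, can be sketched with a stand-alone model. This is plain C++ rather than Pablo, and every name in it (maskOutFieldModel and the string-based helpers) is hypothetical. The sketch assumes that columnNo is 1-based, that every record has at least columnNo fields, and that every record has at least two fields.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

using Stream = std::string;  // model only: '1' = set bit, '.' = clear bit

Stream notS(Stream s) {
    for (char & c : s) c = (c == '1') ? '.' : '1';
    return s;
}

Stream advance1(const Stream & s) {  // shift every bit forward by one position
    if (s.empty()) return s;
    return "." + s.substr(0, s.size() - 1);
}

Stream scanTo(const Stream & markers, const Stream & to) {
    assert(markers.size() == to.size());
    Stream r(markers.size(), '.');
    for (std::size_t i = 0; i < markers.size(); ++i) {
        if (markers[i] != '1') continue;
        std::size_t j = i;  // first position at or after the marker where `to` is set
        while (j < to.size() && to[j] != '1') ++j;
        if (j < to.size()) r[j] = '1';
    }
    return r;
}

Stream spanUpTo(const Stream & starts, const Stream & ends) {
    assert(starts.size() == ends.size());
    Stream r(starts.size(), '.');
    bool inside = false;
    for (std::size_t i = 0; i < starts.size(); ++i) {
        if (starts[i] == '1') inside = true;
        if (ends[i] == '1') inside = false;  // the end position itself is excluded
        if (inside) r[i] = '1';
    }
    return r;
}

// Mask that keeps everything except column columnNo and one adjacent comma.
Stream maskOutFieldModel(const Stream & recordSeps, const Stream & fieldSeps, unsigned columnNo) {
    assert(columnNo >= 1);
    Stream start = notS(advance1(notS(recordSeps)));  // first position of each record
    Stream follow = scanTo(start, fieldSeps);          // separator that ends field 1
    if (columnNo == 1) {
        // No preceding comma: delete the field together with the comma that follows it.
        return notS(spanUpTo(start, advance1(follow)));
    }
    Stream prevFollow = follow;
    for (unsigned k = 2; k <= columnNo; ++k) {
        prevFollow = follow;                 // separator before field k
        start = advance1(prevFollow);        // first position of field k
        follow = scanTo(start, fieldSeps);   // separator that ends field k
    }
    // Delete the preceding comma and the field, but keep the separator that follows.
    return notS(spanUpTo(prevFollow, follow));
}
```

With columnNo set to 2, the model repeats the steps of MaskOutField2; in a real kernel the same loop would presumably issue the corresponding pb.createAdvance and pb.createScanTo calls.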

Matching and deleting a row

Suppose that we have a regular expression R to select CSV rows for deletion.
The Parabix regular expression engine can be used to perform the matching and FilterByMask can be used for the deletion.

The matching may be implemented with the ICGrepKernel as follows.

    auto options = std::make_unique<GrepKernelOptions>(&cc::UTF8);
    options->setSource(BasisBits);
    StreamSet * MatchResults = P->CreateStreamSet(1, 1);
    options->setResults(MatchResults);
    options->setRE(R);
    P->CreateKernelCall<ICGrepKernel>(std::move(options));
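
Once matching rows are known, a keep mask for FilterByMask has to clear every position of each matched row, including its record separator. The plain C++ sketch below models only that step. It assumes, purely for illustration, a one-bit stream that marks the record separator of each row to delete; this page does not state where the ICGrepKernel places its match marks, so both that input and the name rowKeepMaskModel are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Model only: streams are strings in which '1' is a set bit and '.' a clear bit.
// matchedEnds is a hypothetical stream marking the record separator of each row to delete.
std::string rowKeepMaskModel(const std::string & recordSeps, const std::string & matchedEnds) {
    assert(recordSeps.size() == matchedEnds.size());
    std::string toKeep(recordSeps.size(), '1');
    std::size_t rowStart = 0;
    for (std::size_t i = 0; i < recordSeps.size(); ++i) {
        if (recordSeps[i] != '1') continue;
        if (matchedEnds[i] == '1') {
            // Clear the whole row, from its first position through its separator.
            for (std::size_t j = rowStart; j <= i; ++j) toKeep[j] = '.';
        }
        rowStart = i + 1;  // the next row begins right after this separator
    }
    return toKeep;
}
```

The resulting mask plays the same role as the toKeep stream used for column deletion, so the filtering step itself is unchanged.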