CSVediting · Changes

Update CSVediting authored Nov 10, 2021 by cameron's avatar cameron
Showing with 48 additions and 6 deletions
CSVediting.md
View page @ 51a42548
@@ -14,22 +14,22 @@ Record_separators: ........................1............................1
```
The Parabix `FilterByMask` operation can do this for us, if we set up a mask stream that selects all of the data except the second column and its preceding comma.
```
Data stream: Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
To keep:     111111111.....11111111111111.........11111111111111111
```
How do we calculate this mask? With the following set of operations using a
`PabloBuilder pb`.
```
PabloAST * F1start = pb.createNot(pb.createAdvance(pb.createNot(Record_separators), 1));
PabloAST * F1follow = pb.createScanTo(F1start, Field_separators);
PabloAST * F2start = pb.createAdvance(F1follow, 1);
PabloAST * F2follow = pb.createScanTo(F2start, Field_separators);
PabloAST * toDelete = pb.createIntrinsicCall(pablo::Intrinsic::ExclusiveSpan, {F1follow, F2follow});
PabloAST * toKeep = pb.createNot(toDelete);
```
@@ -39,7 +39,49 @@ F1start: 1........................1............................
F1follow: .........1..................1.........................
F2start:  ..........1..................1........................
F2follow: ..............1......................1................
toDelete: .........11111..............111111111.................
toKeep:   111111111.....11111111111111.........11111111111111111
```
A Pablo kernel that computes this mask can be written as follows.
```
MaskOutField2::MaskOutField2(BuilderRef b, StreamSet * Record_separators,
                                           StreamSet * Field_separators,
                                           StreamSet * toKeep)
: PabloKernel(b, "MaskOutField2",
              {Binding{"Record_separators", Record_separators},
               Binding{"Field_separators", Field_separators}},
              {Binding{"toKeep", toKeep}}) {}

void MaskOutField2::generatePabloMethod() {
    PabloBuilder pb(getEntryScope());
    Var * Record_separators = pb.createExtract(getInputStreamVar("Record_separators"), pb.getInteger(0));
    Var * Field_separators = pb.createExtract(getInputStreamVar("Field_separators"), pb.getInteger(0));
    // Field 1 starts at position 0 and immediately after each record separator.
    PabloAST * F1start = pb.createNot(pb.createAdvance(pb.createNot(Record_separators), 1));
    PabloAST * F1follow = pb.createScanTo(F1start, Field_separators);
    // Field 2 starts one position after the comma that follows field 1.
    PabloAST * F2start = pb.createAdvance(F1follow, 1);
    PabloAST * F2follow = pb.createScanTo(F2start, Field_separators);
    PabloAST * toDelete = pb.createIntrinsicCall(pablo::Intrinsic::ExclusiveSpan, {F1follow, F2follow});
    PabloAST * toKeep = pb.createNot(toDelete);
    pb.createAssign(pb.createExtract(getOutputStreamVar("toKeep"), pb.getInteger(0)), pb.createInFile(toKeep));
}
```
Of course, a slightly different kernel is needed to mask out a column other than the
second one. This should be written as a more generic kernel that takes a `columnNo`
parameter and performs the necessary number of `ScanTo` and `Advance` operations.
Because each `columnNo` produces a different kernel body, the kernel name should also
be different for each `columnNo`.
```
MaskOutField::MaskOutField(BuilderRef b, StreamSet * Record_separators,
                                         StreamSet * Field_separators,
                                         StreamSet * toKeep,
                                         unsigned columnNo)
: PabloKernel(b, "MaskOutField" + std::to_string(columnNo),
              {Binding{"Record_separators", Record_separators},
               Binding{"Field_separators", Field_separators}},
              {Binding{"toKeep", toKeep}}) {}
```
Finally, the first column must be handled differently. In this case, there is
no preceding comma, so the mask should zero out the following comma rather than the
preceding comma.
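To make the generic `columnNo` logic and the first-column special case concrete, here is a hedged sketch that models each stream as a single 64-bit word (bit `i` stands for data byte `i`, so it only handles inputs of at most 64 bytes). None of these names are Parabix APIs: `Bits`, `matchChar`, `advance1`, `scanTo`, `keepMask` and `filterByMask` are hypothetical helpers for this example. It assumes every record has at least `columnNo` fields (and at least two fields when `columnNo` is 1), and, as an extra assumption not stated above, it lets the last scan for `columnNo >= 2` stop at either a comma or a record separator so that the final column of a record can also be removed. Spans are computed by subtraction, which relies on each start marker being paired with a later end marker.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

using Bits = std::uint64_t;  // bit i describes data byte i (inputs of at most 64 bytes)

// One marker bit at every position where the data byte equals c.
Bits matchChar(const std::string & data, char c) {
    assert(data.size() <= 64);
    Bits b = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] == c) b |= Bits{1} << i;
    }
    return b;
}

// Advancing by one position is a shift towards higher bit indices.
Bits advance1(Bits m) { return m << 1; }

// Move each marker to the first position at or after it where `to` is set.
// The addition carries each marker through the run of non-`to` positions.
Bits scanTo(Bits markers, Bits to) { return (markers + ~to) & to; }

// Mask that keeps everything except column `columnNo` (1-based) and one adjacent comma.
Bits keepMask(const std::string & data, unsigned columnNo) {
    assert(columnNo >= 1);
    const Bits inFile = data.size() == 64 ? ~Bits{0} : (Bits{1} << data.size()) - 1;
    const Bits recordSeps = matchChar(data, '\n');
    const Bits fieldSeps = matchChar(data, ',');
    // Field 1 starts at position 0 and right after every record separator.
    Bits start = ~advance1(~recordSeps) & inFile;
    Bits prevFollow = 0;
    // One ScanTo/Advance pair per column that has to be skipped.
    for (unsigned k = 1; k < columnNo; ++k) {
        prevFollow = scanTo(start, fieldSeps);
        start = advance1(prevFollow);
    }
    Bits toDelete = 0;
    if (columnNo == 1) {
        // No preceding comma: delete the field together with its following comma.
        const Bits follow = scanTo(start, fieldSeps);
        toDelete = (follow - start) | follow;
    } else {
        // Delete the preceding comma and the field, but not the following separator.
        const Bits follow = scanTo(start, fieldSeps | recordSeps);
        toDelete = follow - prevFollow;
    }
    return ~toDelete & inFile;
}

// Scalar stand-in for filtering: keep only the bytes whose mask bit is set.
std::string filterByMask(const std::string & data, Bits keep) {
    std::string out;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if ((keep >> i) & 1) out += data[i];
    }
    return out;
}
```

With the sample data, `columnNo = 1` leaves `Paul,ph@sfu.ca⏎Qingshan,1234@zju.edu.cn⏎`, and `columnNo = 2` gives the same result as the `MaskOutField2` trace. In a real kernel the loop would emit the corresponding `createScanTo` and `createAdvance` calls instead of evaluating them directly.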