... | @@ -88,7 +88,9 @@ preceding comma. |
## Matching and Deleting a Row

Suppose that we have a regular expression R to select CSV rows for deletion,
where R has no Unicode properties or other features.

The Parabix regular expression engine can be used to perform the matching and
`FilterByMask` can be used for the deletion.

... | @@ -103,6 +105,41 @@ Using the ICGrepKernel to perform the matching may be implemented as follows. |
P->CreateKernelCall<ICGrepKernel>(std::move(options));
```

The resulting `MatchResults` stream will have 1 bits on any matching CSV row.
To select entire rows, the next task is to move each match to the line-end position of its row,
assuming that the line ends are given by `mLineBreakStream`.

```
StreamSet * const MovedMatches = P->CreateStreamSet();
P->CreateKernelCall<MatchedLinesKernel>(MatchResults, mLineBreakStream, MovedMatches);
```
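
To make the effect of this step concrete, here is a minimal sketch in plain C++ that models bit streams as strings of `'0'` and `'1'`. The function `moveMatchesToLineEnds` is a hypothetical stand-in written for illustration, not the Parabix `MatchedLinesKernel`; it assumes that a row counts as matched if any position in it carries a match bit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative model only (not Parabix code): each stream is a string of
// '0'/'1' characters, one character per input position.
// A row is every position up to and including its line break.
std::string moveMatchesToLineEnds(const std::string & matches,
                                  const std::string & lineBreaks) {
    assert(matches.size() == lineBreaks.size());
    std::string moved(matches.size(), '0');
    bool pending = false;  // true once a match has been seen in the current row
    for (std::size_t i = 0; i < matches.size(); ++i) {
        if (matches[i] == '1') pending = true;
        if (lineBreaks[i] == '1') {
            if (pending) moved[i] = '1';  // mark the line-end position of a matched row
            pending = false;              // the next row starts with no match
        }
    }
    return moved;
}
```

For the 9-byte input `ab\ncd\nef\n`, the line breaks are `001001001`. A match anywhere in the second row, for instance `000010000`, is moved to that row's line break, giving `000001000`.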

Next, we can compute a stream that is indexed by line number (1 bit per CSV row).
```
StreamSet * MatchesByLine = P->CreateStreamSet(1, 1);
FilterByMask(P, mLineBreakStream, MovedMatches, MatchesByLine);
```
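
The filtering step can be pictured with a small model in plain C++. The helper `filterByMask` below is a hypothetical illustration that treats streams as strings of `'0'`/`'1'`; it is not the Parabix `FilterByMask` implementation, only a sketch of the behaviour described here: keep the input bits found at mask positions and pack them together.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative model only (not Parabix code).
// Keeps the bits of `input` at positions where `mask` is '1' and
// concatenates them, so the result has one bit per '1' in the mask.
std::string filterByMask(const std::string & mask, const std::string & input) {
    assert(mask.size() == input.size());
    std::string packed;
    for (std::size_t i = 0; i < mask.size(); ++i) {
        if (mask[i] == '1') packed.push_back(input[i]);
    }
    return packed;
}
```

With line breaks `001001001` as the mask and moved matches `000001000` as the input, the result is `010`: three rows, of which only the second matched.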

`LineStarts` can then be identified as the positions immediately after a line break
or at the beginning of the file. These are computed by the `LineStartsKernel`.

```
StreamSet * LineStarts = P->CreateStreamSet(1, 1);
P->CreateKernelCall<LineStartsKernel>(mLineBreakStream, LineStarts);
```
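
The rule for line starts can be sketched in a few lines of plain C++. The function `lineStarts` is a hypothetical model over strings of `'0'`/`'1'`, not the `LineStartsKernel` itself, and it assumes that a line break at the very last position does not open a new row.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative model only (not Parabix code).
// Marks position 0 and every position immediately after a line break.
// A break at the final position has no successor, so it starts no row.
std::string lineStarts(const std::string & lineBreaks) {
    std::string starts(lineBreaks.size(), '0');
    if (!starts.empty()) starts[0] = '1';  // beginning of the file
    for (std::size_t i = 0; i + 1 < lineBreaks.size(); ++i) {
        if (lineBreaks[i] == '1') starts[i + 1] = '1';
    }
    return starts;
}
```

For line breaks `001001001`, this gives `100100100`. When the file ends with a line break, the number of starts equals the number of breaks, which is what lets a per-row stream be lined up with the starts.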

The starts of the matched lines are now computed by a `SpreadByMask`.
```
StreamSet * MatchedLineStarts = P->CreateStreamSet(1, 1);
SpreadByMask(P, LineStarts, MatchesByLine, MatchedLineStarts);
```
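
Spreading is the inverse of filtering: packed bits are deposited, in order, at the positions marked by a mask. The sketch below models this in plain C++ over strings of `'0'`/`'1'`. The helper `spreadByMask` is hypothetical and written for illustration; it is not the Parabix `SpreadByMask` code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative model only (not Parabix code).
// The k-th '1' position of `mask` receives the k-th bit of `packed`;
// every other position is '0'. Mask positions beyond the end of
// `packed` are left as '0'.
std::string spreadByMask(const std::string & mask, const std::string & packed) {
    std::string out(mask.size(), '0');
    std::size_t next = 0;  // index of the next packed bit to deposit
    for (std::size_t i = 0; i < mask.size(); ++i) {
        if (mask[i] == '1' && next < packed.size()) {
            out[i] = packed[next++];
        }
    }
    return out;
}
```

With line starts `100100100` as the mask and the per-row stream `010`, the output is `000100000`: only the start of the second row is marked.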

Now a mask for each entire matched row can be computed, using the `LineSpansKernel`.
```
StreamSet * MatchedLineSpans = P->CreateStreamSet(1, 1);
// MovedMatches (computed above) marks the line ends of the matched rows.
P->CreateKernelCall<LineSpansKernel>(MatchedLineStarts, MovedMatches, MatchedLineSpans);
```
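
As a rough model of the span computation, the following plain C++ sketch fills in every position from a marked start through the next marked end. The function `lineSpans` is hypothetical and assumes spans include both endpoints, so that the line break of a matched row is covered as well; the actual `LineSpansKernel` may define its boundaries differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative model only (not Parabix code).
// Marks every position from each start through the following end,
// inclusive of both. Streams are strings of '0'/'1'.
std::string lineSpans(const std::string & starts, const std::string & ends) {
    assert(starts.size() == ends.size());
    std::string spans(starts.size(), '0');
    bool inside = false;  // true while between a start and its end
    for (std::size_t i = 0; i < starts.size(); ++i) {
        if (starts[i] == '1') inside = true;
        if (inside) spans[i] = '1';
        if (ends[i] == '1') inside = false;
    }
    return spans;
}
```

For matched line starts `000100000` and matched line ends `000001000`, the span mask is `000111000`, covering `cd\n` in the input `ab\ncd\nef\n`.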

If `FilterByMask` were applied with `MatchedLineSpans` at this point, the output would contain only the matched rows. To delete
the matched rows instead, `MatchedLineSpans` must first be negated (use a Pablo `createNot` operation).
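
The difference between selecting and deleting rows comes down to whether the span mask is negated before filtering. A minimal plain C++ sketch, assuming the span mask is given as a string of `'0'`/`'1'` with one character per input byte; `negate` and `deleteMatchedRows` are hypothetical helpers for illustration and do not use the Pablo or Parabix APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative model only (not Parabix code).
// Flips every bit of a '0'/'1' stream.
std::string negate(const std::string & stream) {
    std::string out(stream);
    for (char & bit : out) bit = (bit == '1') ? '0' : '1';
    return out;
}

// Keeps the bytes of `text` that lie outside the matched spans,
// i.e. filters the text by the negated span mask.
std::string deleteMatchedRows(const std::string & text,
                              const std::string & matchedSpans) {
    assert(text.size() == matchedSpans.size());
    const std::string keep = negate(matchedSpans);
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (keep[i] == '1') out.push_back(text[i]);
    }
    return out;
}
```

With the input `ab\ncd\nef\n` and span mask `000111000`, the negated mask is `111000111` and the filtered result is `ab\nef\n`. Filtering by the mask without negation would instead yield `cd\n`.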