2. JSON_separator as a string to be printed between each record (e.g., `,\n`), and

3. JSON_suffix as a string to be printed at the end of the output (e.g., `]\n`).

4. One more template string, `\`, is required in case we need to escape a special character inside a JSON string.
## CSV Parsing
CSV parsing is the process of identifying the beginning and ending of each data field in the
...
## String Transformation
In general, the string transformation process requires the modification of the CSV input to insert
the required JSON template strings. There are three steps.

1. Compression of the input to remove undesired non-data characters.

2. Expansion of the compressed input to the correct size, inserting zeroes.

3. Filling in template values at the inserted positions.

These three steps will be implemented using the `FilterByMask`, `SpreadByMask` and `StringReplaceKernel` of the Parabix
infrastructure.

### Compression of CSV Input
In this step, CSV syntax marks are removed from the input, leaving only the raw CSV data.

1. Double quotes surrounding fields are deleted. In general, double quotes will be added back during the expansion step (because they are in the expansion templates).

2. The quote_escape characters are deleted, but the escaped_quote characters are kept.

3. Commas are deleted.

4. A CR before an LF is deleted, to standardize uniformly on LF as the line separator.

Compression is achieved by creating a mask of 1 bits for all character positions that are to be kept.
In this mask, all character positions to be deleted are marked with 0 bits. Call this mask `CSV_data_mask`.
Given this mask, we apply `FilterByMask` to produce two streamsets:

1. `FilteredBasisBits` is produced by filtering the eight basis bit streams.

2. `FilteredMarks` is produced by filtering three streams produced by parsing:

   a. the positions of the starts of each record,

   b. the positions of the starts of each field, and

   c. the positions of any characters that need to be escaped with `\` in the JSON output (escaped_quotes and embedded newlines).
### Expansion of Compressed Input
In this phase, `ExpandedBasisBits` is computed by expanding `FilteredBasisBits`, inserting zero bits at marked points. The number of zero bits to insert at each point is determined by the template string to be inserted there.

In order to do this, we first need to number the marked positions according to their string template index.
We create a BixNum stream set holding these template indices at the positions where insertions are to be made.

For the positions marking the starts of each field, we assign consecutive numbers starting with 0.
This can be achieved using the pablo `EveryNth` operation.
Given these numbered values, we can use the Character Class Compiler to compute the bits of another BixNum
representing the number of 0 bits to insert.
`SpreadByMask` then performs the expansion.