2. JSON_separator as a string to be printed between each record (e.g., `,\n`), and

3. JSON_suffix as a string to be printed at the end of the output (e.g., `]\n`).

4. One more template string, `\`, is required in case we need to escape a special character inside a JSON string.
## CSV Parsing
CSV parsing is the process of identifying the beginning and ending of each data field in the
...
## String Transformation
In general, the string transformation process requires the modification of the CSV input to insert
the required JSON template strings. There are three steps.

1. Compression of the input to remove undesired non-data characters.

2. Expansion of the compressed input to the correct size, inserting zeroes.

3. Filling in template values at the inserted positions.

These three steps will be implemented using the `FilterByMask`, `SpreadByMask` and `StringReplaceKernel` of the Parabix
infrastructure.

### Compression of CSV Input
In this step, CSV syntax marks are removed from the input, leaving only the raw CSV data.

1. Double quotes surrounding fields are deleted. In general, double quotes will be added back during the expansion step (because they are in the expansion templates).

2. The quote_escape characters are deleted, but the escaped_quote characters are kept.

3. Commas are deleted.

4. A CR before an LF is deleted, to standardize uniformly on LF as the line separator.

Compression is achieved by creating a mask of 1 bits for all character positions that are to be kept.
In this mask, all character positions to be deleted are marked with 0 bits. Call this mask `CSV_data_mask`.
Given this mask, we apply `FilterByMask` to produce two streamsets:

1. `FilteredBasisBits` is produced by filtering the eight basis bit streams.

2. `FilteredMarks` is produced by filtering three streams produced by parsing:

   a. the positions of the starts of each record,

   b. the positions of the starts of each field, and

   c. the positions of any characters that need to be escaped with `\` in the JSON output (escaped_quotes and embedded newlines).
### Expansion of Compressed Input
In this phase, `ExpandedBasisBits` is computed by expanding `FilteredBasisBits`, inserting zero bits at marked points. The number of zero bits to insert at each point is determined by the template string to be inserted there.

In order to do this, we first need to number the marked positions according to their string template index.
We create a BixNum stream set holding these template indices at the positions where insertions are to be made.

For the positions marking the starts of each field, we assign consecutive numbers starting with 0.
This can be achieved using the pablo `EveryNth` operation.
Given these numbered values, we can use the Character Class Compiler to compute the bits of another BixNum
representing the number of 0 bits to insert.
`SpreadByMask` then performs the expansion.