csv2json · Changes

Update csv2json authored May 15, 2020 by cameron's avatar cameron
Showing with 43 additions and 5 deletions
csv2json.md
View page @ 66f8ca6c
@@ -45,6 +45,8 @@ A corresponding JSON output could be as follows.
2. JSON_separator as a string to be printed in between each record (e.g., `,\n`), and
3. JSON_suffix as a string to be printed at the end of the output (e.g., `]\n`).
4. One more template string, `\`, is required in case we need to escape a special character inside a JSON string.
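As a rough illustration of how these template strings fit together, the following standalone C++ sketch joins already-formatted records. The function names are hypothetical, and the `prefix` parameter stands in for the first template string, which is not shown in this excerpt. Only double quotes are escaped here; how other special characters are handled is left open.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Parabix): joins formatted JSON records.
// `prefix` is an assumed name for the first template string.
std::string assemble_json(const std::vector<std::string>& records,
                          const std::string& prefix,
                          const std::string& separator,
                          const std::string& suffix) {
    std::string out = prefix;
    for (std::size_t i = 0; i < records.size(); ++i) {
        if (i > 0) out += separator;  // the separator goes between records only
        out += records[i];
    }
    out += suffix;                    // the suffix is printed once, at the end
    return out;
}

// Hypothetical helper: inserts the `\` template string before each double
// quote inside a JSON string value.
std::string escape_quotes(const std::string& value) {
    std::string out;
    for (char c : value) {
        if (c == '"') out += '\\';
        out += c;
    }
    return out;
}
```

Calling `assemble_json` with `[\n`, `,\n` and `]\n` wraps the records in a JSON array with one record per line.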
## CSV Parsing
CSV parsing is the process of identifying the beginning and ending of each data field in the
@@ -135,9 +137,45 @@ pb.createAnd(newline, pb.createIntrinsicCall(pablo::Intrinsic::InclusiveSpan, {s
## String Transformation
In general, the string transformation process requires the modification of the CSV input to insert
the required JSON template strings. There are two steps.
1. Expansion of the input to the correct size, inserting zeroes.
2. Filling in template values at the inserted positions.
the required JSON template strings. There are three steps.
1. Compression of input to remove undesired non-data characters from the CSV input.
2. Expansion of the compressed input to the correct size, inserting zeroes.
3. Filling in template values at the inserted positions.
These three steps will be implemented using the `FilterByMask`, `SpreadByMask` and `StringReplaceKernel` of the Parabix
infrastructure.
### Compression of CSV Input
In this step, CSV syntax marks are removed from the input, to leave the raw CSV data only.
1. Double quotes surrounding fields are deleted. In general, double quotes will be added back during the expansion step (because they are in the expansion templates).
2. The quote_escape characters are deleted, but the escaped_quote characters are kept.
3. Commas are deleted.
4. A CR before an LF is deleted, to standardize on LF as the line separator.
Compression is achieved by creating a mask of 1 bits for all character positions that are to be kept.
In this mask, all character positions to be deleted are marked with 0 bits. Call the mask `CSV_data_mask`.
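The four deletion rules can be sketched as a byte-at-a-time scan. This is a scalar illustration only, not the bit-parallel Pablo logic the Parabix implementation would use. The function name `csv_data_mask` and the representation of the mask as a string of `'1'`/`'0'` characters are assumptions made for readability. The sketch also assumes that only commas outside quoted fields are separators, and that a CR is deleted whenever it is immediately followed by an LF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Scalar sketch: returns one '1' (keep) or '0' (delete) per input byte.
std::string csv_data_mask(const std::string& csv) {
    std::string mask(csv.size(), '1');
    const std::size_t n = csv.size();
    bool in_quotes = false;
    for (std::size_t i = 0; i < n; ++i) {
        const char c = csv[i];
        const bool has_next = i + 1 < n;
        if (c == '\r' && has_next && csv[i + 1] == '\n') {
            mask[i] = '0';                 // rule 4: CR before LF
        } else if (c == '"') {
            mask[i] = '0';                 // rules 1 and 2: this quote is deleted
            if (in_quotes && has_next && csv[i + 1] == '"') {
                ++i;                       // the escaped_quote that follows is kept
            } else {
                in_quotes = !in_quotes;    // opening or closing quote of a field
            }
        } else if (c == ',' && !in_quotes) {
            mask[i] = '0';                 // rule 3: field-separator comma
        }
    }
    return mask;
}
```

For the input `a,"b,""c"""` followed by CR LF, the kept bytes are `a`, `b`, the embedded comma, the two escaped quotes around `c`, and the LF.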
Given this mask, we need to apply `FilterByMask` to produce two streamsets:
1. `FilteredBasisBits` is produced by filtering the eight basis bit streams.
2. `FilteredMarks` is produced by filtering three streams produced from parsing.
a. The positions of the starts of each record.
b. The positions of the starts of each field.
c. The positions of any characters that need to be escaped with `\` in the JSON output (escaped_quotes and embedded newlines).
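The effect of filtering can be illustrated with a scalar model in which every stream, data or marker, is a string with one character per input position. This is only a sketch of the semantics: the real `FilterByMask` works on bit streams in parallel, and the function below is a hypothetical stand-in, not its API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Scalar model of filtering: keep stream[i] exactly where mask[i] == '1'.
// The same mask is applied to the data bytes and to each marker stream, so
// the surviving markers stay aligned with the surviving data.
std::string filter_by_mask(const std::string& stream, const std::string& mask) {
    assert(stream.size() == mask.size());
    std::string out;
    for (std::size_t i = 0; i < stream.size(); ++i) {
        if (mask[i] == '1') out += stream[i];
    }
    return out;
}
```

For the record `x,"y"` followed by LF, the mask is `100101`. Filtering the data leaves `xy` plus the LF, and filtering the field-start marks `100100` leaves `110`, so both fields are still marked even though the comma between them is gone.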
### Expansion of Compressed Input
In this phase, `ExpandedBasisBits` is computed by expansion from `FilteredBasisBits` by insertion of zero bits at marked points. The number of zero bits to insert is based on the template string to be inserted.
In order to do this, we first need to number the marked positions according to their string template index.
We create a BixNum stream set for the positions at which insertions are to be made.
For the positions marking the starts of each field, we assign consecutive numbers starting with 0.
This can be achieved using the Pablo `EveryNth` operation.
Given these numbered values, we can use the Character Class Compiler to compute the bits of another BixNum
representing the number of 0 bits to insert.
`SpreadByMask` then does the expansion.
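The numbering and expansion can be sketched in scalar form as follows. Everything here is a hypothetical illustration rather than the Parabix API: streams are modelled as strings with one character per position, `insert_len[k]` is an assumed table giving the template length for field number `k`, and field numbers are assumed to wrap around after the last column. Inserted positions are filled with a zero byte by default. Templates that follow the last field of a record are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds a spread mask from the filtered field-start marks: before the k-th
// field start of a record, insert_len[k] zero positions are reserved.
std::string make_spread_mask(const std::string& field_starts,
                             const std::vector<std::size_t>& insert_len) {
    assert(!insert_len.empty());
    std::string mask;
    std::size_t field_no = 0;  // consecutive numbering, starting with 0
    for (char m : field_starts) {
        if (m == '1') {
            mask.append(insert_len[field_no % insert_len.size()], '0');
            ++field_no;
        }
        mask += '1';           // every filtered byte keeps one position
    }
    return mask;
}

// Scalar model of spreading: source bytes go to the '1' positions of the
// mask, in order; '0' positions receive the fill byte (zero by default).
std::string spread_by_mask(const std::string& data, const std::string& mask,
                           char fill = '\0') {
    std::string out;
    std::size_t next = 0;
    for (char m : mask) {
        if (m == '1') {
            assert(next < data.size());
            out += data[next++];
        } else {
            out += fill;
        }
    }
    assert(next == data.size());
    return out;
}
```

With filtered data `xy` plus LF, field-start marks `110` and insertion lengths 2 and 3, the spread mask is `00100011`. Spreading then leaves two empty positions before `x` and three before `y`, ready to be filled with template text.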
These two steps will be implemented using the `SpreadByMask` and `StringReplaceKernel` of the Parabix
infrastructure. More information on this process will be provided later.