# CSV to JSON

## Introduction

CSV to JSON conversion proceeds in three phases.

1. Determining the CSV to JSON translation scheme for the particular file to be converted.
   This scheme identifies the number of fields in each record, the corresponding field names for
   JSON attributes, and any other parameters that govern the translation.
2. Parsing the CSV input to determine the records and fields.
3. Transforming the parsed input according to the scheme to produce JSON output.

## Example

As a simple running example, we use the following CSV input file.

```
Family Name,Given Name,email
Henderson,Paul,ph@sfu.ca
Lin,Qingshan,1234@zju.edu.cn
```

A corresponding JSON output could be as follows.

```
[
{"Family Name":"Henderson","Given Name":"Paul","email":"ph@sfu.ca"},
{"Family Name":"Lin","Given Name":"Qingshan","email":"1234@zju.edu.cn"}
]
```

## CSV to JSON Scheme

1. First determine the number of fields for each record and their field names. Represent these as a C++
   `std::vector<std::string> FieldNames` such that the size of the vector is the number of fields.
   The list of fields could be taken from the first line of the file or supplied as a program parameter.
   Example: `Family Name,Given Name,email`

2. Define a vector of `TemplateStrings` in which CSV field values will be embedded to produce the JSON
   output records. For a record with n fields there are n + 1 template strings: one before each
   field value and one after the last.
   For example, in a compact fully-quoted mode, the template strings could be as follows.
   - `{"Family Name":"`
   - `","Given Name":"`
   - `","email":"`
   - `"}`

3. Define three additional strings for combining JSON records together.
   1. `JSON_prefix` as a string to be printed at the very beginning of the JSON output (e.g., `[`),
   2. `JSON_separator` as a string to be printed between consecutive records (e.g., `,\n`), and
   3. `JSON_suffix` as a string to be printed at the end of the output (e.g., `]\n`).

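The scheme above can be illustrated with a small scalar sketch. This is not the Parabix implementation; the function names `makeTemplateStrings`, `formatRecord`, and `formatOutput` are hypothetical, and the sketch assumes that field names and values need no JSON escaping.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Build compact fully-quoted template strings from the field names.
// For n fields the result has n + 1 entries.
// Assumption: names are embedded verbatim, with no JSON escaping.
std::vector<std::string> makeTemplateStrings(const std::vector<std::string> & FieldNames) {
    assert(!FieldNames.empty());
    std::vector<std::string> TemplateStrings;
    for (std::size_t i = 0; i < FieldNames.size(); i++) {
        // The first template opens the object; later ones close the
        // previous value and open the next attribute.
        std::string t = (i == 0) ? "{\"" : "\",\"";
        t += FieldNames[i];
        t += "\":\"";
        TemplateStrings.push_back(t);
    }
    TemplateStrings.push_back("\"}");  // closes the last value and the object
    return TemplateStrings;
}

// Interleave the template strings with the field values of one record.
std::string formatRecord(const std::vector<std::string> & TemplateStrings,
                         const std::vector<std::string> & fieldValues) {
    assert(TemplateStrings.size() == fieldValues.size() + 1);
    std::string out;
    for (std::size_t i = 0; i < fieldValues.size(); i++) {
        out += TemplateStrings[i];
        out += fieldValues[i];
    }
    out += TemplateStrings.back();
    return out;
}

// Combine formatted records using the prefix, separator, and suffix strings.
std::string formatOutput(const std::vector<std::string> & records,
                         const std::string & JSON_prefix,
                         const std::string & JSON_separator,
                         const std::string & JSON_suffix) {
    std::string out = JSON_prefix;
    for (std::size_t i = 0; i < records.size(); i++) {
        if (i > 0) out += JSON_separator;
        out += records[i];
    }
    out += JSON_suffix;
    return out;
}
```

Calling `makeTemplateStrings` with the three field names of the running example yields the four template strings listed in step 2, and `formatRecord` with the values `Henderson`, `Paul`, `ph@sfu.ca` yields the first JSON record of the example output.
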
## CSV Parsing

CSV parsing is the process of identifying the beginning and end of each data field in the
CSV file. Following Parabix methods, we define bit streams that mark the significant positions.

For each field, let its start position be the position of the first character in the field, and
let its follow position be the position immediately after the last character of the field.
For the data lines of our example (the header line is omitted), we have the following, where newlines are marked by `⏎`.

```
Data stream:   Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Field starts:  1.........1....1.........1...1........1...............
Field follows: .........1....1.........1...1........1...............1
```

We also need the start and follow positions of records.

```
Data stream:     Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Record starts:   1........................1............................
Record follows:  ........................1............................1
```

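As a rough illustration of these definitions, the following scalar sketch computes the four marker streams one byte at a time, rendering each stream as a string of `1` and `.` characters. It is not Parabix code, which would compute the streams with parallel bitwise operations. The names `CsvMarkers` and `markCsv` are hypothetical, and the sketch assumes unquoted fields, `\n` line endings, and input that ends with a newline.

```cpp
#include <cstddef>
#include <string>

// Marker streams rendered as text: '1' marks a position, '.' is unmarked.
struct CsvMarkers {
    std::string fieldStarts;
    std::string fieldFollows;
    std::string recordStarts;
    std::string recordFollows;
};

// Scalar sketch. Assumptions: no quoted fields, '\n' line endings,
// and the data ends with a newline.
CsvMarkers markCsv(const std::string & data) {
    const std::size_t n = data.size();
    CsvMarkers m{std::string(n, '.'), std::string(n, '.'),
                 std::string(n, '.'), std::string(n, '.')};
    bool atFieldStart = true;   // position 0 starts the first field
    bool atRecordStart = true;  // position 0 starts the first record
    for (std::size_t i = 0; i < n; i++) {
        const bool isNewline = (data[i] == '\n');
        const bool isDelimiter = isNewline || (data[i] == ',');
        if (atFieldStart) m.fieldStarts[i] = '1';
        if (atRecordStart) m.recordStarts[i] = '1';
        // A delimiter sits immediately after the last character of a field,
        // so it is the follow position. For an empty field the start and
        // follow marks land on the same position.
        if (isDelimiter) m.fieldFollows[i] = '1';
        if (isNewline) m.recordFollows[i] = '1';
        atFieldStart = isDelimiter;
        atRecordStart = isNewline;
    }
    return m;
}
```

For the input `a,b\nc,d\n`, the field starts are `1.1.1.1.`, the field follows are `.1.1.1.1`, the record starts are `1...1...`, and the record follows are `...1...1`.
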
## String Transformation

In general, the string transformation process modifies the CSV input to insert
the required JSON template strings. There are two steps.

1. Expansion of the input to the correct size, with zeroes at the inserted positions.
2. Filling in the template values at the inserted positions.

These two steps will be implemented using `SpreadByMask` and `StringReplaceKernel` from the Parabix
infrastructure. More information on this process will be provided later.

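To make the two steps concrete, here is a minimal scalar sketch of expand-then-fill on an ordinary string. It does not use or model the interfaces of `SpreadByMask` or `StringReplaceKernel`; the `Insertion` type and the functions `expandWithZeros` and `fillInsertions` are hypothetical names introduced only for illustration. The sketch assumes the insertions are sorted by position, and it leaves aside how the CSV delimiters themselves are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One template string to be inserted immediately before data[pos].
// pos == data.size() means "append at the end".
struct Insertion {
    std::size_t pos;
    std::string text;
};

// Step 1: expand the input to its final size, leaving zero bytes where
// template text will later be written. Insertions must be sorted by pos.
std::string expandWithZeros(const std::string & data,
                            const std::vector<Insertion> & insertions) {
    std::string expanded;
    std::size_t next = 0;  // next position of data not yet copied
    for (const Insertion & ins : insertions) {
        assert(ins.pos >= next && ins.pos <= data.size());
        expanded.append(data, next, ins.pos - next);  // copy data up to pos
        expanded.append(ins.text.size(), '\0');       // reserve zeroed space
        next = ins.pos;
    }
    expanded.append(data, next, std::string::npos);   // copy the remainder
    return expanded;
}

// Step 2: overwrite each zeroed gap with its template text.
std::string fillInsertions(std::string expanded,
                           const std::vector<Insertion> & insertions) {
    std::size_t shift = 0;  // bytes inserted before the current gap
    for (const Insertion & ins : insertions) {
        expanded.replace(ins.pos + shift, ins.text.size(), ins.text);
        shift += ins.text.size();
    }
    return expanded;
}
```

For instance, expanding the field value `Lin` with `{"Family Name":"` inserted before position 0 and `"}` appended at position 3, then filling, produces `{"Family Name":"Lin"}`. Separating the size change from the content change mirrors the two steps listed above.
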