CSV to JSON
Introduction
CSV to JSON conversion proceeds in three phases.
- Determination of the CSV to JSON translation scheme for the particular file to be converted. This scheme identifies the number of fields in each record, the corresponding field names for JSON attributes and any other parameters that govern the translation.
- Parsing the CSV input to determine the records and fields.
- Transforming the parsed input according to the scheme to produce JSON output.
Example
As a simple running example, we use the following CSV input file.
Family Name,Given Name,email
Henderson,Paul,ph@sfu.ca
Lin,Qingshan,1234@zju.edu.cn
A corresponding JSON output could be as follows.
[
{"Family Name":"Henderson","Given Name":"Paul","email":"ph@sfu.ca"},
{"Family Name":"Lin","Given Name":"Qingshan","email":"1234@zju.edu.cn"}
]
CSV to JSON Scheme
- First determine the number of fields for each record and their field names. Represent these as a C++ std::vector<std::string> FieldNames, so that the size of the vector is the number of fields. The list of fields can be taken from the first line of the file or supplied as a program parameter. Example: Family Name,Given Name,email
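- Define a vector of TemplateStrings in which CSV field values will be embedded to produce the JSON output records. One template string precedes each field and a final one closes the record, so three fields need four template strings. For example, in a compact fully-quoted mode, the template strings could be as follows.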
{"Family Name":"
","Given Name":"
","email":"
"}
- Define three additional strings for combining JSON records together.
  - JSON_prefix: a string to be printed at the very beginning of the JSON output (e.g., "[").
  - JSON_separator: a string to be printed between consecutive records (e.g., ",\n").
  - JSON_suffix: a string to be printed at the end of the output (e.g., "]\n").
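The scheme described above can be sketched in C++ as follows. This is only an illustration under stated assumptions: the helper names splitHeader, makeTemplates and joinRecords are hypothetical, the CSV is assumed to be simple (no quoted fields or embedded commas), and field names are assumed to need no JSON escaping.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split a header line such as "Family Name,Given Name,email" on commas.
// Assumption: simple CSV with no quoted fields or embedded commas.
std::vector<std::string> splitHeader(const std::string& header) {
    std::vector<std::string> names;
    std::size_t start = 0;
    for (;;) {
        std::size_t comma = header.find(',', start);
        if (comma == std::string::npos) {
            names.push_back(header.substr(start));
            return names;
        }
        names.push_back(header.substr(start, comma - start));
        start = comma + 1;
    }
}

// Build compact fully-quoted template strings: N field names give N + 1
// templates. Assumption: names contain no characters that need JSON escaping.
std::vector<std::string> makeTemplates(const std::vector<std::string>& names) {
    std::vector<std::string> templates;
    for (std::size_t i = 0; i < names.size(); ++i) {
        templates.push_back(std::string(i == 0 ? "{\"" : "\",\"") + names[i] + "\":\"");
    }
    templates.push_back("\"}");
    return templates;
}

// Combine already-rendered JSON records using the three combining strings.
std::string joinRecords(const std::vector<std::string>& records,
                        const std::string& prefix,
                        const std::string& separator,
                        const std::string& suffix) {
    std::string out = prefix;
    for (std::size_t i = 0; i < records.size(); ++i) {
        if (i > 0) out += separator;
        out += records[i];
    }
    return out + suffix;
}
```

Calling makeTemplates(splitHeader("Family Name,Given Name,email")) yields the four template strings listed above. The exact whitespace in the prefix and suffix is a layout choice. For instance, "[\n" and "\n]\n" would put the brackets on their own lines, as in the example output.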
CSV Parsing
CSV parsing is the process of identifying the beginning and ending of each data field in the CSV file. Following Parabix methods, we define bit streams for significant positions.
For each field, let its start position be the position of the first character in the field, and let its follow position be the position immediately after the last character of the field.
For our example, we have the following, where newlines are marked by ⏎ and the header line is omitted.
Data stream:   Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Field starts:  1.........1....1.........1...1........1...............
Field follows: .........1....1.........1...1........1...............1
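To make the definitions of start and follow positions concrete, here is a minimal sequential sketch that computes both marker streams as strings of '1' and '.' characters. It is not Parabix code: FieldMarks and markFields are hypothetical names, and the sketch assumes simple CSV in which ',' separates fields, '\n' ends every record, and there is no quoting.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: one marker character per input position.
struct FieldMarks {
    std::string starts;   // '1' at the first character of each field
    std::string follows;  // '1' immediately after the last character of each field
};

// Assumptions: ',' separates fields, '\n' terminates each record, no quoting.
// An empty field receives a follow mark but no start mark in this sketch.
FieldMarks markFields(const std::string& data) {
    FieldMarks m{std::string(data.size(), '.'), std::string(data.size(), '.')};
    bool atFieldStart = true;  // true at the beginning and right after a delimiter
    for (std::size_t i = 0; i < data.size(); ++i) {
        bool delimiter = (data[i] == ',' || data[i] == '\n');
        if (delimiter) {
            m.follows[i] = '1';   // the delimiter sits just past the field's last character
        } else if (atFieldStart) {
            m.starts[i] = '1';    // first character of a new field
        }
        atFieldStart = delimiter;
    }
    return m;
}
```

Applied to the example data with ⏎ encoded as '\n', markFields reproduces the two marker rows shown above.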
We also need the start and follow positions of records.
Data stream:    Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Record starts:  1........................1............................
Record follows: ........................1............................1
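Record markers lend themselves to a bitwise formulation in the spirit of bit-stream processing. The sketch below packs one bit per character into a single 64-bit word (bit i corresponds to position i), so it only handles inputs of up to 64 characters. The names matchChar, markRecords and render are hypothetical and are not Parabix API. The sketch assumes every record is non-empty and ends with '\n'.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Bit i of the result is set when data[i] == c (first 64 characters only).
std::uint64_t matchChar(const std::string& data, char c) {
    std::uint64_t bits = 0;
    for (std::size_t i = 0; i < data.size() && i < 64; ++i) {
        if (data[i] == c) bits |= std::uint64_t{1} << i;
    }
    return bits;
}

struct RecordMarks {
    std::uint64_t starts;
    std::uint64_t follows;
};

// Record follows are the newline positions. Record starts are position 0
// plus every position one past a newline, limited to the valid input range.
RecordMarks markRecords(const std::string& data) {
    std::uint64_t newlines = matchChar(data, '\n');
    std::uint64_t valid = data.size() >= 64
        ? ~std::uint64_t{0}
        : (std::uint64_t{1} << data.size()) - 1;
    std::uint64_t starts = ((newlines << 1) | 1) & valid;
    return {starts, newlines};
}

// Render the low n bits as a '1'/'.' string, position 0 first.
std::string render(std::uint64_t bits, std::size_t n) {
    std::string s(n, '.');
    for (std::size_t i = 0; i < n && i < 64; ++i) {
        if ((bits >> i) & 1) s[i] = '1';
    }
    return s;
}
```

For the 54-character example, markRecords sets start bits 0 and 25 and follow bits 24 and 53, matching the rows above. Shifting the newline stream by one position is what turns each record's follow position into the next record's start.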
String Transformation
In general, string transformation modifies the CSV input to insert the required JSON template strings. There are two steps.
- Expansion of the input to the correct size, inserting zeroes.
- Filling in template values at the inserted positions.
These two steps will be implemented using the SpreadByMask and StringReplaceKernel of the Parabix infrastructure. More information on this process will be provided later.
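Until those details are available, the two steps can be illustrated with a plain sequential sketch. It is a stand-in for the idea only and does not use or reproduce SpreadByMask or StringReplaceKernel. The names Expanded, expand and fill are hypothetical. The sketch assumes simple CSV (no quoting, '\n' after every record), no zero bytes in the input, and that each record has no more fields than the scheme defines. Dropping the ',' and '\n' delimiters during expansion is also an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical intermediate result of the expansion step.
struct Expanded {
    std::string buffer;              // field bytes kept, '\0' where inserted text goes
    std::vector<std::string> fills;  // text for each zero-filled gap, in order
};

// Step 1: expansion. Delimiters are dropped and a zero-filled gap of the
// right size is opened for each template string and record separator.
Expanded expand(const std::string& data,
                const std::vector<std::string>& templates,
                const std::string& separator) {
    Expanded e;
    auto openGap = [&e](const std::string& text) {
        e.buffer.append(text.size(), '\0');
        e.fills.push_back(text);
    };
    std::size_t field = 0;
    bool atRecordStart = true;
    bool firstRecord = true;
    for (char c : data) {
        if (atRecordStart) {
            if (!firstRecord) openGap(separator);
            openGap(templates.at(0));            // opening template of the record
            field = 0;
            atRecordStart = false;
            firstRecord = false;
        }
        if (c == ',') {
            openGap(templates.at(++field));      // throws if a record has too many fields
        } else if (c == '\n') {
            openGap(templates.back());           // closing template of the record
            atRecordStart = true;
        } else {
            e.buffer.push_back(c);
        }
    }
    return e;
}

// Step 2: fill each gap, in order, with its text.
std::string fill(const Expanded& e) {
    std::string out = e.buffer;
    std::size_t pos = 0;
    for (const std::string& text : e.fills) {
        while (pos < out.size() && out[pos] != '\0') ++pos;  // skip field bytes
        out.replace(pos, text.size(), text);
        pos += text.size();
    }
    return out;
}
```

With the four template strings from the scheme and ",\n" as the separator, fill(expand(data, templates, ",\n")) produces the two JSON records of the running example. Wrapping the result in a prefix such as "[\n" and a suffix such as "\n]\n" gives the example output. Separating expansion from filling means the output size is fixed before any template text is written.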