Last edited by cameron Nov 10, 2021

csv2json

CSV to JSON

Introduction

As an example of Parabix methods, consider the task of converting CSV (Comma-Separated Value) files to JSON format (JavaScript Object Notation).

The overall structure of CSV to JSON conversion has three phases.

  1. Determination of the CSV to JSON translation scheme for the particular file to be converted. This scheme identifies the number of fields in each record, the corresponding field names for JSON attributes and any other parameters that govern the translation.
  2. Parsing the CSV input to determine the records and fields.
  3. Transforming the parsed input according to the scheme to produce JSON output.

Simple Universal Translation Scheme

One possible CSV to JSON translation scheme is straightforward and universally applicable: it always produces correct JSON.

  1. Every CSV data record is translated into a JSON object with key-value pairs enclosed in "{}".
  2. CSV headers become the keys for each key-value pair.
  3. The data values in each column of each CSV record are translated into JSON strings enclosed in double quotes (").
  4. The generated JSON objects are stored in one large JSON array.
  5. The JSON objects are displayed one per line, with no other whitespace added to the representation.

Although this scheme is universal, it could be preferable to translate some data values into JSON numbers, booleans (true and false) or the null value instead. Other options involve different whitespace conventions for the JSON output and/or possibly translating some CSV values into structured JSON objects. However, we first consider the universal scheme in order to illustrate the simplest application of Parabix methods to this problem.

Example

As a simple running example, we use the following CSV input file.

Family Name,Given Name,email
Henderson,Paul,ph@sfu.ca
Lin,Qingshan,1234@zju.edu.cn

A corresponding JSON output could be as follows.

[
{"Family Name":"Henderson","Given Name":"Paul","email":"ph@sfu.ca"},
{"Family Name":"Lin","Given Name":"Qingshan","email":"1234@zju.edu.cn"}
]
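Before turning to the parallel Parabix implementation, the universal scheme can be illustrated with a simple scalar reference implementation. This is only a sketch: the function names (splitFields, jsonEscape, csv2json) are illustrative, and splitFields assumes no quoted fields containing embedded commas or newlines (full parsing rules are covered on the CSV Parsing page).

```cpp
#include <string>
#include <vector>
#include <sstream>

// Split one CSV line on commas. Simplifying assumption: no quoted
// fields with embedded commas or newlines.
std::vector<std::string> splitFields(const std::string &line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) fields.push_back(field);
    return fields;
}

// Escape characters that are special inside a JSON string.
std::string jsonEscape(const std::string &s) {
    std::string out;
    for (char c : s) {
        if (c == '"' || c == '\\') out += '\\';
        out += c;
    }
    return out;
}

// Translate CSV records to a JSON array of objects, one object per
// line, following the universal scheme: every value becomes a string.
std::string csv2json(const std::vector<std::string> &fieldNames,
                     const std::vector<std::string> &records) {
    std::string out = "[\n";
    for (size_t r = 0; r < records.size(); r++) {
        std::vector<std::string> values = splitFields(records[r]);
        out += "{";
        for (size_t i = 0; i < values.size(); i++) {
            if (i > 0) out += ",";
            out += "\"" + fieldNames[i] + "\":\"" + jsonEscape(values[i]) + "\"";
        }
        out += "}";
        if (r + 1 < records.size()) out += ",";
        out += "\n";
    }
    return out + "]\n";
}
```

Applied to the example input above, this produces exactly the JSON output shown.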

CSV to JSON Scheme

  1. First determine the number of fields for each record and their field names. Represent as a C++ string vector FieldNames such that the size of the vector is the number of fields. The list of fields could be taken from the first line of the file or supplied as a program parameter. Example: Family Name,Given Name,email

  2. Define a vector of TemplateStrings in which CSV field values will be embedded to produce the JSON output records. For example, in a compact fully-quoted mode, the template strings could be as follows.

    • {"Family Name":"
    • ","Given Name":"
    • ","email":"
    • "}
  3. Define three additional strings for combining JSON records together.

    1. JSON_prefix as a string to be printed at the very beginning of JSON output (e.g., [),
    2. JSON_separator as a string to be printed in between each record (e.g., ,\n), and
    3. JSON_suffix as a string to be printed at the end of the output (e.g., ]\n).
  4. One more template string, \, is required in case we need to escape a special character inside a JSON string.
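The TemplateStrings of the compact fully-quoted mode can be derived mechanically from FieldNames. The helper below is an illustrative sketch (makeTemplates is not a Parabix API): template 0 opens the object and names the first field, templates 1 through n-1 close the previous value and name the next field, and a final template closes the last value and the object.

```cpp
#include <string>
#include <vector>

// Build the compact fully-quoted template strings from field names.
std::vector<std::string> makeTemplates(const std::vector<std::string> &fieldNames) {
    std::vector<std::string> templates;
    for (size_t i = 0; i < fieldNames.size(); i++) {
        // First template opens the JSON object; later ones close the
        // previous value with '"' and add a separating comma.
        std::string prefix = (i == 0) ? "{\"" : "\",\"";
        templates.push_back(prefix + fieldNames[i] + "\":\"");
    }
    templates.push_back("\"}");  // close the last value and the object
    return templates;
}
```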

CSV Parsing

The first task in CSV to JSON conversion is to correctly parse the CSV input file. See the CSV Parsing page.

String Transformation

In general, the string transformation process requires the modification of the CSV input to insert the required JSON template strings. There are three steps.

  1. Compression of input to remove undesired non-data characters from the CSV input.
  2. Expansion of the compressed input to the correct size, inserting zeroes.
  3. Filling in template values at the inserted positions.

These three steps will be implemented using the FilterByMask, SpreadByMask and StringReplaceKernel of the Parabix infrastructure.
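The semantics of the first two kernels can be modeled at the byte level. The scalar functions below (filterByMask, spreadByMask) are illustrative stand-ins, not the actual kernels, which operate on parallel bit streams a block at a time rather than one character at a time.

```cpp
#include <string>
#include <vector>

// Scalar model of FilterByMask: keep only positions where mask is 1.
std::string filterByMask(const std::vector<bool> &mask, const std::string &data) {
    std::string out;
    for (size_t i = 0; i < data.size(); i++)
        if (mask[i]) out += data[i];
    return out;
}

// Scalar model of SpreadByMask: emit the next input character wherever
// the mask is 1, and a zero placeholder byte wherever the mask is 0.
std::string spreadByMask(const std::vector<bool> &mask, const std::string &data) {
    std::string out;
    size_t next = 0;
    for (bool keep : mask)
        out += keep ? data[next++] : '\0';
    return out;
}
```

Filtering shortens the stream to the count of 1 bits in the mask; spreading lengthens it back, leaving zeroed positions that will later be filled with template characters.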

Compression of CSV Input

In this step, CSV syntax marks are removed from the input, to leave the raw CSV data only.

  1. Double quotes surrounding fields are deleted. In general, double quotes will be added back during the expansion step (because they are in the expansion templates).
  2. The quote_escape characters are deleted, but the escaped_quote characters are kept.
  3. Commas are deleted.
  4. CR before LF should be deleted to uniformly standardize on LF as line separator.

Compression is achieved by creating a mask of 1 bits for all character positions that are to be kept. In this mask, all character positions to be deleted are marked with 0 bits. Call the mask CSV_data_mask.

Given this mask, we need to apply FilterByMask to produce two streamsets:

  1. FilteredBasisBits is produced by filtering the eight basis bit streams.
  2. FilteredMarks is produced by filtering three streams produced from parsing: (a) the positions of the starts of each record, (b) the positions of the starts of each field, and (c) the positions of any characters that need to be escaped with \ in the JSON output (escaped_quotes and embedded newlines).

Filtering reduces the length of the stream. For example, given the following data stream and mask, filtering the field starts yields the shortened stream shown on the last line.

Data stream:     "Free speech",limitation,"Never yell ""Fire!"" in a crowded theatre."
CSV_data_mask    .11111111111..1111111111..11111111111.111111.11111111111111111111111.
Field_starts     .1............1...........1..........................................
Filtered_starts  1..........1.........1.......................................
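The four deletion rules can be checked against this example with a small scalar state machine. The function csvDataMask below is an illustrative sketch (the real computation is done with parallel bit-stream logic, not a sequential scan); it reproduces the CSV_data_mask shown above.

```cpp
#include <string>
#include <vector>

// Compute CSV_data_mask for one line: 1 where a data character is
// kept; 0 where CSV syntax is deleted (enclosing quotes, the first
// quote of a doubled pair, separator commas, CR before LF).
std::vector<bool> csvDataMask(const std::string &line) {
    std::vector<bool> mask(line.size());
    bool inQuotes = false;
    for (size_t i = 0; i < line.size(); i++) {
        char c = line[i];
        if (c == '"') {
            if (inQuotes && i + 1 < line.size() && line[i + 1] == '"') {
                mask[i] = false;      // quote_escape: deleted
                mask[i + 1] = true;   // escaped_quote: kept
                i++;
            } else {
                mask[i] = false;      // enclosing quote: deleted
                inQuotes = !inQuotes;
            }
        } else if (c == ',' && !inQuotes) {
            mask[i] = false;          // separator comma: deleted
        } else if (c == '\r' && i + 1 < line.size() && line[i + 1] == '\n') {
            mask[i] = false;          // CR before LF: deleted
        } else {
            mask[i] = true;           // data character: kept
        }
    }
    return mask;
}
```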

Expansion of Compressed Input

In this phase, ExpandedBasisBits is computed from FilteredBasisBits by inserting zero bits at marked points. The number of zero bits to insert is based on the template string to be inserted.

In order to do this, we first need to number the marked positions according to their string template index. We create a BixNum stream set giving, for each marked position, the index of the template string to be inserted there.

For the positions marking the starts of each field, we assign consecutive numbers starting with 0.
This can be achieved using the pablo EveryNth operation, where N is the number of fields in the CSV records. The BixNum values for other template strings are calculated using bitwise logic over the FilteredMarks.

Given these BixNum values for the template strings, we next want to compute another BixNum representing the number of 0 bits to insert at each position. This can be achieved by the Parabix utility StringInsertBixNum.

Given the number of zeroes to insert at selected positions, InsertionSpreadMask computes a mask that actually has 0 bits inserted at all the desired positions and 1 bits everywhere else. This mask can be used by the SpreadByMask operation to compute the ExpandedBasisBits.
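The shape of the mask produced in this step can be sketched in scalar form. The function below is an illustrative model of InsertionSpreadMask with InsertPosition::Before, not the Parabix utility itself: given the number of template characters to insert before each compressed-input position, it emits a run of 0s for each insertion followed by a 1 for the original data character.

```cpp
#include <string>
#include <vector>

// Scalar model of InsertionSpreadMask (InsertPosition::Before):
// insertLengths[i] is the number of template characters to insert
// before input position i.
std::vector<bool> insertionSpreadMask(const std::vector<unsigned> &insertLengths) {
    std::vector<bool> mask;
    for (unsigned n : insertLengths) {
        mask.insert(mask.end(), n, false); // room for template characters
        mask.push_back(true);              // the original data character
    }
    return mask;
}
```

Spreading the compressed basis bits through this mask yields ExpandedBasisBits, with zeroed gaps exactly where template strings belong.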

Filling in Template Values

The final step is to use the StringReplaceKernel to fill in template values.

As a guide to this entire process, it may be useful to refer to the icgrep colorization code, which does the insertion of color escape sequences for matched strings.

        std::string ESC = "\x1B";
        std::vector<std::string> colorEscapes = {ESC + "[01;31m" + ESC + "[K", ESC + "[m"};
        unsigned insertLengthBits = 4;

        StreamSet * const InsertBixNum = E->CreateStreamSet(insertLengthBits, 1);
        E->CreateKernelCall<StringInsertBixNum>(colorEscapes, InsertMarks, InsertBixNum);
        //E->CreateKernelCall<DebugDisplayKernel>("InsertBixNum", InsertBixNum);
        StreamSet * const SpreadMask = InsertionSpreadMask(E, InsertBixNum, InsertPosition::Before);
        //E->CreateKernelCall<DebugDisplayKernel>("SpreadMask", SpreadMask);

        // For each run of 0s marking insert positions, create a parallel
        // bixnum sequentially numbering the string insert positions.
        StreamSet * const InsertIndex = E->CreateStreamSet(insertLengthBits);
        E->CreateKernelCall<RunIndex>(SpreadMask, InsertIndex, nullptr, /*invert = */ true);
        //E->CreateKernelCall<DebugDisplayKernel>("InsertIndex", InsertIndex);

        StreamSet * FilteredBasis = E->CreateStreamSet(8, 1);
        E->CreateKernelCall<S2PKernel>(Filtered, FilteredBasis);

        // Basis bit streams expanded with 0 bits for each string to be inserted.
        StreamSet * ExpandedBasis = E->CreateStreamSet(8);
        SpreadByMask(E, SpreadMask, FilteredBasis, ExpandedBasis);
        //E->CreateKernelCall<DebugDisplayKernel>("ExpandedBasis", ExpandedBasis);

        // Map the match start/end marks to their positions in the expanded basis.
        StreamSet * ExpandedMarks = E->CreateStreamSet(2);
        SpreadByMask(E, SpreadMask, InsertMarks, ExpandedMarks);

        StreamSet * ColorizedBasis = E->CreateStreamSet(8);
        E->CreateKernelCall<StringReplaceKernel>(colorEscapes, ExpandedBasis, SpreadMask, ExpandedMarks, InsertIndex, ColorizedBasis);

        StreamSet * ColorizedBytes  = E->CreateStreamSet(1, 8);
        E->CreateKernelCall<P2SKernel>(ColorizedBasis, ColorizedBytes);