
csv2json

CSV to JSON

Introduction

The overall structure of CSV to JSON conversion has three phases.

  1. Determination of the CSV to JSON translation scheme for the particular file to be converted. This scheme identifies the number of fields in each record, the corresponding field names for JSON attributes and any other parameters that govern the translation.
  2. Parsing the CSV input to determine the records and fields.
  3. Transforming the parsed input according to the scheme to produce JSON output.

Example

As a simple running example, we use the following CSV input file.

Family Name,Given Name,email
Henderson,Paul,ph@sfu.ca
Lin,Qingshan,1234@zju.edu.cn

A corresponding JSON output could be as follows.

[
{"Family Name":"Henderson","Given Name":"Paul","email":"ph@sfu.ca"},
{"Family Name":"Lin","Given Name":"Qingshan","email":"1234@zju.edu.cn"}
]

CSV to JSON Scheme

  1. First, determine the number of fields in each record and their field names. Represent these as a C++ string vector FieldNames, such that the size of the vector is the number of fields. The list of field names could be taken from the first line of the file or supplied as a program parameter. Example: Family Name,Given Name,email

  2. Define a vector of TemplateStrings in which CSV field values will be embedded to produce the JSON output records. For example, in a compact fully-quoted mode, the template strings could be as follows.

    • {"Family Name":"
    • ","Given Name":"
    • ","email":"
    • "}
  3. Define three additional strings for combining JSON records together.

    1. JSON_prefix as a string to be printed at the very beginning of JSON output (e.g., [),
    2. JSON_separator as a string to be printed in between each record (e.g., ,\n), and
    3. JSON_suffix as a string to be printed at the end of the output (e.g., ]\n).
  4. One more template string, \, is required in case we need to escape a special character inside a JSON string.
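
As a concrete illustration, here is a minimal C++ sketch of such a scheme. The CSVtoJSONScheme struct and the makeTemplateStrings helper are hypothetical names used for illustration only, not part of the Parabix codebase.

    #include <string>
    #include <vector>

    // Hypothetical container for the translation scheme described above.
    struct CSVtoJSONScheme {
        std::vector<std::string> FieldNames;       // one entry per field
        std::vector<std::string> TemplateStrings;  // field count + 1 entries
        std::string JSON_prefix = "[";
        std::string JSON_separator = ",\n";
        std::string JSON_suffix = "]\n";
        std::string JSON_escape = "\\";
    };

    // Build the template strings from the field names, following the
    // compact fully-quoted mode shown above.
    std::vector<std::string> makeTemplateStrings(const std::vector<std::string> & FieldNames) {
        std::vector<std::string> templates;
        for (unsigned i = 0; i < FieldNames.size(); i++) {
            std::string prefix = (i == 0) ? "{\"" : "\",\"";
            templates.push_back(prefix + FieldNames[i] + "\":\"");
        }
        templates.push_back("\"}");  // closing template after the last field
        return templates;
    }

For the running example, makeTemplateStrings({"Family Name", "Given Name", "email"}) yields exactly the four template strings listed in step 2.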

CSV Parsing

CSV parsing is the process of identifying the beginning and ending of each data field in the CSV file. Following Parabix methods, we define bit streams for significant positions.

For each field, let its start position be the position of the first character in the field, and let its follow position be the position immediately after the last character of the field. For our example, we have the following, where newlines are marked by ⏎.

Data stream:     Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Field starts:    1.........1....1.........1...1........1...............
Field follows:   .........1....1.........1...1........1...............1

We also need the starts and follows of records.

Data stream:     Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Record starts:   1........................1............................
Record follows:  ........................1............................1
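
Ignoring double quoting for the moment (quoted commas and newlines are dealt with in the following sections), these four streams can be computed from the comma and newline character classes. A minimal Pablo sketch, assuming comma and newline character-class streams are already available:

    // Start-of-file marker: a 1 bit at position 0 only.
    PabloAST * SOF = pb.createNot(pb.createAdvance(pb.createOnes(), 1));
    // Every comma or newline is the follow position of some field.
    PabloAST * field_sep = pb.createOr(comma, newline);
    PabloAST * field_follows = field_sep;
    PabloAST * field_starts = pb.createOr(SOF, pb.createAdvance(field_sep, 1));
    // Records start at the beginning of the file or just after a newline.
    PabloAST * record_follows = newline;
    PabloAST * record_starts = pb.createOr(SOF, pb.createAdvance(newline, 1));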

Dealing with Double Quotes

A key issue in CSV processing is dealing with double quotes ("). Double quotes are used as delimiters at the beginning and end of strings. In addition, a double quote inside a string is represented by a consecutive pair of double quotes (""). Parsing CSV requires that all of these double quotes be correctly classified as one of four types.

  1. The start of a double quoted string field.
  2. The end of a double quoted string field.
  3. The first double quote of a pair of consecutive double quotes inside a string.
  4. The second double quote of a pair of consecutive double quotes inside a string.

To parse using Parabix methods, we can rely on some important observations.

  1. A correct CSV file will always have an even number of double quotes.
  2. Any double quotes before the start of a string must be balanced, that is, even in number.
  3. There must be one unmatched double quote prior to any double quote that marks a string end.
  4. A double quote that occurs while an unmatched prior double quote is pending is normally a string end, unless the following character is another double quote, in which case the pair of consecutive double quotes marks a literal double quote character within the string value.

Let dquote be a character class bit stream for double quotes. Then the Pablo operation EveryNth is very handy to distinguish between potential starting and ending double quotes. Using a Pablo builder pb, these streams can be created by the following operations.

    PabloAST * dquote_odd = pb.createEveryNth(dquote, pb.getInteger(2));
    PabloAST * dquote_even = pb.createXor(dquote, dquote_odd);

The dquote_odd stream marks the 1st, 3rd, 5th, ... double quotes in the file, that is, the potential starting quotes. The dquote_even stream marks the 2nd, 4th, 6th, and so on, the potential ending quotes.

But we also need to deal with consecutive pairs of double quotes that represent literal double quotes to be included in the text. A potential end quote is not actually a string end if the immediately following character is another double quote. The Pablo Lookahead operation can be used to determine this.

    PabloAST * quote_escape = pb.createAnd(dquote_even, pb.createLookahead(dquote, 1));
    PabloAST * escaped_quote = pb.createAdvance(quote_escape, 1);
    PabloAST * start_dquote = pb.createXor(dquote_odd, escaped_quote);
    PabloAST * end_dquote = pb.createXor(dquote_even, quote_escape);

For example, let us see how these streams mark the following CSV text.

Data stream:     "Free speech",limitation,"Never yell ""Fire!"" in a crowded theatre."
dquote:          1...........1............1...........11.....11......................1
dquote_odd:      1........................1............1......1.......................
dquote_even:     ............1........................1......1.......................1
quote_escape:    .....................................1......1........................
escaped_quote:   ......................................1......1.......................
start_dquote:    1........................1...........................................
end_dquote:      ............1.......................................................1

Newlines: Record Ends or Literal Newline Characters

A second parsing issue for CSV is to identify and classify newline characters as either the ends of CSV records or literal newline characters in a string.

The Pablo intrinsic InclusiveSpan is useful for this. This operation produces a run of 1 bits marking all positions between opening and closing delimiters. Thus a newline character occurring inside a span from a starting dquote to an ending dquote always marks a literal newline in the text.

    PabloAST * quoted_span = pb.createIntrinsicCall(pablo::Intrinsic::InclusiveSpan, {start_dquote, end_dquote});
    PabloAST * literal_newline = pb.createAnd(newline, quoted_span);
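
Conversely, the newlines that actually terminate records are those falling outside any quoted span:

    PabloAST * record_separator = pb.createAnd(newline, pb.createNot(quoted_span));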

String Transformation

In general, the string transformation process requires the modification of the CSV input to insert the required JSON template strings. There are three steps.

  1. Compression of input to remove undesired non-data characters from the CSV input.
  2. Expansion of the compressed input to the correct size, inserting zeroes.
  3. Filling in template values at the inserted positions.

These three steps are implemented using the FilterByMask, SpreadByMask, and StringReplaceKernel facilities of the Parabix infrastructure.

Compression of CSV Input

In this step, CSV syntax marks are removed from the input, to leave the raw CSV data only.

  1. Double quotes surrounding fields are deleted. In general, double quotes will be added back during the expansion step (because they are in the expansion templates).
  2. The quote_escape characters are deleted, but the escaped_quote characters are kept.
  3. Commas are deleted.
  4. A CR before an LF should be deleted, to standardize uniformly on LF as the line separator.

Compression is achieved by creating a mask with 1 bits at all character positions that are to be kept and 0 bits at all positions to be deleted. Call this mask CSV_data_mask.
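
A Pablo sketch of this mask computation, assuming character-class streams comma, CR and LF together with the classification streams from the parsing section (note that only commas outside quoted spans are field separators to be deleted):

    PabloAST * syntax = pb.createOr(start_dquote, end_dquote);
    syntax = pb.createOr(syntax, quote_escape);
    syntax = pb.createOr(syntax, pb.createAnd(comma, pb.createNot(quoted_span)));
    syntax = pb.createOr(syntax, pb.createAnd(CR, pb.createLookahead(LF, 1)));
    PabloAST * CSV_data_mask = pb.createNot(syntax);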

Given this mask, we need to apply FilterByMask to produce two streamsets:

  1. FilteredBasisBits is produced by filtering the eight basis bit streams.
  2. FilteredMarks is produced by filtering three streams produced from parsing: (a) the positions of the starts of each record, (b) the positions of the starts of each field, and (c) the positions of any characters that need to be escaped with \ in the JSON output (escaped_quotes and embedded newlines).
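
At the pipeline level, this might look as follows. The StreamSet names are illustrative, and FilterByMask is assumed to take (builder, mask, input, output) arguments analogous to the SpreadByMask call shown in the colorization code below.

        StreamSet * FilteredBasisBits = E->CreateStreamSet(8);
        FilterByMask(E, CSV_data_mask, BasisBits, FilteredBasisBits);

        // Record starts, field starts and to-be-escaped positions,
        // filtered down to data positions only.
        StreamSet * FilteredMarks = E->CreateStreamSet(3);
        FilterByMask(E, CSV_data_mask, ParsedMarks, FilteredMarks);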

Filtering reduces the length of the stream. For example, filtering the field starts of the quoted example above yields the following.

Data stream:     "Free speech",limitation,"Never yell ""Fire!"" in a crowded theatre."
CSV_data_mask:   .11111111111..1111111111..11111111111.111111.11111111111111111111111.
Field_starts:    .1............1...........1..........................................
Filtered_starts: 1..........1.........1.......................................

Expansion of Compressed Input

In this phase, ExpandedBasisBits is computed from FilteredBasisBits by inserting zero bits at marked points. The number of zero bits to insert at each point is determined by the template string to be inserted there.

To do this, we first need to number the marked positions according to their template string index. We create a BixNum stream set giving, at each insertion position, the index of the template string to be inserted there.

For the positions marking the starts of each field, we assign consecutive numbers starting with 0. This can be achieved using the Pablo EveryNth operation, where N is the number of fields in the CSV records. The BixNum values for the other template strings are calculated using bitwise logic over the FilteredMarks, as sketched below.
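
For example, given one marker stream per template string (templateMarks[i] marking every position at which template string i is to be inserted; a hypothetical name), the indices can be packed into a BixNum as follows:

    // Bit j of the BixNum is set at each position whose template index
    // has bit j set.
    unsigned indexBits = (unsigned) std::ceil(std::log2(templateMarks.size()));
    std::vector<PabloAST *> insertIndex(indexBits);
    for (unsigned j = 0; j < indexBits; j++) {
        PabloAST * bit = pb.createZeroes();
        for (unsigned i = 0; i < templateMarks.size(); i++) {
            if ((i >> j) & 1) bit = pb.createOr(bit, templateMarks[i]);
        }
        insertIndex[j] = bit;
    }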

Given these BixNum values for the template strings, we next compute another BixNum representing the number of 0 bits to insert at each position. This can be achieved by the Parabix utility StringInsertBixNum.

Given the number of zeroes to insert at selected positions, InsertionSpreadMask computes a mask with 0 bits inserted at all the desired positions and 1 bits everywhere else. This mask can then be used by the SpreadByMask operation to compute ExpandedBasisBits.

Filling in Template Values

The final step is to use the StringReplaceKernel to fill in template values.

As a guide to this entire process, it may be useful to refer to the icgrep colorization code, which does the insertion of color escape sequences for matched strings.

        std::string ESC = "\x1B";
        std::vector<std::string> colorEscapes = {ESC + "[01;31m" + ESC + "[K", ESC + "[m"};
        unsigned insertLengthBits = 4;

        StreamSet * const InsertBixNum = E->CreateStreamSet(insertLengthBits, 1);
        E->CreateKernelCall<StringInsertBixNum>(colorEscapes, InsertMarks, InsertBixNum);
        //E->CreateKernelCall<DebugDisplayKernel>("InsertBixNum", InsertBixNum);
        StreamSet * const SpreadMask = InsertionSpreadMask(E, InsertBixNum, InsertPosition::Before);
        //E->CreateKernelCall<DebugDisplayKernel>("SpreadMask", SpreadMask);

        // For each run of 0s marking insert positions, create a parallel
        // bixnum sequentially numbering the string insert positions.
        StreamSet * const InsertIndex = E->CreateStreamSet(insertLengthBits);
        E->CreateKernelCall<RunIndex>(SpreadMask, InsertIndex, nullptr, /*invert = */ true);
        //E->CreateKernelCall<DebugDisplayKernel>("InsertIndex", InsertIndex);

        StreamSet * FilteredBasis = E->CreateStreamSet(8, 1);
        E->CreateKernelCall<S2PKernel>(Filtered, FilteredBasis);

        // Basis bit streams expanded with 0 bits for each string to be inserted.
        StreamSet * ExpandedBasis = E->CreateStreamSet(8);
        SpreadByMask(E, SpreadMask, FilteredBasis, ExpandedBasis);
        //E->CreateKernelCall<DebugDisplayKernel>("ExpandedBasis", ExpandedBasis);

        // Map the match start/end marks to their positions in the expanded basis.
        StreamSet * ExpandedMarks = E->CreateStreamSet(2);
        SpreadByMask(E, SpreadMask, InsertMarks, ExpandedMarks);

        StreamSet * ColorizedBasis = E->CreateStreamSet(8);
        E->CreateKernelCall<StringReplaceKernel>(colorEscapes, ExpandedBasis, SpreadMask, ExpandedMarks, InsertIndex, ColorizedBasis);

        StreamSet * ColorizedBytes  = E->CreateStreamSet(1, 8);
        E->CreateKernelCall<P2SKernel>(ColorizedBasis, ColorizedBytes);