Last edited by cameron Nov 10, 2021

csv2json

CSV to JSON

Introduction

The overall CSV to JSON conversion process has three phases.

  1. Determination of the CSV to JSON translation scheme for the particular file to be converted. This scheme identifies the number of fields in each record, the corresponding field names for JSON attributes and any other parameters that govern the translation.
  2. Parsing the CSV input to determine the records and fields.
  3. Transforming the parsed input according to the scheme to produce JSON output.

Example

As a simple running example, we use the following CSV input file.

Family Name,Given Name,email
Henderson,Paul,ph@sfu.ca
Lin,Qingshan,1234@zju.edu.cn

A corresponding JSON output could be as follows.

[
{"Family Name":"Henderson","Given Name":"Paul","email":"ph@sfu.ca"},
{"Family Name":"Lin","Given Name":"Qingshan","email":"1234@zju.edu.cn"}
]

CSV to JSON Scheme

  1. First determine the number of fields in each record and their field names. Represent these as a C++ string vector FieldNames whose size is the number of fields. The list of fields could be taken from the first line of the file or supplied as a program parameter. Example: Family Name,Given Name,email

  2. Define a vector of TemplateStrings in which CSV field values will be embedded to produce the JSON output records. For example, in a compact fully-quoted mode, the template strings could be as follows.

    • {"Family Name":"
    • ","Given Name":"
    • ","email":"
    • "}
  3. Define three additional strings for combining JSON records together.

    1. JSON_prefix as a string to be printed at the very beginning of JSON output (e.g., [),
    2. JSON_separator as a string to be printed in between each record (e.g., ,\n), and
    3. JSON_suffix as a string to be printed at the end of the output (e.g., ]\n).
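
As a concrete sketch, the scheme data might be held in a small struct, with the template strings derived from the field names. The names Csv2JsonScheme and makeTemplateStrings below are illustrative only, not part of the actual Parabix code.

```cpp
#include <string>
#include <vector>

// Hypothetical container for the translation scheme described above.
struct Csv2JsonScheme {
    std::vector<std::string> FieldNames;       // one entry per field
    std::vector<std::string> TemplateStrings;  // FieldNames.size() + 1 entries
    std::string JSON_prefix = "[";             // printed at the very beginning
    std::string JSON_separator = ",\n";        // printed between records
    std::string JSON_suffix = "]\n";           // printed at the end
};

// Build the compact fully-quoted template strings from the field names.
inline std::vector<std::string> makeTemplateStrings(const std::vector<std::string> & fieldNames) {
    std::vector<std::string> templates;
    for (std::size_t i = 0; i < fieldNames.size(); ++i) {
        // The first template opens the record; later ones close the
        // previous field value first.
        std::string prefix = (i == 0) ? "{\"" : "\",\"";
        templates.push_back(prefix + fieldNames[i] + "\":\"");
    }
    templates.push_back("\"}");  // closes the last field value and the record
    return templates;
}
```

For the running example, makeTemplateStrings({"Family Name", "Given Name", "email"}) yields exactly the four template strings listed above.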

CSV Parsing

CSV parsing is the process of identifying the beginning and ending of each data field in the CSV file. Following Parabix methods, we define bit streams for significant positions.

For each field, let its start position be the position of the first character in the field, and let its follow position be the position immediately after the last character of the field. For our example, we have the following, where newlines are marked by ⏎.

Data stream:     Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Field starts:    1.........1....1.........1...1........1...............
Field follows:   .........1....1.........1...1........1...............1

We also need the starts and ends of records.

Data stream:     Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Record starts:   1........................1............................
Record follows:  ........................1............................1
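
For intuition, the field marker streams above can be reproduced with a simple sequential scan. This is an illustrative sketch only (the names FieldMarks and markFields are hypothetical, and it handles only unquoted fields); in Parabix these streams are computed in parallel as bit streams rather than by scanning.

```cpp
#include <cstddef>
#include <string>

// Field "starts" and "follows" markers, modelled as '1'/'.' strings.
struct FieldMarks { std::string starts; std::string follows; };

inline FieldMarks markFields(const std::string & csv) {
    FieldMarks m;
    m.starts.assign(csv.size(), '.');
    m.follows.assign(csv.size(), '.');
    bool atFieldStart = true;
    for (std::size_t i = 0; i < csv.size(); ++i) {
        char c = csv[i];
        if (c == ',' || c == '\n') {
            m.follows[i] = '1';   // position just past the field's last character
            atFieldStart = true;  // the next character begins a new field
        } else if (atFieldStart) {
            m.starts[i] = '1';
            atFieldStart = false;
        }
    }
    return m;
}
```

Applied to the example data stream, this reproduces the field starts at positions 0, 10, 15, 25, 29, and 38, and the field follows at each comma and newline.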

Dealing with Double Quotes

A key issue in CSV processing is dealing with double quotes ("). Double quotes are used as delimiters at the beginning and end of strings. In addition, a double quote inside a string is represented by a consecutive pair of double quotes (""). Parsing CSV requires that all of these double quotes be correctly classified as one of four types.

  1. The start of a double quoted string field.
  2. The end of a double quoted string field.
  3. The first double quote of a pair of consecutive double quotes inside a string.
  4. The second double quote of a pair of consecutive double quotes inside a string.

To parse using Parabix methods, we can rely on some important observations.

  1. A correct CSV file will always have an even number of double quotes.
  2. Any double quotes before the start of a string must be balanced, that is even in number.
  3. There must be one unmatched double quote prior to any double quote that marks a string end.
  4. A double quote that occurs when there is an unmatched prior double quote pending is normally a string end, unless the following character is another double quote, in which case the pair of consecutive double quotes marks a literal double quote character within the string value.

Let dquote be a character class bit stream for double quotes. Then the Pablo operation EveryNth is very handy to distinguish between potential starting and ending double quotes. Using a Pablo builder pb, these streams can be created by the following operations.

    PabloAST * dquote_odd = pb.createEveryNth(dquote, pb.getInteger(2));
    PabloAST * dquote_even = pb.createXor(dquote, dquote_odd);

The dquote_odd stream marks the 1st, 3rd, 5th, ... double quotes in the file, that is, the potential starting quotes. The dquote_even stream marks the 2nd, 4th, 6th, and so on: the potential end quotes.

But we also need to deal with consecutive pairs of double quotes that represent literal double quotes to be included in the text. A potential end quote is not actually a string end if the immediately following character is another double quote. The Pablo Lookahead operation can be used to determine this.

    PabloAST * quote_escape = pb.createAnd(dquote_even, pb.createLookahead(dquote, 1));
    PabloAST * escaped_quote = pb.createAdvance(quote_escape, 1);
    PabloAST * start_dquote = pb.createXor(dquote_odd, escaped_quote);
    PabloAST * end_dquote = pb.createXor(dquote_even, quote_escape);

For example, let us see how these streams are marked for the following CSV text.


Data stream:     "Free speech",limitation,"Never yell ""Fire!"" in a crowded theatre."
dquote:          1...........1............1...........11.....11......................1
dquote_odd:      1........................1............1......1.......................
dquote_even:     ............1........................1......1.......................1
quote_escape:    .....................................1......1........................
escaped_quote:   ......................................1......1.......................
start_dquote:    1........................1...........................................
end_dquote:      ............1.......................................................1

Newlines: Record Ends or Literal Newline Characters

A second parsing issue for CSV is to identify and classify newline characters as either the ends of CSV records or literal newline characters in a string.

The Pablo intrinsic InclusiveSpan is useful for this. This operation produces a run of bits marking all positions between opening and closing delimiters. Thus a newline character occurring inside a span from a starting dquote to an ending dquote always marks a literal newline in the text.

    PabloAST * literal_newline = pb.createAnd(newline, pb.createIntrinsicCall(pablo::Intrinsic::InclusiveSpan, {start_dquote, end_dquote}));
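
A sequential sketch of the InclusiveSpan semantics, again using '1'/'.' strings for bit streams (illustrative only, not the Pablo implementation):

```cpp
#include <cstddef>
#include <string>

using Bits = std::string;  // '1' = bit set, '.' = bit clear

// Mark every position from each start marker through the matching end marker.
inline Bits inclusiveSpan(const Bits & starts, const Bits & ends) {
    Bits r(starts.size(), '.');
    bool inSpan = false;
    for (std::size_t i = 0; i < starts.size(); ++i) {
        if (starts[i] == '1') inSpan = true;
        if (inSpan) r[i] = '1';
        if (ends[i] == '1') inSpan = false;
    }
    return r;
}

inline Bits bitsAnd(const Bits & a, const Bits & b) {
    Bits r(a.size(), '.');
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] == '1' && b[i] == '1') r[i] = '1';
    return r;
}
```

ANDing the newline stream with the span stream selects exactly the literal newlines inside quoted strings; the remaining newlines are record ends.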

String Transformation

In general, the string transformation process requires modifying the CSV input to insert the required JSON template strings. There are two steps.

  1. Expansion of the input to the correct size, inserting zeroes.
  2. Filling in template values at the inserted positions.

These two steps will be implemented using the SpreadByMask and StringReplaceKernel of the Parabix infrastructure. More information on this process will be provided later.
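
As a preview, the two steps can be sketched sequentially. The functions below are illustrative pseudocode for what the parallel kernels do, under the assumption that each run of inserted zero bytes has the same length as the template string that fills it: spreadByMask places each input byte at its final output position, and fillTemplates substitutes the template strings into the zero-filled gaps.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Step 1: expand the data to its final size. mask[i] == '1' marks a
// position that receives the next data byte; '.' positions are filled
// with NUL bytes to be replaced later.
inline std::string spreadByMask(const std::string & mask, const std::string & data) {
    std::string out(mask.size(), '\0');
    std::size_t j = 0;
    for (std::size_t i = 0; i < mask.size(); ++i)
        if (mask[i] == '1') out[i] = data[j++];
    return out;
}

// Step 2: replace each run of NUL bytes with the next template string.
inline std::string fillTemplates(const std::string & expanded,
                                 const std::vector<std::string> & inserts) {
    std::string out;
    std::size_t next = 0;
    for (std::size_t i = 0; i < expanded.size(); ) {
        if (expanded[i] != '\0') { out += expanded[i++]; continue; }
        out += inserts[next++];                            // fill this gap
        while (i < expanded.size() && expanded[i] == '\0') ++i;
    }
    return out;
}
```

For instance, spreading the single field "Paul" into a 15-byte output with a 9-byte gap before it and a 2-byte gap after, then filling the gaps with {"name":" and "}, produces the JSON record {"name":"Paul"}.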
