Last edited by cameron Nov 10, 2021

CSV Parsing

CSV parsing is the process of identifying the beginning and ending of each data field in the CSV file. Following Parabix methods, we define bit streams for significant positions.

For each field, let its start position be the position of the first character in the field, and let its follow position be the position immediately after the last character of the field. For our example, we have the following, where newlines are marked by ⏎.

Data stream:     Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Field starts:    1.........1....1.........1...1........1...............
Field follows:   .........1....1.........1...1........1...............1

We also need the starts and follows of records.

Data stream:     Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Record starts:   1........................1............................
Record follows:  ........................1............................1

Note that, if a field is empty, the field start and the field follow are both on the separator.

Rather than calculating and saving these four streams, however, we can define two streams that give us all the information we need. The first stream is the field separators and the second stream is the record separators. Note that we include each record separator as a field separator as well.

Data stream:         Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Field separators:    .........1....1.........1...1........1...............1
Record separators:   ........................1............................1
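These two separator streams can be modeled directly in plain Python (a sketch for illustration, not the Pablo API), representing bit streams as sets of marked positions:

```python
# Plain-Python model of the two separator bitstreams, for CSV text that
# contains no quoted fields (quoting is handled in the next section).
def separator_streams(data: str):
    # Field separators: commas and newlines; record separators: newlines only.
    field = {i for i, c in enumerate(data) if c in ',\n'}
    record = {i for i, c in enumerate(data) if c == '\n'}
    return field, record

data = "Henderson,Paul,ph@sfu.ca\nLin,Qingshan,1234@zju.edu.cn\n"
field_seps, record_seps = separator_streams(data)
```

The marked positions agree with the diagram above: field separators at positions 9, 14, 24, 28, 37, and 53, and record separators at 24 and 53.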

Dealing with Double Quotes

A key issue in CSV processing is dealing with double quotes ("). Double quotes are used as delimiters at the beginning and end of strings. In addition, a double quote inside a string is represented by a consecutive pair of double quotes (""). Parsing CSV requires that all of these double quotes be correctly classified as one of four types.

  1. The start of a double quoted string field.
  2. The end of a double quoted string field.
  3. The first double quote of a pair of consecutive double quotes inside a string.
  4. The second double quote of a pair of consecutive double quotes inside a string.

To parse using Parabix methods, we can rely on some important observations.

  1. A correct CSV file will always have an even number of double quotes.
  2. Any double quotes before the start of a string must be balanced, that is, even in number.
  3. There must be one unmatched double quote prior to any double quote that marks a string end.
  4. A double quote that occurs when there is an unmatched prior double quote pending is normally a string end, unless the following character is another double quote, in which case the pair of consecutive double quotes marks a literal double quote character within the string value.

Let dquote be a character class bit stream for double quotes. The Pablo operation EveryNth is then very handy for distinguishing between potential starting and ending double quotes. Using a Pablo builder pb, these streams can be created by the following operations.

    PabloAST * dquote_odd = pb.createEveryNth(dquote, pb.getInteger(2));
    PabloAST * dquote_even = pb.createXor(dquote, dquote_odd);

The dquote_odd stream will refer to the 1st, 3rd, 5th, ... double quote in the file, that is, the potential starting quotes. The dquote_even stream refers to the 2nd, 4th, 6th, and so on: the potential ending quotes.
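As an illustration, here is a plain-Python sketch (not the Pablo API) of this odd/even split, modeling bit streams as sets of positions; `every_nth_odd` mimics the behavior of EveryNth described above for n = 2, marking the 1st, 3rd, 5th, ... set bit:

```python
# Plain-Python sketch of the odd/even quote split (not the Pablo API).
def every_nth_odd(positions, n):
    # Mark the 1st, (n+1)th, (2n+1)th, ... marked position, matching the
    # text's description of EveryNth for n = 2.
    return {p for k, p in enumerate(sorted(positions)) if k % n == 0}

data = '"Free speech",limitation,"Never yell ""Fire!"" in a crowded theatre."'
dquote = {i for i, c in enumerate(data) if c == '"'}
dquote_odd = every_nth_odd(dquote, 2)
dquote_even = dquote ^ dquote_odd   # XOR of the two streams
```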

But we also need to deal with consecutive pairs of double quotes that represent literal double quotes to be included in the text. Whenever we see a potential end quote, it is not actually a string end if the immediately following character is another double quote. The Pablo Lookahead operation can be used to determine this.

    PabloAST * quote_escape = pb.createAnd(dquote_even, pb.createLookahead(dquote, 1));
    PabloAST * escaped_quote = pb.createAdvance(quote_escape, 1);
    PabloAST * start_dquote = pb.createXor(dquote_odd, escaped_quote);
    PabloAST * end_dquote = pb.createXor(dquote_even, quote_escape);

For example, let us see how these streams mark the following CSV text.


Data stream:     "Free speech",limitation,"Never yell ""Fire!"" in a crowded theatre."
dquote:          1...........1............1...........11.....11......................1
dquote_odd       1........................1............1......1.......................
dquote_even      ............1........................1......1.......................1
quote_escape     .....................................1......1........................
escaped_quote    ......................................1......1.......................
start_dquote     1........................1...........................................
end_dquote       ............1.......................................................1
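The classification above can be checked with a small plain-Python model (not the Pablo API), where Lookahead corresponds to testing the bit one position ahead and Advance to shifting a stream forward by one position:

```python
# Plain-Python model of the quote-classification step (not the Pablo API).
# Streams are sets of positions; XOR is set symmetric difference.
data = '"Free speech",limitation,"Never yell ""Fire!"" in a crowded theatre."'
dquote = {i for i, c in enumerate(data) if c == '"'}
odd = {p for k, p in enumerate(sorted(dquote)) if k % 2 == 0}   # 1st, 3rd, ...
even = dquote - odd                                             # 2nd, 4th, ...
quote_escape = {p for p in even if p + 1 in dquote}   # Lookahead(dquote, 1)
escaped_quote = {p + 1 for p in quote_escape}         # Advance(quote_escape, 1)
start_dquote = odd ^ escaped_quote
end_dquote = even ^ quote_escape
```

Running this reproduces the diagram: quote_escape marks positions 37 and 44, escaped_quote marks 38 and 45, and the true string delimiters are start_dquote at 0 and 25 and end_dquote at 12 and 68.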

Literal Newlines and Commas

Whenever we have quoted strings in a CSV file, it is possible that the string may have embedded commas or newline characters. It is important that these embedded commas and newlines are treated as data characters and not as field delimiters.

The Pablo intrinsic InclusiveSpan is useful for this. This operation produces a run of bits marking all positions between opening and closing delimiters. Thus a comma or newline character occurring inside a span from starting dquotes to ending dquotes will always mark a literal character to be included in the string.

For example, if Comma and LF are character class streams identifying all commas and newlines within the CSV file, then we can determine which of them are actual field and record separators as follows.

    PabloAST * quoted_data = pb.createIntrinsicCall(pablo::Intrinsic::InclusiveSpan, {start_dquote, end_dquote});
    PabloAST * unquoted = pb.createNot(quoted_data);
    PabloAST * record_separators = pb.createAnd(LF, unquoted);
    PabloAST * field_separators = pb.createOr(pb.createAnd(Comma, unquoted), record_separators);

Having determined the field_separators and record_separators in this way, we have now completed the basic task of CSV parsing.
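As a final sanity check, here is a plain-Python model (not the Pablo API) of InclusiveSpan and the separator computation, run on the quoted example from the previous section; the start and end quote positions {0, 25} and {12, 68} are taken from that worked example:

```python
# Plain-Python model of InclusiveSpan plus the final separator computation.
data = '"Free speech",limitation,"Never yell ""Fire!"" in a crowded theatre."'
start_dquote, end_dquote = {0, 25}, {12, 68}   # from the quote classification
quoted = set()
in_span = False
for i in range(len(data)):
    if i in start_dquote:
        in_span = True
    if in_span:
        quoted.add(i)          # InclusiveSpan marks start..end inclusive
    if i in end_dquote:
        in_span = False
comma = {i for i, c in enumerate(data) if c == ','}
lf = {i for i, c in enumerate(data) if c == '\n'}
record_separators = lf - quoted                        # LF AND NOT quoted
field_separators = (comma - quoted) | record_separators
```

Here both commas (positions 13 and 24) fall outside the quoted spans, so they are genuine field separators, while the quotes, double-quote pairs, and text inside the spans are all treated as data.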
