## CSV Parsing

CSV parsing is the process of identifying where each data field in the CSV file begins
and ends. Following Parabix methods, we define bit streams that mark the significant positions.

For each field, let its start position be the position of the first character in the field, and
let its follow position be the position immediately after the last character of the field.
For our example, we have the following, where newlines are marked by `⏎`.

```
Data stream:    Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Field starts:   1.........1....1.........1...1........1...............
Field follows:  .........1....1.........1...1........1...............1
```

We also need the start and follow positions of records.
```
Data stream:     Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Record starts:   1........................1............................
Record follows:  ........................1............................1
```

Note that, if a field is empty, its field start and its field follow are
both on the separator that ends the field.

Rather than calculating and saving these four streams, however, we can define
two streams that give us all the information we need. The first stream marks the
field separators and the second stream marks the record separators. Note that
we include each record separator as a field separator as well. Every follow position
is then a separator position, and every start position is either the first position
of the file or the position immediately after a separator.

```
Data stream:       Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Field separators:  .........1....1.........1...1........1...............1
Record separators: ........................1............................1
```

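The relationship between the two separator streams and the four start/follow streams can be sketched with a scalar model. This is plain C++ operating on strings of `1` and `.`, not Pablo code; the helper names `markChar`, `bitOr`, `startsFromSeparators` and `positions` are hypothetical, and the sketch assumes input without quoted fields.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Scalar model of a bit stream: one character per position, '1' or '.',
// written in the same order as the diagrams.
using BitStream = std::string;

// Mark every position of `text` that holds the byte `c`.
BitStream markChar(const std::string & text, char c) {
    BitStream marks(text.size(), '.');
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == c) marks[i] = '1';
    }
    return marks;
}

// Bitwise OR of two streams of equal length.
BitStream bitOr(const BitStream & a, const BitStream & b) {
    assert(a.size() == b.size());
    BitStream r(a.size(), '.');
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == '1' || b[i] == '1') r[i] = '1';
    }
    return r;
}

// Recover start positions from separators: the first position is a start,
// and so is the position right after each separator. A start that would
// fall past the end of the stream is dropped.
BitStream startsFromSeparators(const BitStream & seps) {
    BitStream starts(seps.size(), '.');
    if (!starts.empty()) starts[0] = '1';
    for (std::size_t i = 0; i + 1 < seps.size(); ++i) {
        if (seps[i] == '1') starts[i + 1] = '1';
    }
    return starts;
}

// List the marked positions of a stream, for inspection.
std::vector<std::size_t> positions(const BitStream & s) {
    std::vector<std::size_t> p;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '1') p.push_back(i);
    }
    return p;
}
```

With the example data (using `\n` for `⏎`), the field separators are `bitOr(markChar(text, ','), markChar(text, '\n'))`, and the follow streams are the separator streams themselves. Applying `startsFromSeparators` to the field and record separators should then give the field starts and record starts shown in the earlier diagrams.
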
### Dealing with Double Quotes

A key issue in CSV processing is dealing with double quotes (`"`). Double quotes are used
as delimiters at the beginning and end of strings. In addition, a double quote inside a string
is represented by a consecutive pair of double quotes (`""`). Parsing CSV requires that each of
these double quotes be correctly classified as one of four types.
1. The start of a double-quoted string field.
2. The end of a double-quoted string field.
3. The first double quote of a pair of consecutive double quotes inside a string.
4. The second double quote of a pair of consecutive double quotes inside a string.

To parse using Parabix methods, we can rely on some important observations.
1. A correct CSV file will always have an even number of double quotes.
2. Any double quotes before the start of a string must be balanced, that is, even in number.
3. There must be one unmatched double quote prior to any double quote that marks a string end.
4. A double quote that occurs while an unmatched prior double quote is pending is normally a string end, unless the following character is another double quote, in which case the pair of consecutive double quotes marks a literal double quote character within the string value.

Let `dquote` be a character class bit stream for double quotes. Then the Pablo operation `EveryNth` is
very handy for distinguishing between potential starting and ending double quotes. Using a Pablo builder `pb`, these streams can be created by the following operations.
```
PabloAST * dquote_odd = pb.createEveryNth(dquote, pb.getInteger(2));
PabloAST * dquote_even = pb.createXor(dquote, dquote_odd);
```
The `dquote_odd` stream marks the 1st, 3rd, 5th, ... double quotes in the file, that is, the potential
starting quotes.
The `dquote_even` stream marks the 2nd, 4th, 6th, ... double quotes, the potential ending quotes.

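As an illustration of the odd/even split, here is a scalar model in plain C++. It is a sketch only: `everyNth` and `bitXor` are hypothetical helpers on strings of `1` and `.`, and the selection rule (keep the 1st, 3rd, 5th, ... marked positions when `n` is 2) is an assumption taken from the description above, not from the Pablo implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Scalar model of a bit stream: '1' for a marked position, '.' otherwise.
using BitStream = std::string;

// Keep the 1st, (n+1)th, (2n+1)th, ... marked positions of `s`.
// Assumption: this mirrors the EveryNth behaviour described in the text.
BitStream everyNth(const BitStream & s, std::size_t n) {
    assert(n > 0);
    BitStream r(s.size(), '.');
    std::size_t seen = 0;  // marked positions encountered so far
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] != '1') continue;
        if (seen % n == 0) r[i] = '1';
        ++seen;
    }
    return r;
}

// Bitwise XOR of two streams of equal length.
BitStream bitXor(const BitStream & a, const BitStream & b) {
    assert(a.size() == b.size());
    BitStream r(a.size(), '.');
    for (std::size_t i = 0; i < a.size(); ++i) {
        if ((a[i] == '1') != (b[i] == '1')) r[i] = '1';
    }
    return r;
}
```

Given a `dquote` stream, `everyNth(dquote, 2)` plays the role of `dquote_odd`, and `bitXor(dquote, dquote_odd)` leaves exactly the remaining quotes, which is `dquote_even`.
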
But we also need to deal with consecutive pairs of double quotes that represent literal double quotes to be included in the text. Whenever we see a potential end quote, it is not actually a string end if the immediately following
character is another double quote.
The Pablo `Lookahead` operation can be used to determine this.
```
PabloAST * quote_escape = pb.createAnd(dquote_even, pb.createLookahead(dquote, 1));
PabloAST * escaped_quote = pb.createAdvance(quote_escape, 1);
PabloAST * start_dquote = pb.createXor(dquote_odd, escaped_quote);
PabloAST * end_dquote = pb.createXor(dquote_even, quote_escape);
```

For example, let us see which positions are marked in each stream for the following CSV text.
```
Data stream:    "Free speech",limitation,"Never yell ""Fire!"" in a crowded theatre."
dquote:         1...........1............1...........11.....11......................1
dquote_odd:     1........................1............1......1.......................
dquote_even:    ............1........................1......1.......................1
quote_escape:   .....................................1......1........................
escaped_quote:  ......................................1......1.......................
start_dquote:   1........................1...........................................
end_dquote:     ............1.......................................................1
```

### Literal Newlines and Commas

Whenever we have quoted strings in a CSV file, the strings may
contain embedded commas or newline characters. It is important that these embedded
commas and newlines are treated as data characters and not as field or record delimiters.

The Pablo intrinsic `InclusiveSpan` is useful for this. This operation produces a run
of bits marking all positions between opening and closing delimiters. Thus a comma or newline character occurring inside a span from a starting dquote to its ending dquote will always mark a literal character to be included in the string.

For example, if `Comma` and `LF` are streams identifying all commas and newlines within the CSV file, then we can determine which are the actual field and record separators as follows.

```
PabloAST * quoted_data = pb.createIntrinsicCall(pablo::Intrinsic::InclusiveSpan, {start_dquote, end_dquote});
PabloAST * unquoted = pb.createNot(quoted_data);
PabloAST * recordMarks = pb.createAnd(LF, unquoted);
PabloAST * fieldMarks = pb.createOr(pb.createAnd(Comma, unquoted), recordMarks);
```
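The effect of masking out quoted data can be illustrated with a scalar model in plain C++. This is a sketch under stated assumptions: `inclusiveSpan` and `findSeparators` are hypothetical helpers on strings of `1` and `.`, and the span is assumed to cover both delimiter positions as well as everything between them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

using BitStream = std::string;  // '1' marks a position, '.' leaves it clear

// Mark every position from each start marker through the next end marker,
// delimiters included. Assumption: this models InclusiveSpan as described.
BitStream inclusiveSpan(const BitStream & starts, const BitStream & ends) {
    assert(starts.size() == ends.size());
    BitStream span(starts.size(), '.');
    bool inside = false;
    for (std::size_t i = 0; i < starts.size(); ++i) {
        if (starts[i] == '1') inside = true;
        if (inside) span[i] = '1';
        if (ends[i] == '1') inside = false;
    }
    return span;
}

struct Separators {
    BitStream fieldMarks;
    BitStream recordMarks;
};

// Commas and newlines count as separators only outside quoted spans.
// Every record separator is also a field separator.
Separators findSeparators(const std::string & text,
                          const BitStream & start_dquote,
                          const BitStream & end_dquote) {
    assert(text.size() == start_dquote.size());
    BitStream quoted = inclusiveSpan(start_dquote, end_dquote);
    Separators s{BitStream(text.size(), '.'), BitStream(text.size(), '.')};
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (quoted[i] == '1') continue;  // literal data character
        if (text[i] == '\n') {
            s.recordMarks[i] = '1';
            s.fieldMarks[i] = '1';
        } else if (text[i] == ',') {
            s.fieldMarks[i] = '1';
        }
    }
    return s;
}
```

For instance, in the two-line text `a,"b,c⏎d",e⏎` the quoted field spans positions 2 through 8, so the comma at position 4 and the newline at position 6 are data. Only positions 1, 9 and 11 should be marked as field separators, and only position 11 as a record separator.
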