... | ... | @@ -69,6 +69,68 @@ Record follows: ........................1............................1 |
```
### Dealing with Double Quotes

A key issue in CSV processing is dealing with double quotes (`"`). Double quotes are used
as delimiters at the beginning and end of strings. In addition, a double quote inside a string
is represented by a consecutive pair of double quotes (`""`). Parsing CSV requires that all of
these double quotes be correctly classified as one of four types.
1. The start of a double quoted string field.
2. The end of a double quoted string field.
3. The first double quote of a pair of consecutive double quotes inside a string.
4. The second double quote of a pair of consecutive double quotes inside a string.

To parse using Parabix methods, we can rely on some important observations.
1. A correct CSV file will always have an even number of double quotes.
2. Any double quotes before the start of a string must be balanced, that is, even in number.
3. There must be one unmatched double quote prior to any double quote that marks a string end.
4. A double quote that occurs while an unmatched prior double quote is pending is normally a string end. The exception is when the following character is another double quote: the pair of consecutive double quotes then marks a literal double quote character within the string value.

Let `dquote` be a character class bit stream for double quotes. The Pablo operation `EveryNth` is
then very handy for distinguishing between potential starting and ending double quotes. Using a Pablo builder `pb`, these streams can be created by the following operations.
```
// Every second double quote, starting with the first one.
PabloAST * dquote_odd = pb.createEveryNth(dquote, pb.getInteger(2));
// All remaining double quotes.
PabloAST * dquote_even = pb.createXor(dquote, dquote_odd);
```
The `dquote_odd` stream marks the 1st, 3rd, 5th, ... double quotes in the file, that is, the potential
starting quotes.
The `dquote_even` stream marks the 2nd, 4th, 6th and so on, the potential end quotes.

But we also need to deal with consecutive pairs of double quotes that represent literal double quotes to be included in the text. Whenever we see a potential end quote, it is not actually a string end if the immediately following
character is another double quote.
The Pablo `Lookahead` operation can be used to determine this.
```
PabloAST * quote_escape = pb.createAnd(dquote_even, pb.createLookahead(dquote, 1));
PabloAST * escaped_quote = pb.createAdvance(quote_escape, 1);
PabloAST * start_dquote = pb.createXor(dquote_odd, escaped_quote);
PabloAST * end_dquote = pb.createXor(dquote_even, quote_escape);
```
For example, let us see which streams are marked for the following CSV text.
```
Data stream:   "Free speech",limitation,"Never yell ""Fire!"" in a crowded theatre."
dquote:        1...........1............1...........11.....11......................1
dquote_odd     1........................1............1......1.......................
dquote_even    ............1........................1......1.......................1
quote_escape   .....................................1......1........................
escaped_quote  ......................................1......1.......................
start_dquote   1........................1...........................................
end_dquote     ............1.......................................................1
```
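As a cross-check on the streams above, the same four-way classification can be produced by a plain sequential scan. This is only a reference sketch for comparison, not the Parabix method. `QuoteKind`, `ClassifiedQuote` and `classifyQuotesSequential` are hypothetical names, and the input is assumed to be well-formed CSV. On the example text, the positions reported as `EscapeFirst` and `EscapeSecond` should coincide with the `quote_escape` and `escaped_quote` streams, and `FieldStart` and `FieldEnd` with `start_dquote` and `end_dquote`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// The four categories of double quote listed at the start of this section.
enum class QuoteKind { FieldStart, FieldEnd, EscapeFirst, EscapeSecond };

struct ClassifiedQuote {
    std::size_t position;  // zero-based offset of the quote in the input
    QuoteKind kind;
};

// Sequential reference scan: remember whether an unmatched opening quote
// is pending and peek one character ahead to detect a doubled quote.
std::vector<ClassifiedQuote> classifyQuotesSequential(const std::string & text) {
    std::vector<ClassifiedQuote> result;
    bool inString = false;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] != '"') continue;
        if (!inString) {
            // No unmatched quote is pending, so this quote opens a string.
            result.push_back({i, QuoteKind::FieldStart});
            inString = true;
        } else if (i + 1 < text.size() && text[i + 1] == '"') {
            // A pending quote followed by a doubled quote: a literal quote.
            result.push_back({i, QuoteKind::EscapeFirst});
            result.push_back({i + 1, QuoteKind::EscapeSecond});
            ++i;  // skip the second quote of the pair
        } else {
            // A pending quote and no doubled quote: the string ends here.
            result.push_back({i, QuoteKind::FieldEnd});
            inString = false;
        }
    }
    return result;
}
```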
### Newlines: Record Ends or Literal Newline Characters

A second parsing issue for CSV is to identify and classify newline characters as either
the ends of CSV records or literal newline characters in a string.

The Pablo intrinsic `InclusiveSpan` is useful for this. This operation produces a run
of bits marking all positions from an opening delimiter through its closing delimiter, inclusive. Thus a newline character
occurring inside a span from a starting double quote to its ending double quote always marks a literal newline in the text.
```
pb.createAnd(newline, pb.createIntrinsicCall(pablo::Intrinsic::InclusiveSpan, {start_dquote, end_dquote}))
```
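The effect of the span operation can be illustrated on plain strings. The following is a minimal sketch under stated assumptions, not the Parabix intrinsic: streams are strings of `1` and `.` characters of the same length as the text, start and end markers are properly paired, and `inclusiveSpan` and `literalNewlines` are hypothetical helper names. Newlines that fall outside every span are left unmarked, so they remain candidates for record ends.

```cpp
#include <cstddef>
#include <string>

// Mark every position from each start marker through the next end marker,
// both endpoints included.
std::string inclusiveSpan(const std::string & starts, const std::string & ends) {
    std::string span(starts.size(), '.');
    bool open = false;
    for (std::size_t i = 0; i < starts.size(); ++i) {
        if (starts[i] == '1') open = true;
        if (open) span[i] = '1';
        if (i < ends.size() && ends[i] == '1') open = false;
    }
    return span;
}

// A newline is literal when it lies inside a quoted span: newline AND span.
std::string literalNewlines(const std::string & text,
                            const std::string & start_dquote,
                            const std::string & end_dquote) {
    const std::string span = inclusiveSpan(start_dquote, end_dquote);
    std::string result(text.size(), '.');
    for (std::size_t i = 0; i < text.size() && i < span.size(); ++i) {
        if (text[i] == '\n' && span[i] == '1') result[i] = '1';
    }
    return result;
}
```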
## String Transformation
... | ... | @@ -79,4 +141,3 @@ the required JSON template strings. There are two steps. |
These two steps will be implemented using the `SpreadByMask` and `StringReplaceKernel` of the Parabix
infrastructure. More information on this process will be provided later.