cameron · 5121238e

Showing with 62 additions and 1 deletion

csv2json.md +62 -1

--- a/csv2json.md
+++ b/csv2json.md
@@ -69,6 +69,68 @@ Record follows:  ........................1............................1
 ```
+### Dealing with Double Quotes
+A key issue in CSV processing is dealing with double quotes (`"`).   Double quotes are used
+as delimiters at the beginning and end of strings.   In addition, a double quote inside a string
+is represented by a consecutive pair of double quotes (`""`).   Parsing CSV requires that all of
+these double quotes be correctly classified as one of four types.
+ 1.  The start of a double quoted string field.
+ 2.  The end of a double quoted string field.
+ 3.  The first double quote of a pair of consecutive double quotes inside a string.
+ 4.  The second double quote of a pair of consecutive double quotes inside a string.
+To parse using Parabix methods, we can rely on some important observations.
+ 1.  A correct CSV file will always have an even number of double quotes.
+ 2.  Any double quotes before the start of a string must be balanced, that is, even in number.
+ 3.  There must be one unmatched double quote prior to any double quote that marks a string end.
+ 4.  A double quote that occurs when there is an unmatched prior double quote pending is normally a string end, unless the following character is another double quote, in which case the pair of consecutive double quotes marks a literal double quote character within the string value.
+Let `dquote` be a character class bit stream for double quotes.   Then the Pablo operation `EveryNth` is
+very handy to distinguish between potential starting and ending double quotes.   Using a Pablo builder `pb`, these streams can be created by the following operations.
+```
+    PabloAST * dquote_odd = pb.createEveryNth(dquote, pb.getInteger(2));
+    PabloAST * dquote_even = pb.createXor(dquote, dquote_odd);
+```
+The `dquote_odd` stream will refer to the 1st, 3rd, 5th, ... double quote in the file, that is the potential
+starting quotes.
+The `dquote_even` stream refers to the 2nd, 4th, 6th and so on, the potential end quotes.
+But we also need to deal with consecutive pairs of double quotes that represent literal double quotes to be included in the text.  Whenever we see a potential end quote, it is not actually a string end if the immediately following
+character is another double quote.
+The Pablo `Lookahead` operation can be used to determine this.
+```
+    PabloAST * quote_escape = pb.createAnd(dquote_even, pb.createLookahead(dquote, 1));
+    PabloAST * escaped_quote = pb.createAdvance(quote_escape, 1);
+    PabloAST * start_dquote = pb.createXor(dquote_odd, escaped_quote);
+    PabloAST * end_dquote = pb.createXor(dquote_even, quote_escape);
+```
+For example, let us see which streams are marked for the following CSV text.
+```
+Data stream:     "Free speech",limitation,"Never yell ""Fire!"" in a crowded theatre."
+dquote:          1...........1............1...........11.....11......................1
+dquote_odd       1........................1............1......1.......................
+dquote_even      ............1........................1......1.......................1
+quote_escape     .....................................1......1........................
+escaped_quote    ......................................1......1.......................
+start_dquote     1........................1...........................................
+end_dquote       ............1.......................................................1
+```
+### Newlines: Record Ends or Literal Newline Characters
+A second parsing issue for CSV is to identify and classify newline characters as either
+the ends of CSV records or literal newline characters in a string.
+The Pablo intrinsic ``InclusiveSpan`` is useful for this.   This operation produces a run
+of bits marking all positions from an opening delimiter through its closing delimiter.   Thus a newline character
+occurring inside a span from a starting double quote to its ending double quote will always mark a literal newline in text.
+```
+    PabloAST * literal_newline = pb.createAnd(newline, pb.createIntrinsicCall(pablo::Intrinsic::InclusiveSpan, {start_dquote, end_dquote}));
+```
 ## String Transformation
@@ -79,4 +141,3 @@ the required JSON template strings.   There are two steps.
 These two steps will be implemented using the `SpreadByMask` and `StringReplaceKernel` of the Parabix
 infrastructure.   More information on this process will be provided later.