cameron · e84e3296

Showing with 106 additions and 0 deletions

--- a/CSVparsing.md
+++ b/CSVparsing.md
+## CSV Parsing
+
+CSV parsing is the process of identifying the beginning and end of each data field in the
+CSV file.  Following Parabix methods, we define bit streams that mark the significant positions.
+
+For each field, let its start position be the position of the first character in the field, and
+let its follow position be the position immediately after the last character of the field.
+For our example, we have the following, where newlines are marked by `⏎`.
+
+```
+Data stream:     Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
+Field starts:    1.........1....1.........1...1........1...............
+Field follows:   .........1....1.........1...1........1...............1
+
+```
+
+We also need the starts and ends of records.
+```
+Data stream:     Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
+Record starts:   1........................1............................
+Record follows:  ........................1............................1
+
+```
+
+Note that if a field is empty, its start and follow positions coincide: both fall on
+the separator that terminates the field.
+
+Rather than calculating and saving these four streams, however, we can define
+two streams to give us all the information we need.   The first stream is the
+field separators and the second stream is the record separators.   Note that
+we include each record separator as a field separator as well.
+
+```
+Data stream:         Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
+Field separators:    .........1....1.........1...1........1...............1
+Record separators:   ........................1............................1
+```
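
As a rough illustration of why the two separator streams are sufficient, the following standalone C++ sketch models each bit stream as a string of `1` and `.` characters, one per input byte. It does not use the Parabix API: `BitString`, `markChar`, `bitOr`, `startsFrom`, `positionsOf` and `findSeparators` are hypothetical helpers introduced only for this example. It also assumes that there are no quoted fields and that the data ends with a record separator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One mark per input byte: '1' for a set bit, '.' for a clear bit.
using BitString = std::string;

// Mark every position of `data` that holds the byte `c`.
BitString markChar(const std::string & data, char c) {
    BitString marks(data.size(), '.');
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] == c) marks[i] = '1';
    }
    return marks;
}

// Position-wise OR of two streams of equal length.
BitString bitOr(const BitString & a, const BitString & b) {
    assert(a.size() == b.size());
    BitString out(a.size(), '.');
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == '1' || b[i] == '1') out[i] = '1';
    }
    return out;
}

// A start position is position 0 plus every position just after a separator.
// The mark that would fall past the end of the data is dropped.
BitString startsFrom(const BitString & separators) {
    if (separators.empty()) return separators;
    return "1" + separators.substr(0, separators.size() - 1);
}

// List the indices of the set bits, which is convenient for checking results.
std::vector<std::size_t> positionsOf(const BitString & s) {
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '1') positions.push_back(i);
    }
    return positions;
}

struct Separators {
    BitString field;   // commas and newlines; doubles as the field follows
    BitString record;  // newlines only; doubles as the record follows
};

Separators findSeparators(const std::string & data) {
    BitString record = markChar(data, '\n');
    BitString field = bitOr(markChar(data, ','), record);
    return {field, record};
}
```

For the example data (with `\n` standing in for `⏎`, and positions counted from 0), `findSeparators` marks positions 9, 14, 24, 28, 37 and 53 as field separators and positions 24 and 53 as record separators. Applying `startsFrom` to each stream recovers the field starts (0, 10, 15, 25, 29, 38) and the record starts (0, 25).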
+
+### Dealing with Double Quotes
+
+A key issue in CSV processing is dealing with double quotes (`"`).   Double quotes are used
+as delimiters at the beginning and end of strings.   In addition, a double quote inside a string
+is represented by a consecutive pair of double quotes (`""`).   Parsing CSV requires that all of
+these double quotes be correctly classified as one of four types.
+ 1.  The start of a double quoted string field.
+ 2.  The end of a double quoted string field.
+ 3.  The first double quote of a pair of consecutive double quotes inside a string.
+ 4.  The second double quote of a pair of consecutive double quotes inside a string.
+
+To parse using Parabix methods, we can rely on some important observations.
+ 1.  A correct CSV file will always have an even number of double quotes.
+ 2.  Any double quotes before the start of a string must be balanced, that is, even in number.
+ 3.  There must be one unmatched double quote prior to any double quote that marks a string end.
+ 4.  A double quote that occurs when there is an unmatched prior double quote pending is normally a string end, unless the following character is another double quote, in which case the pair of consecutive double quotes marks a literal double quote character within the string value.
+
+Let `dquote` be a character class bit stream for double quotes.   Then the Pablo operation `EveryNth` is
+very handy to distinguish between potential starting and ending double quotes.   Using a Pablo builder `pb`, these streams can be created by the following operations.
+```
+    PabloAST * dquote_odd = pb.createEveryNth(dquote, pb.getInteger(2));
+    PabloAST * dquote_even = pb.createXor(dquote, dquote_odd);
+
+```
+The `dquote_odd` stream marks the 1st, 3rd, 5th, ... double quotes in the file, that is, the potential
+starting quotes.
+The `dquote_even` stream marks the 2nd, 4th, 6th, ... double quotes, that is, the potential end quotes.
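
The odd/even split can be tried out without Parabix using a small standalone C++ model in which a bit stream is a string of `1` and `.` characters. This is only a sketch of the behaviour described above, not the Pablo implementation: `BitString`, `markChar`, `everyNth`, `bitXor` and `positionsOf` are hypothetical helpers, and `everyNth` is assumed to keep the 1st, (n+1)th, (2n+1)th, ... marks of its input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One mark per input byte: '1' for a set bit, '.' for a clear bit.
using BitString = std::string;

// Mark every position of `data` that holds the byte `c`.
BitString markChar(const std::string & data, char c) {
    BitString marks(data.size(), '.');
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] == c) marks[i] = '1';
    }
    return marks;
}

// Keep the 1st, (n+1)th, (2n+1)th, ... set bit of `marks` and clear the rest.
BitString everyNth(const BitString & marks, std::size_t n) {
    assert(n > 0);
    BitString out(marks.size(), '.');
    std::size_t seen = 0;  // number of set bits encountered so far
    for (std::size_t i = 0; i < marks.size(); ++i) {
        if (marks[i] == '1') {
            if (seen % n == 0) out[i] = '1';
            ++seen;
        }
    }
    return out;
}

// Position-wise XOR of two streams of equal length.
BitString bitXor(const BitString & a, const BitString & b) {
    assert(a.size() == b.size());
    BitString out(a.size(), '.');
    for (std::size_t i = 0; i < a.size(); ++i) {
        if ((a[i] == '1') != (b[i] == '1')) out[i] = '1';
    }
    return out;
}

// List the indices of the set bits, which is convenient for checking results.
std::vector<std::size_t> positionsOf(const BitString & s) {
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '1') positions.push_back(i);
    }
    return positions;
}
```

For the input `"a","b"`, the double quotes sit at positions 0, 2, 4 and 6 (counting from 0). With `dquote = markChar(data, '"')`, `everyNth(dquote, 2)` keeps positions 0 and 4, and XOR-ing that result with `dquote` leaves positions 2 and 6.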
+
+But we also need to deal with consecutive pairs of double quotes that represent literal double quotes to be included in the text.  Whenever we see a potential end quote, it is not actually a string end if the immediately following
+character is another double quote.
+The Pablo `Lookahead` operation can be used to determine this.
+```
+    PabloAST * quote_escape = pb.createAnd(dquote_even, pb.createLookahead(dquote, 1));
+    PabloAST * escaped_quote = pb.createAdvance(quote_escape, 1);
+    PabloAST * start_dquote = pb.createXor(dquote_odd, escaped_quote);
+    PabloAST * end_dquote = pb.createXor(dquote_even, quote_escape);
+```
+
+For example, let us see which streams are marked for the following CSV text.
+```
+Data stream:     "Free speech",limitation,"Never yell ""Fire!"" in a crowded theatre."
+dquote:          1...........1............1...........11.....11......................1
+dquote_odd:      1........................1............1......1.......................
+dquote_even:     ............1........................1......1.......................1
+quote_escape:    .....................................1......1........................
+escaped_quote:   ......................................1......1.......................
+start_dquote:    1........................1...........................................
+end_dquote:      ............1.......................................................1
+```
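
The stream table above can be reproduced with a standalone C++ model, which may help when experimenting with other inputs. The sketch below is not Parabix code: bit streams are modelled as strings of `1` and `.`, and every helper (`markChar`, `combine`, `bitAnd`, `bitXor`, `lookahead1`, `advance1`, `everyNth`, `positionsOf`, `classifyQuotes`) is a hypothetical stand-in for the corresponding Pablo operation, written under the assumption that those operations behave as described in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using BitString = std::string;  // '1' = set bit, '.' = clear bit

// Mark every position of `data` that holds the byte `c`.
BitString markChar(const std::string & data, char c) {
    BitString marks(data.size(), '.');
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] == c) marks[i] = '1';
    }
    return marks;
}

// Apply a two-input boolean function position by position.
template <typename Op>
BitString combine(const BitString & a, const BitString & b, Op op) {
    assert(a.size() == b.size());
    BitString out(a.size(), '.');
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (op(a[i] == '1', b[i] == '1')) out[i] = '1';
    }
    return out;
}

BitString bitAnd(const BitString & a, const BitString & b) {
    return combine(a, b, [](bool x, bool y) { return x && y; });
}

BitString bitXor(const BitString & a, const BitString & b) {
    return combine(a, b, [](bool x, bool y) { return x != y; });
}

// Bit i of the result is bit i+1 of `s` (stand-in for Lookahead by 1).
BitString lookahead1(const BitString & s) {
    return s.empty() ? s : s.substr(1) + ".";
}

// Bit i of the result is bit i-1 of `s` (stand-in for Advance by 1).
BitString advance1(const BitString & s) {
    return s.empty() ? s : "." + s.substr(0, s.size() - 1);
}

// Keep the 1st, (n+1)th, (2n+1)th, ... set bit of `marks` and clear the rest.
BitString everyNth(const BitString & marks, std::size_t n) {
    assert(n > 0);
    BitString out(marks.size(), '.');
    std::size_t seen = 0;
    for (std::size_t i = 0; i < marks.size(); ++i) {
        if (marks[i] == '1' && seen++ % n == 0) out[i] = '1';
    }
    return out;
}

// List the indices of the set bits, which is convenient for checking results.
std::vector<std::size_t> positionsOf(const BitString & s) {
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '1') positions.push_back(i);
    }
    return positions;
}

struct QuoteClasses {
    BitString start;    // opening quote of a quoted field
    BitString end;      // closing quote of a quoted field
    BitString escape;   // first quote of a "" pair inside a field
    BitString escaped;  // second quote of a "" pair inside a field
};

// Mirrors the four Pablo statements shown above, one line each.
QuoteClasses classifyQuotes(const std::string & data) {
    BitString dquote = markChar(data, '"');
    BitString odd = everyNth(dquote, 2);
    BitString even = bitXor(dquote, odd);
    BitString escape = bitAnd(even, lookahead1(dquote));
    BitString escaped = advance1(escape);
    return {bitXor(odd, escaped), bitXor(even, escape), escape, escaped};
}
```

Run on the example line, `classifyQuotes` marks positions 0 and 25 as string starts, 12 and 68 as string ends, 37 and 44 as the first quote of each pair, and 38 and 45 as the second quote of each pair (positions counted from 0), which agrees with the table.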
+
+### Literal Newlines and Commas 
+
+Whenever we have quoted strings in a CSV file, it is possible that the string may
+have embedded commas or newline characters.   It is important that these embedded
+commas and newlines are treated as data characters and not as field delimiters.
+
+The Pablo intrinsic `InclusiveSpan` is useful for this.   This operation produces a run
+of bits marking all positions from an opening delimiter through its matching closing delimiter.   Thus a comma or newline character occurring inside a span from a starting double quote to its ending double quote will always mark a literal character to be included in the string.
+
+For example, if the `Comma` and `LF` streams are streams identifying all commas and newlines within the CSV file, then we can determine which are the actual field and record separators as follows.
+
+```
+    PabloAST * quoted_data = pb.createIntrinsicCall(pablo::Intrinsic::InclusiveSpan, {start_dquote, end_dquote});
+    PabloAST * unquoted = pb.createNot(quoted_data);
+    PabloAST * recordMarks = pb.createAnd(LF, unquoted);
+    PabloAST * fieldMarks = pb.createOr(pb.createAnd(Comma, unquoted), recordMarks);
+```
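
To see the effect of the span on embedded commas and newlines, here is a standalone C++ sketch of the same four steps. It is an illustration only and does not call Parabix: bit streams are strings of `1` and `.`, and `inclusiveSpan`, `fromPositions`, `positionsOf`, `markChar` and `findMarks` are hypothetical helpers. `inclusiveSpan` is assumed to mark every position from a start mark through the next end mark, both included, and the start and end quote streams are assumed to have been computed already.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using BitString = std::string;  // '1' = set bit, '.' = clear bit

// Mark every position of `data` that holds the byte `c`.
BitString markChar(const std::string & data, char c) {
    BitString marks(data.size(), '.');
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] == c) marks[i] = '1';
    }
    return marks;
}

// Build a stream of the given length with marks at the listed positions.
BitString fromPositions(std::size_t length, const std::vector<std::size_t> & positions) {
    BitString out(length, '.');
    for (std::size_t p : positions) {
        assert(p < length);
        out[p] = '1';
    }
    return out;
}

// List the indices of the set bits, which is convenient for checking results.
std::vector<std::size_t> positionsOf(const BitString & s) {
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '1') positions.push_back(i);
    }
    return positions;
}

// Mark every position from a start mark through the next end mark, inclusive.
BitString inclusiveSpan(const BitString & starts, const BitString & ends) {
    assert(starts.size() == ends.size());
    BitString out(starts.size(), '.');
    bool inside = false;
    for (std::size_t i = 0; i < starts.size(); ++i) {
        if (starts[i] == '1') inside = true;
        if (inside) out[i] = '1';
        if (ends[i] == '1') inside = false;
    }
    return out;
}

struct Marks {
    BitString field;   // unquoted commas plus all record marks
    BitString record;  // unquoted newlines
};

Marks findMarks(const std::string & data,
                const BitString & startDquote, const BitString & endDquote) {
    BitString quoted = inclusiveSpan(startDquote, endDquote);
    BitString comma = markChar(data, ',');
    BitString lf = markChar(data, '\n');
    Marks m{BitString(data.size(), '.'), BitString(data.size(), '.')};
    for (std::size_t i = 0; i < data.size(); ++i) {
        bool unquoted = quoted[i] != '1';
        if (lf[i] == '1' && unquoted) m.record[i] = '1';
        if ((comma[i] == '1' && unquoted) || m.record[i] == '1') m.field[i] = '1';
    }
    return m;
}
```

Take the two-record input `a,"b,c"⏎"d⏎e",f⏎`, whose quoted fields start at positions 2 and 8 and end at positions 6 and 12 (counting from 0). The comma at position 4 and the newline at position 10 fall inside the spans, so `findMarks` reports record marks only at positions 7 and 15 and field marks at positions 1, 7, 13 and 15.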