csv2json · Changes

Update csv2json authored Apr 21, 2020 by cameron
Showing with 62 additions and 1 deletion
csv2json.md
View page @ 5121238e
@@ -69,6 +69,68 @@ Record follows: ........................1............................1
``` ```
### Dealing with Double Quotes

A key issue in CSV processing is dealing with double quotes (`"`). Double quotes are used
as delimiters at the beginning and end of strings. In addition, a double quote inside a string
is represented by a consecutive pair of double quotes (`""`). Parsing CSV requires that each
double quote be correctly classified as one of four types.

1. The start of a double-quoted string field.
2. The end of a double-quoted string field.
3. The first double quote of a pair of consecutive double quotes inside a string.
4. The second double quote of a pair of consecutive double quotes inside a string.

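
As a point of reference for the bit-parallel approach, the four-way classification can also be written as a conventional character-at-a-time scanner. The following is a minimal C++ sketch, not part of the Parabix code base: the names `QuoteKind` and `classifyQuotes` are hypothetical, and the input is assumed to be well-formed CSV.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// The four double quote types listed above, plus None for every other character.
enum class QuoteKind { None, StringStart, StringEnd, EscapeFirst, EscapeSecond };

// Sequential reference classifier (hypothetical helper, assumes well-formed CSV).
std::vector<QuoteKind> classifyQuotes(const std::string & text) {
    std::vector<QuoteKind> kinds(text.size(), QuoteKind::None);
    bool inString = false;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] != '"') continue;
        if (!inString) {
            // No unmatched quote is pending, so this quote opens a string.
            kinds[i] = QuoteKind::StringStart;
            inString = true;
        } else if (i + 1 < text.size() && text[i + 1] == '"') {
            // A quote inside a string followed by another quote: a literal quote pair.
            kinds[i] = QuoteKind::EscapeFirst;
            kinds[i + 1] = QuoteKind::EscapeSecond;
            ++i;  // skip the second quote of the pair
        } else {
            kinds[i] = QuoteKind::StringEnd;
            inString = false;
        }
    }
    // Well-formed input leaves no string open at the end.
    assert(!inString);
    return kinds;
}
```

For the input `"a""b",c`, this sketch labels position 0 as a string start, positions 2 and 3 as the two halves of a literal quote pair, and position 5 as a string end. The loop carries the `inString` state from one character to the next, which is the sequential dependency that the bit stream formulation avoids.
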
To parse using Parabix methods, we can rely on some important observations.

1. A correct CSV file will always have an even number of double quotes.
2. Any double quotes before the start of a string must be balanced, that is, even in number.
3. There must be one unmatched double quote prior to any double quote that marks a string end.
4. A double quote that occurs while a prior double quote is still unmatched is normally a string end, unless the following character is another double quote, in which case the pair of consecutive double quotes marks a literal double quote character within the string value.

Let `dquote` be a character class bit stream for double quotes. Then the Pablo operation `EveryNth` is a convenient way to distinguish between potential starting and ending double quotes. Using a Pablo builder `pb`, these streams can be created by the following operations.
```
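// Potential starting quotes: the 1st, 3rd, 5th, ... double quote.
PabloAST * dquote_odd = pb.createEveryNth(dquote, pb.getInteger(2));
// Potential ending quotes: the remaining (2nd, 4th, 6th, ...) double quotes.
PabloAST * dquote_even = pb.createXor(dquote, dquote_odd);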
```
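
To make the effect of these two operations concrete, here is a small standalone C++ simulation that models bit streams as strings of `1` and `.` characters. It does not use the Parabix API: `Stream`, `markChar`, `everyNth`, and `bitXor` are hypothetical helpers, and `everyNth` assumes that `EveryNth` with an argument of 2 keeps the 1st, 3rd, 5th, ... marked positions, as this page describes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// A bit stream modelled as text: '1' is a set bit, '.' is a clear bit.
using Stream = std::string;

// Character class stream: set a bit wherever the text holds the character c.
Stream markChar(const std::string & text, char c) {
    Stream s(text.size(), '.');
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == c) s[i] = '1';
    }
    return s;
}

// Assumed EveryNth behaviour: keep the 1st, (n+1)th, (2n+1)th, ... set bit.
Stream everyNth(const Stream & marks, std::size_t n) {
    assert(n > 0);
    Stream out(marks.size(), '.');
    std::size_t seen = 0;  // number of set bits encountered so far
    for (std::size_t i = 0; i < marks.size(); ++i) {
        if (marks[i] != '1') continue;
        if (seen % n == 0) out[i] = '1';
        ++seen;
    }
    return out;
}

// Bitwise XOR of two streams of equal length.
Stream bitXor(const Stream & a, const Stream & b) {
    assert(a.size() == b.size());
    Stream out(a.size(), '.');
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) out[i] = '1';
    }
    return out;
}
```

For the text `"a","b"`, `markChar(text, '"')` gives `1.1.1.1`, `everyNth` of that stream with `n = 2` gives `1...1..`, and the XOR of the two gives `..1...1`. The sequential loops only illustrate the meaning of each stream; they say nothing about how Parabix computes them.
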
The `dquote_odd` stream marks the 1st, 3rd, 5th, ... double quotes in the file, that is, the potential
starting quotes. The `dquote_even` stream marks the 2nd, 4th, 6th, ... double quotes, the potential end quotes.

But we also need to deal with consecutive pairs of double quotes that represent literal double quotes to be included in the text. A potential end quote is not actually a string end if the immediately following
character is another double quote. The second quote of such a pair always falls among the potential starting quotes, so it must likewise be excluded from the true string starts. The Pablo `Lookahead` operation can be used to determine this.
```
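// A potential end quote immediately followed by another double quote opens an escaped pair.
PabloAST * quote_escape = pb.createAnd(dquote_even, pb.createLookahead(dquote, 1));
// The second double quote of each escaped pair.
PabloAST * escaped_quote = pb.createAdvance(quote_escape, 1);
// True string starts and ends: drop the quotes that belong to escaped pairs.
PabloAST * start_dquote = pb.createXor(dquote_odd, escaped_quote);
PabloAST * end_dquote = pb.createXor(dquote_even, quote_escape);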
```
For example, let us see which positions are marked in each stream for the following CSV text.
```
Data stream:   "Free speech",limitation,"Never yell ""Fire!"" in a crowded theatre."
dquote:        1...........1............1...........11.....11......................1
dquote_odd:    1........................1............1......1.......................
dquote_even:   ............1........................1......1.......................1
quote_escape:  .....................................1......1........................
escaped_quote: ......................................1......1.......................
start_dquote:  1........................1...........................................
end_dquote:    ............1.......................................................1
```
### Newlines: Record Ends or Literal Newline Characters

A second parsing issue for CSV is to identify and classify newline characters as either
the ends of CSV records or literal newline characters in a string.
The Pablo intrinsic `InclusiveSpan` is useful for this. This operation produces a run
of bits marking all positions from each opening delimiter through its closing delimiter. Thus a newline character
occurring inside a span from a starting dquote to an ending dquote will always mark a literal newline in text, and any newline outside such a span marks the end of a record.
```
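// Newlines inside a quoted string; the variable name literal_newline is illustrative.
PabloAST * literal_newline = pb.createAnd(newline, pb.createIntrinsicCall(pablo::Intrinsic::InclusiveSpan, {start_dquote, end_dquote}));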
```
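
The newline classification can be illustrated with a standalone C++ simulation in the same spirit. This is a sketch, not the Parabix implementation: streams are modelled as strings of `1` and `.`, the helpers `markChar`, `inclusiveSpan`, `literalNewlines`, and `recordEnds` are hypothetical, and `inclusiveSpan` assumes the intrinsic marks every position from each start marker through the next end marker, both included.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

using Stream = std::string;  // '1' is a set bit, '.' is a clear bit

Stream markChar(const std::string & text, char c) {
    Stream s(text.size(), '.');
    for (std::size_t i = 0; i < text.size(); ++i) if (text[i] == c) s[i] = '1';
    return s;
}

// Assumed InclusiveSpan behaviour: set every bit from each start marker
// through the next end marker, including both markers.
Stream inclusiveSpan(const Stream & starts, const Stream & ends) {
    assert(starts.size() == ends.size());
    Stream out(starts.size(), '.');
    bool inside = false;
    for (std::size_t i = 0; i < starts.size(); ++i) {
        if (starts[i] == '1') inside = true;
        if (inside) out[i] = '1';
        if (ends[i] == '1') inside = false;
    }
    return out;
}

// Newlines that fall inside a quoted string are literal characters.
Stream literalNewlines(const std::string & text, const Stream & start_dquote, const Stream & end_dquote) {
    assert(text.size() == start_dquote.size());
    Stream newline = markChar(text, '\n');
    Stream span = inclusiveSpan(start_dquote, end_dquote);
    Stream out(text.size(), '.');
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (newline[i] == '1' && span[i] == '1') out[i] = '1';
    }
    return out;
}

// All remaining newlines terminate a CSV record.
Stream recordEnds(const std::string & text, const Stream & start_dquote, const Stream & end_dquote) {
    assert(text.size() == start_dquote.size());
    Stream newline = markChar(text, '\n');
    Stream span = inclusiveSpan(start_dquote, end_dquote);
    Stream out(text.size(), '.');
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (newline[i] == '1' && span[i] != '1') out[i] = '1';
    }
    return out;
}
```

For a ten-character input consisting of `a,"x`, a newline, `y"`, a newline, `b`, and a final newline, with `start_dquote` set at position 2 and `end_dquote` at position 6, the span covers positions 2 through 6. The newline at position 4 is therefore literal, and the newlines at positions 7 and 9 are record ends.
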
## String Transformation
@@ -79,4 +141,3 @@ the required JSON template strings. There are two steps.
These two steps will be implemented using the `SpreadByMask` and `StringReplaceKernel` of the Parabix
infrastructure. More information on this process will be provided later.