CSVparsing · parabix-devel wiki · authored Nov 10, 2021 by cameron
## CSV Parsing
CSV parsing is the process of identifying the beginning and ending of each data field in the
CSV file. Following Parabix methods, we define bit streams for significant positions.
For each field, let its start position be the position of the first character in the field, and
let its follow position be the position immediately after the last character of the field.
For our example, we have the following, where newlines are marked by `⏎`.
```
Data stream:    Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Field starts:   1.........1....1.........1...1........1...............
Field follows:  .........1....1.........1...1........1...............1
```
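As a cross-check, these marker streams can be computed by a short standalone simulation. This is plain C++ over `.`/`1` marker strings in the notation of the example, not the Parabix bit stream machinery; the separator set (comma and newline, with `⏎` standing for `\n`) is taken from the example.

```cpp
#include <cassert>
#include <string>

// Plain C++ simulation (not Parabix) of the field start/follow streams.
// A field starts at position 0 and immediately after each separator;
// a field follow sits on the separator itself.
static std::string fieldStarts(const std::string &data) {
    std::string marks(data.size(), '.');
    bool atFieldStart = true;  // the file begins with a field
    for (size_t i = 0; i < data.size(); ++i) {
        if (atFieldStart) marks[i] = '1';
        atFieldStart = (data[i] == ',' || data[i] == '\n');
    }
    return marks;
}

static std::string fieldFollows(const std::string &data) {
    std::string marks(data.size(), '.');
    for (size_t i = 0; i < data.size(); ++i) {
        if (data[i] == ',' || data[i] == '\n') marks[i] = '1';
    }
    return marks;
}
```

Note that for an empty field, `fieldStarts` marks the separator position itself, matching the empty-field behaviour described below.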
We also need the starts and ends of records.
```
Data stream:     Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Record starts:   1........................1............................
Record follows:  ........................1............................1
```
Note that if a field is empty, its start and its follow both fall on the
same separator.
Rather than calculating and saving these four streams, however, we can define
two streams to give us all the information we need. The first stream is the
field separators and the second stream is the record separators. Note that
we include each record separator as a field separator as well.
```
Data stream:        Henderson,Paul,ph@sfu.ca⏎Lin,Qingshan,1234@zju.edu.cn⏎
Field separators:   .........1....1.........1...1........1...............1
Record separators:  ........................1............................1
```
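The four start/follow streams are recoverable from these two. For example, a field start is simply a field separator shifted forward one position (the Pablo `Advance` operation), plus a mark at position 0 for the first field. A plain C++ sketch of that recovery (a marker-string simulation, not the Pablo API):

```cpp
#include <cassert>
#include <string>

// Models Advance(s, 1): shift every mark one position to the right;
// a mark at the final position falls off the end.
static std::string advanceOne(const std::string &s) {
    std::string out(s.size(), '.');
    for (size_t i = 1; i < s.size(); ++i) out[i] = s[i - 1];
    return out;
}

// Field starts = field separators advanced by one, plus the first field,
// which begins at position 0.
static std::string fieldStartsFromSeparators(const std::string &fieldSeps) {
    std::string starts = advanceOne(fieldSeps);
    if (!starts.empty()) starts[0] = '1';
    return starts;
}
```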
### Dealing with Double Quotes
A key issue in CSV processing is dealing with double quotes (`"`). Double quotes are used
as delimiters at the beginning and end of strings. In addition, a double quote inside a string
is represented by a consecutive pair of double quotes (`""`). Parsing CSV requires that all of
these double quotes be correctly classified as one of four types.
1. The start of a double quoted string field.
2. The end of a double quoted string field.
3. The first double quote of a pair of consecutive double quotes inside a string.
4. The second double quote of a pair of consecutive double quotes inside a string.
To parse using Parabix methods, we can rely on some important observations.
1. A correct CSV file will always have an even number of double quotes.
2. Any double quotes before the start of a string must be balanced, that is, even in number.
3. There must be one unmatched double quote pending prior to any double quote that marks a string end.
4. A double quote that occurs while an unmatched prior double quote is pending is normally a string end, unless the following character is another double quote, in which case the pair of consecutive double quotes marks a literal double quote character within the string value.
Let `dquote` be a character class bit stream for double quotes. Then the Pablo operation `EveryNth` is
very handy to distinguish between potential starting and ending double quotes. Using a Pablo builder `pb`, these streams can be created by the following operations.
```
PabloAST * dquote_odd = pb.createEveryNth(dquote, pb.getInteger(2));
PabloAST * dquote_even = pb.createXor(dquote, dquote_odd);
```
The `dquote_odd` stream marks the 1st, 3rd, 5th, ... double quote in the file, that is, the potential
starting quotes.
The `dquote_even` stream marks the 2nd, 4th, 6th, and so on, the potential ending quotes.
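The effect of `EveryNth` can be checked with a small marker-string simulation. This is plain C++, not the Pablo API; `EveryNth(s, 2)` is assumed to keep the 1st, 3rd, 5th, ... set bits, as the streams described here imply.

```cpp
#include <cassert>
#include <string>

// Plain C++ simulation (not Pablo) of EveryNth: keep the 1st set bit and
// every nth set bit thereafter.
static std::string everyNth(const std::string &s, int n) {
    std::string out(s.size(), '.');
    int count = 0;
    for (size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '1' && count++ % n == 0) out[i] = '1';
    }
    return out;
}

// Positionwise Xor of two marker streams: dquote_even is the Xor of
// dquote with dquote_odd.
static std::string xorStreams(const std::string &a, const std::string &b) {
    std::string out(a.size(), '.');
    for (size_t i = 0; i < a.size(); ++i) {
        out[i] = ((a[i] == '1') != (b[i] == '1')) ? '1' : '.';
    }
    return out;
}
```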
But we must also deal with consecutive pairs of double quotes that represent literal double quotes to be included in the text. A potential ending quote is not actually a string end if the immediately following
character is another double quote.
The Pablo `Lookahead` operation can be used to determine this.
```
PabloAST * quote_escape = pb.createAnd(dquote_even, pb.createLookahead(dquote, 1));
PabloAST * escaped_quote = pb.createAdvance(quote_escape, 1);
PabloAST * start_dquote = pb.createXor(dquote_odd, escaped_quote);
PabloAST * end_dquote = pb.createXor(dquote_even, quote_escape);
```
For example, let us see how these streams mark the following CSV text.
```
Data stream:   "Free speech",limitation,"Never yell ""Fire!"" in a crowded theatre."
dquote:        1...........1............1...........11.....11......................1
dquote_odd:    1........................1............1......1.......................
dquote_even:   ............1........................1......1.......................1
quote_escape:  .....................................1......1........................
escaped_quote: ......................................1......1.......................
start_dquote:  1........................1...........................................
end_dquote:    ............1.......................................................1
```
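This classification can be reproduced with a marker-string simulation (plain C++, not Pablo; `Lookahead` and `Advance` are modeled by index arithmetic):

```cpp
#include <cassert>
#include <string>

// Plain C++ simulation (not Pablo) of the quote classification.  An
// even-numbered quote followed immediately by another double quote is a
// quote_escape; the quote after it is the escaped_quote; removing those
// two classes from the odd/even streams leaves the true starts and ends.
struct QuoteClasses {
    std::string quote_escape, escaped_quote, start_dquote, end_dquote;
};

static QuoteClasses classifyQuotes(const std::string &dquote,
                                   const std::string &dquote_odd,
                                   const std::string &dquote_even) {
    const size_t n = dquote.size();
    QuoteClasses q{std::string(n, '.'), std::string(n, '.'),
                   dquote_odd, dquote_even};
    for (size_t i = 0; i + 1 < n; ++i) {
        // Lookahead(dquote, 1): is the next position also a double quote?
        if (dquote_even[i] == '1' && dquote[i + 1] == '1') {
            q.quote_escape[i] = '1';
            q.escaped_quote[i + 1] = '1';  // Advance(quote_escape, 1)
            q.start_dquote[i + 1] = '.';   // Xor(dquote_odd, escaped_quote)
            q.end_dquote[i] = '.';         // Xor(dquote_even, quote_escape)
        }
    }
    return q;
}
```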
### Literal Newlines and Commas
Whenever we have quoted strings in a CSV file, it is possible that the string may
have embedded commas or newline characters. It is important that these embedded
commas and newlines are treated as data characters and not as field delimiters.
The Pablo intrinsic `InclusiveSpan` is useful for this. This operation produces a run
of bits marking all positions between opening and closing delimiters. Thus a comma or newline character occurring inside a span from a starting dquote to an ending dquote always marks a literal character to be included in the string.
For example, if the `Comma` and `LF` streams are streams identifying all commas and newlines within the CSV file, then we can determine which are the actual field and record separators as follows.
```
PabloAST * quoted_data = pb.createIntrinsicCall(pablo::Intrinsic::InclusiveSpan, {start_dquote, end_dquote});
PabloAST * unquoted = pb.createNot(quoted_data);
PabloAST * recordMarks = pb.createAnd(LF, unquoted);
PabloAST * fieldMarks = pb.createOr(pb.createAnd(Comma, unquoted), recordMarks);
```
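A marker-string simulation of this last step (plain C++, not Pablo; `inclusiveSpan` models the intrinsic by filling every position from each start quote through its matching end quote):

```cpp
#include <cassert>
#include <string>

// Plain C++ simulation (not Pablo) of InclusiveSpan and the final separator
// computation: commas and newlines inside a quoted span are data characters,
// so only those outside any span separate fields or records.
static std::string inclusiveSpan(const std::string &starts,
                                 const std::string &ends) {
    std::string out(starts.size(), '.');
    bool inSpan = false;
    for (size_t i = 0; i < starts.size(); ++i) {
        if (starts[i] == '1') inSpan = true;
        if (inSpan) out[i] = '1';
        if (ends[i] == '1') inSpan = false;
    }
    return out;
}

static std::string separatorsOutside(const std::string &data,
                                     const std::string &quoted) {
    std::string out(data.size(), '.');
    for (size_t i = 0; i < data.size(); ++i) {
        if ((data[i] == ',' || data[i] == '\n') && quoted[i] != '1')
            out[i] = '1';
    }
    return out;
}
```

For the input `"a,b",c⏎`, the comma inside the quoted span is ignored, and only the comma after the closing quote and the final newline are marked as separators.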