Skip to content

GitLab

  • Projects
  • Groups
  • Snippets
  • Help
    • Loading...
  • Help
    • Help
    • Support
    • Community forum
    • Submit feedback
    • Contribute to GitLab
  • Sign in
P parabix-devel
  • Project overview
    • Project overview
    • Details
    • Activity
    • Releases
  • Repository
    • Repository
    • Files
    • Commits
    • Branches
    • Tags
    • Contributors
    • Graph
    • Compare
  • Issues 9
    • Issues 9
    • List
    • Boards
    • Labels
    • Service Desk
    • Milestones
  • Merge requests 2
    • Merge requests 2
  • CI/CD
    • CI/CD
    • Pipelines
    • Jobs
    • Schedules
  • Operations
    • Operations
    • Incidents
    • Environments
  • Packages & Registries
    • Packages & Registries
    • Container Registry
  • Analytics
    • Analytics
    • CI/CD
    • Repository
    • Value Stream
  • Wiki
    • Wiki
  • Snippets
    • Snippets
  • Members
    • Members
  • Activity
  • Graph
  • Create a new issue
  • Jobs
  • Commits
  • Issue Boards
Collapse sidebar
  • cameron
  • parabix-devel
  • Wiki
  • CSV Validation

Last edited by cameron Nov 30, 2022
Page history

CSV Validation

CSV Validation

Consider the problem of validating CSV rows, assuming that we have a set of regular expressions for the validating each field of the row.

Suppose we have RE1, RE2 and RE3 as the three regular expressions for validating the content of the three fields of each row. For example:

RE1:[0-9]+
RE2:[A-Za-z0-0;., ]*
RE3:[A-Z][0-9][A-Z] [0-9][A-Z][0-9]

A first attempt might be to create a regular expression combining these three pieces with commas in between.

[0-9]+,[A-Za-z0-0;., ]*,[A-Z][0-9][A-Z] [0-9][A-Z][0-9]

But there will be some issues with this.

  1. The second field allows commas, which could only be correct when the field is quoted.

We could try

[0-9]+,"[A-Za-z0-0;., ]*",[A-Z][0-9][A-Z] [0-9][A-Z][0-9]

But, in this case, a entry in the second field without quotes would be excluded even if it does not have a comma.

Then, you can try:

[0-9]+,(?:"[A-Za-z0-0;., ]*"|[A-Za-z0-0;. ]*),[A-Z][0-9][A-Z] [0-9][A-Z][0-9]

This would work in this case. Other cases may involve new lines within fields and possibly quotes. Those would also require RE modifications.

In general, each field will require two REs, one with quotes and one without. The quoted REs may allow commas, newlines and quotes inside.

For the nonquoted RE for a column, we can only allow the field to contain characters that exclude comma, quote and newline. There is a function called excludeCC in the RE library that would the simplified RE in this case.

excludeCC([A-Za-z0-0;., ],[,\n"])

In the case that quotes are possible in the RE for a column, then the RE must be modified to double up those quotes.

If a comma, newline or quote is required, then only the quoted form of RE would be allowed.

You might also want to allow for leading whitespace, perhaps depending on command parameters.

You would also need to anchor the match for the full contents of the row.

^[0-9]+,(?:"[A-Za-z0-0;., ]*"|[A-Za-z0-0;. ]*),[A-Z][0-9][A-Z] [0-9][A-Z][0-9]$

This is not a complete solution, but illustrates some of the issues to be considered and possible approaches.

Clone repository
  • Bracket Matching
  • CSV Validation
  • CSVediting
  • CSVparsing
  • Character Code Compilers
  • KernelLibrary
  • Pablo
  • ParabixTransform
  • Parallel Deletion
  • Parallel Hashing
  • Performance Testing Script
  • Shuffle Pattern Library
  • StaticCCC
  • String Insertion
  • UCD: Unicode Property Database and Compilers
View All Pages