|
|
|
# CSV Validation
|
|
|
|
|
|
|
|
Consider the problem of validating CSV rows, assuming that we
|
|
|
|
have a set of regular expressions for the validating each field of the row.
|
|
|
|
|
|
|
|
Suppose we have RE1, RE2 and RE3 as the three regular expressions for
|
|
|
|
validating the content of the three fields of each row.
|
|
|
|
For example:
|
|
|
|
```
|
|
|
|
RE1:[0-9]+
|
|
|
|
RE2:[A-Za-z0-0;., ]*
|
|
|
|
RE3:[A-Z][0-9][A-Z] [0-9][A-Z][0-9]
|
|
|
|
```
|
|
|
|
|
|
|
|
A first attempt might be to create a regular expression combining these three pieces with commas in between.
|
|
|
|
|
|
|
|
```
|
|
|
|
[0-9]+,[A-Za-z0-0;., ]*,[A-Z][0-9][A-Z] [0-9][A-Z][0-9]
|
|
|
|
```
|
|
|
|
|
|
|
|
But there will be some issues with this.
|
|
|
|
1. The second field allows commas, which could only be correct when the field is quoted.
|
|
|
|
|
|
|
|
We could try
|
|
|
|
|
|
|
|
```
|
|
|
|
[0-9]+,"[A-Za-z0-0;., ]*",[A-Z][0-9][A-Z] [0-9][A-Z][0-9]
|
|
|
|
```
|
|
|
|
|
|
|
|
But, in this case, a entry in the second field without quotes would be
|
|
|
|
excluded even if it does not have a comma.
|
|
|
|
|
|
|
|
Then, you can try:
|
|
|
|
|
|
|
|
```
|
|
|
|
[0-9]+,(?:"[A-Za-z0-0;., ]*"|[A-Za-z0-0;. ]*),[A-Z][0-9][A-Z] [0-9][A-Z][0-9]
|
|
|
|
```
|
|
|
|
|
|
|
|
This would work in this case. Other cases may involve new lines within fields and possibly quotes. Those would also require RE modifications.
|
|
|
|
|
|
|
|
In general, each field will require two REs, one with quotes and one without. The quoted REs may allow commas, newlines and quotes inside.
|
|
|
|
|
|
|
|
For the nonquoted RE for a column, we can only allow the field to contain
|
|
|
|
characters that exclude comma, quote and newline. There is a function
|
|
|
|
called ```excludeCC``` in the RE library that would the simplified RE in this case.
|
|
|
|
|
|
|
|
```
|
|
|
|
excludeCC([A-Za-z0-0;., ],[,\n"])
|
|
|
|
```
|
|
|
|
|
|
|
|
In the case that quotes are possible in the RE for a column, then the RE must be modified to double up those quotes.
|
|
|
|
|
|
|
|
If a comma, newline or quote is required, then only the quoted form of RE would be allowed.
|
|
|
|
|
|
|
|
You might also want to allow for leading whitespace, perhaps depending on command parameters.
|
|
|
|
|
|
|
|
You would also need to anchor the match for the full contents of the row.
|
|
|
|
|
|
|
|
```
|
|
|
|
^[0-9]+,(?:"[A-Za-z0-0;., ]*"|[A-Za-z0-0;. ]*),[A-Z][0-9][A-Z] [0-9][A-Z][0-9]$
|
|
|
|
```
|
|
|
|
|
|
|
|
This is not a complete solution, but illustrates some of the issues to be considered and possible approaches. |