SQL Server Bulk insert of CSV file with inconsistent quotes
Is it possible to BULK INSERT (SQL Server) a CSV file in which the fields are only OCCASSIONALLY surrounded by quotes? Specifically, quotes only surround those fields that contain a ",".
In other words, I have data that looks like this (the first row contain headers):
id, company, rep, employees
729216,INGRAM MICRO INC.,"Stuart, Becky",523
729235,"GREAT PLAINS ENERGY, INC.","Nelson, Beena",114
721177,GEORGE WESTON BAKERIES INC,"Hogan, Meg",253
Because the quotes aren't consistent, I can't use '","' as a delimiter, and I don't know how to create a format file that accounts for this.
I tried using ',' as a delimter and loading it into a temporary table where every column is a varchar, then using some kludgy processing to strip out the quotes, but that doesn't work either, because the fields that contain ',' are split into multiple columns.
Unfortunately, I don't have the ability to manipulate the CSV file beforehand.
Is this hopeless?
Many thanks in advance for any advice.
By the way, i saw this post SQL bulk import from csv, but in that case, EVERY field was consistently wrapped in quotes. So, in that case, he could use ',' as a delimiter, then strip out the quotes afterwards.
Solution 1:
It isn't possible to do a bulk insert for this file, from MSDN:
To be usable as a data file for bulk import, a CSV file must comply with the following restrictions:
- Data fields never contain the field terminator.
- Either none or all of the values in a data field are enclosed in quotation marks ("").
(http://msdn.microsoft.com/en-us/library/ms188609.aspx)
Some simple text processing should be all that's required to get the file ready for import. Alternatively your users could be required to either format the file according to the se guidelines or use something other than a comma as a delimiter (e.g |)
Solution 2:
You are going to need to preprocess the file, period.
If you really really need to do this, here is the code. I wrote this because I absolutely had no choice. It is utility code and I'm not proud of it, but it works. The approach is not to get SQL to understand quoted fields, but instead manipulate the file to use an entirely different delimiter.
EDIT: Here is the code in a github repo. It's been improved and now comes with unit tests! https://github.com/chrisclark/Redelim-it
This function takes an input file and will replace all field-delimiting commas (NOT commas inside quoted-text fields, just the actual delimiting ones) with a new delimiter. You can then tell sql server to use the new field delimiter instead of a comma. In the version of the function here, the placeholder is <TMP> (I feel confident this will not appear in the original csv - if it does, brace for explosions).
Therefore after running this function you import in sql by doing something like:
BULK INSERT MyTable
FROM 'C:\FileCreatedFromThisFunction.csv'
WITH
(
FIELDTERMINATOR = '<*TMP*>',
ROWTERMINATOR = '\n'
)
And without further ado, the terrible, awful function that I apologize in advance for inflicting on you (edit - I've posted a working program that does this instead of just the function on my blog here):
Private Function CsvToOtherDelimiter(ByVal InputFile As String, ByVal OutputFile As String) As Integer
Dim PH1 As String = "<*TMP*>"
Dim objReader As StreamReader = Nothing
Dim count As Integer = 0 'This will also serve as a primary key'
Dim sb As New System.Text.StringBuilder
Try
objReader = New StreamReader(File.OpenRead(InputFile), System.Text.Encoding.Default)
Catch ex As Exception
UpdateStatus(ex.Message)
End Try
If objReader Is Nothing Then
UpdateStatus("Invalid file: " & InputFile)
count = -1
Exit Function
End If
'grab the first line
Dim line = reader.ReadLine()
'and advance to the next line b/c the first line is column headings
If hasHeaders Then
line = Trim(reader.ReadLine)
End If
While Not String.IsNullOrEmpty(line) 'loop through each line
count += 1
'Replace commas with our custom-made delimiter
line = line.Replace(",", ph1)
'Find a quoted part of the line, which could legitimately contain commas.
'In that case we will need to identify the quoted section and swap commas back in for our custom placeholder.
Dim starti = line.IndexOf(ph1 & """", 0)
If line.IndexOf("""",0) = 0 then starti=0
While starti > -1 'loop through quoted fields
Dim FieldTerminatorFound As Boolean = False
'Find end quote token (originally a ",)
Dim endi As Integer = line.IndexOf("""" & ph1, starti)
If endi < 0 Then
FieldTerminatorFound = True
If endi < 0 Then endi = line.Length - 1
End If
While Not FieldTerminatorFound
'Find any more quotes that are part of that sequence, if any
Dim backChar As String = """" 'thats one quote
Dim quoteCount = 0
While backChar = """"
quoteCount += 1
backChar = line.Chars(endi - quoteCount)
End While
If quoteCount Mod 2 = 1 Then 'odd number of quotes. real field terminator
FieldTerminatorFound = True
Else 'keep looking
endi = line.IndexOf("""" & ph1, endi + 1)
End If
End While
'Grab the quoted field from the line, now that we have the start and ending indices
Dim source = line.Substring(starti + ph1.Length, endi - starti - ph1.Length + 1)
'And swap the commas back in
line = line.Replace(source, source.Replace(ph1, ","))
'Find the next quoted field
' If endi >= line.Length - 1 Then endi = line.Length 'During the swap, the length of line shrinks so an endi value at the end of the line will fail
starti = line.IndexOf(ph1 & """", starti + ph1.Length)
End While
line = objReader.ReadLine
End While
objReader.Close()
SaveTextToFile(sb.ToString, OutputFile)
Return count
End Function