Want VBA in excel to read very large CSV and create output file of a small subset of the CSV

Solution 1:

The following code should do the trick. I don't have Excel in front of me, so I haven't tested it, but the concept is sound.

If this ends up being too slow, we can look at ways to improve the efficiency.

Sub SelectSomeRecords()
    Dim testLine As String

    Open inputFileName For Input As #1
    Open outputFileName For Output As #2

    While Not EOF(1)
        Line Input #1, testLine
        If RecordIsInteresting(testLine) Then
            Print #2, testLine
        End If
    Wend

    Close #1
    Close #2
End Sub

Function RecordIsInteresting(recordLine As String) As Boolean
    Dim lineItems(1 to 8) As String

    GetRecordItems(lineItems(), recordLine)

    ''// do your custom checking here:
    RecordIsInteresting = lineItems(8) = "LS1 7AA"
End Function

Sub GetRecordItems(items() As String, recordLine as String)
    Dim finishString as Boolean
    Dim itemString as String
    Dim itemIndex as Integer
    Dim charIndex as Long
    Dim inQuote as Boolean
    Dim testChar as String

    inQuote = False
    charIndex = 1
    itemIndex = 1
    itemString = ""
    finishString = False

    While charIndex <= Len(recordLine)
        testChar = Mid$(recordLine, charIndex, 1)

        finishString = False

        If inQuote Then
            If testChar = Chr$(34) Then
                inQuote = False
                finishString = True
                charIndex = charIndex + 1 ''// ignore the next comma
            Else
                itemString = itemString + testChar
            End If
        Else
            If testChar = Chr$(34) Then
                inQuote = True
            ElseIf testChar = "," Then
                finishString = True
            Else
                itemString = itemString + testChar
            End If
        End If

        If finishString Then
            items(itemIndex) = itemString
            itemString = ""
            itemIndex = itemIndex + 1
        End If

        charIndex = charIndex + 1
    Wend
End Sub

Solution 2:

How about VBScript, though this would also work in Excel:

Set cn = CreateObject("ADODB.Connection")

'Note HDR=Yes, that is, first row contains field names '
'and FMT delimted, ie CSV '

strCon="Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\Docs\;" _
& "Extended Properties=""text;HDR=Yes;FMT=Delimited"";"

cn.open strcon

'You would not need delimiters ('') if last field is numeric: '    
strSQL="SELECT FieldName1, FieldName2 INTO New.csv FROM Old.csv " _
& " WHERE LastFieldName='SomeTextValue'"

'Creates new csv file
cn.Execute strSQL

Solution 3:

This doesn't directly answer your question, but grep (or one of the Windows equivalents) would really shine for this, e.g.,

grep -e <regex_filter> foo.csv > bar.csv

Solution 4:

I used the following derivative of the code given above to successfully open an arbitrary csv file from VBA in Excel.

Option Explicit
Public cn As Connection
Public Sub DoIt()
Dim strcon As String
Dim strsql As String
Dim rs As Recordset

Set cn = CreateObject("ADODB.Connection")

strcon = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\bin\HomePlanet\;" _
& "Extended Properties=""text;HDR=Yes;FMT=Delimited"";"

cn.Open strcon

strsql = "SELECT * FROM astuname.csv "
Set rs = New ADODB.Recordset
rs.Open strsql, cn
DoEvents ' pause here to inspect objects and properties rs.Close
End Sub

The rs (recordset) has a collection of fields, with a Count property. Each field as a Type property.

You can reference the fields by sequence number ...

Debug.Print rs.Fields(rs.Fields.Count - 1).Type

Is this sufficient?

If not, post the first several rows of the input file and I'll take it the rest of the way.

Solution 5:

Look at the Input # statement in the Excel help

Sample usage would be:

Input #fnInput, s_Forename, s_Surname, dt_DOB, i_Something, s_Street, s_Town, s_County, s_Postcode

and then use the Write # statement to write matching records out again

The only issue might be that the date format in the output will end up as #1967-07-01# but this format is unambiguous unlike 01/07/1967 which would represent 1st July in the UK and 7th January in the US. If you need to preserve the formatting of the date then write it out as a string:

s_DOB = Format(dt_DOB, "dd/mm/yyyy")