VLOOKUP/ETL with Python

I have data that comes from MS SQL Server. The query returns a list of names straight from a public database. For instance, if I wanted records with the name "Microwave", something like this would come back:

Microwave
Microwvae
Mycrowwave
Microwavee

Microwave would be spelt in hundreds of ways. I currently solve this with a VLOOKUP in Excel, which looks for the value in the left-hand column and returns the value in the column to its right. For example:

VLOOKUP(A1,$A$1:$B$4,2,FALSE)
Table:
    A              B
1   Microwave    Microwave
2   Microwvae    Microwave
3   Mycrowwave   Microwave
4   Microwavee   Microwave

I would just copy the VLOOKUP formula down the CSV or Excel file and then use that information for my analysis.

Is there a way in Python to solve this issue in another way?

I could make a long if/elif list or even a replace list and apply it to each line of the CSV, but that would save no more time than just using the VLOOKUP. There are thousands of company names spelt wrong, and I do not have the clearance to change the database.

So Stack, any ideas on how to leverage Python in this scenario?

If you have data like this:

+-------------+-----------+
|    typo     |   word    |
+-------------+-----------+
| microweeve  | microwave |
| microweevil | microwave |
| macroworv   | microwave |
| murkeywater | microwave |
+-------------+-----------+

Save it as typo_map.csv, as plain comma-separated text with typo,word as the header row.

Then run (in the same directory):

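import csv

def OpenToDict(path, index):
    """Read a CSV into a dict of dicts, keyed by the values in the `index` column (Python 3)."""
    with open(path, newline='') as f:
        reader = csv.reader(f)
        headings = next(reader)
        # Map each column heading to its position in the row.
        heading_nums = {}
        for i, v in enumerate(headings):
            heading_nums[v] = i
        # Every column other than the index becomes a field in the inner dict.
        fields = [heading for heading in headings if heading != index]
        file_dictionary = {}
        for row in reader:
            file_dictionary[row[heading_nums[index]]] = {}
            for field in fields:
                file_dictionary[row[heading_nums[index]]][field] = row[heading_nums[field]]
    return file_dictionary


# Named typo_map rather than map, to avoid shadowing the built-in map().
typo_map = OpenToDict('typo_map.csv', 'typo')

print(typo_map['microweevil']['word'])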

The structure is slightly more complex than it needs to be for your situation, but that's because this function was originally written to look up more than one column. However, it will work for you, and you can simplify it yourself if you want.
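
For the two-column case in the question, the simplification could look something like the sketch below. It assumes Python 3 and a file with a typo,word header row; the names load_typo_map and correct are made up for illustration and are not part of the function above. Names that are not in the map are passed through unchanged, which is a design choice rather than anything VLOOKUP does (VLOOKUP would return #N/A). The demo writes the sample table to a temporary file only so that it runs on its own.

```python
import csv
import os
import tempfile


def load_typo_map(path):
    """Read a two-column CSV (header: typo,word) into a flat {typo: word} dict."""
    with open(path, newline='') as f:
        return {row['typo']: row['word'] for row in csv.DictReader(f)}


def correct(name, typo_map):
    """Return the mapped spelling, or the name unchanged if it is not in the map."""
    return typo_map.get(name, name)


# Demo only: write the sample map to a temporary file, then apply it.
demo_path = os.path.join(tempfile.mkdtemp(), 'typo_map.csv')
with open(demo_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['typo', 'word'])
    writer.writerows([
        ['microweeve', 'microwave'],
        ['microweevil', 'microwave'],
        ['macroworv', 'microwave'],
        ['murkeywater', 'microwave'],
    ])

typo_map = load_typo_map(demo_path)
names = ['microweevil', 'murkeywater', 'toaster']
corrected = [correct(n, typo_map) for n in names]
print(corrected)  # ['microwave', 'microwave', 'toaster']
```

One thing to watch: dictionary lookups are case-sensitive, whereas VLOOKUP is not, so you may want to normalise both the map keys and the incoming names (for example with .strip().lower()) before looking them up.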
