Key error when selecting columns in pandas dataframe after read_csv

I'm trying to read in a CSV file into a pandas dataframe and select a column, but keep getting a key error.

The file reads in successfully and I can view the dataframe in an iPython notebook, but when I want to select a column any other than the first one, it throws a key error.

I am using this code:

import pandas as pd

transactions = pd.read_csv('transactions.csv',low_memory=False, delimiter=',', header=0, encoding='ascii')
transactions['quarter']

This is the file I'm working on: https://www.dropbox.com/s/81iwm4f2hsohsq3/transactions.csv?dl=0

Thank you!


use sep='\s*,\s*' so that you will take care of spaces in column-names:

transactions = pd.read_csv('transactions.csv', sep=r'\s*,\s*',
                           header=0, encoding='ascii', engine='python')

alternatively you can make sure that you don't have unquoted spaces in your CSV file and use your command (unchanged)

prove:

print(transactions.columns.tolist())

Output:

['product_id', 'customer_id', 'store_id', 'promotion_id', 'month_of_year', 'quarter', 'the_year', 'store_sales', 'store_cost', 'unit_sales', 'fact_count']

if you need to select multiple columns from dataframe use 2 pairs of square brackets eg.

df[["product_id","customer_id","store_id"]]

I met the same problem that key errors occur when filtering the columns after reading from CSV.

Reason

The main reason of these problems is the extra initial white spaces in your CSV files. (found in your uploaded CSV file, e.g. , customer_id, store_id, promotion_id, month_of_year, )

Proof

To prove this, you could try print(list(df.columns)) and the names of columns must be ['product_id', ' customer_id', ' store_id', ' promotion_id', ' month_of_year', ...].

Solution

The direct way to solve this is to add the parameter in pd.read_csv(), for example:

pd.read_csv('transactions.csv', 
            sep = r',', 
            skipinitialspace = True)

Reference: https://stackoverflow.com/a/32704818/16268870


The key error generally comes if the key doesn't match any of the dataframe column name 'exactly':

You could also try:

import csv
import pandas as pd
import re
    with open (filename, "r") as file:
        df = pd.read_csv(file, delimiter = ",")
        df.columns = ((df.columns.str).replace("^ ","")).str.replace(" $","")
        print(df.columns)