Skip column if it doesn't exist while creating df (pandas, Python, usecols)

It appears read_csv raises a ValueError when it can't find a column specified in the usecols parameter. You could either use a try/except block and skip the files that raise the error:

import pandas as pd

for fullPath in listFilenamesPath:
    try:
        df = pd.read_csv(fullPath, sep=";", usecols=['name', 'hostname', 'application family'])
    except ValueError:
        # read failed (e.g. a requested column is missing): skip this file
        continue
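
If it helps to know which files were skipped, the same try/except idea can record them instead of dropping them silently. This is only a sketch: the helper name read_or_skip, the constant WANTED and the two in-memory "files" are made up for illustration and stand in for listFilenamesPath.

```python
import io

import pandas as pd

WANTED = ['name', 'hostname', 'application family']


def read_or_skip(sources, usecols, sep=";"):
    """Return (frames, skipped): frames keyed by name, plus the names whose read raised ValueError."""
    frames, skipped = {}, []
    for key, source in sources.items():
        try:
            frames[key] = pd.read_csv(source, sep=sep, usecols=usecols)
        except ValueError:
            # e.g. a column listed in usecols is not present in this file
            skipped.append(key)
    return frames, skipped


# Made-up in-memory CSVs: the first has every wanted column, the second does not
sources = {
    "complete.csv": io.StringIO("name;hostname;application family;extra\na;h1;web;1\n"),
    "partial.csv": io.StringIO("name;hostname\nb;h2\n"),
}
frames, skipped = read_or_skip(sources, WANTED)
print("read:", list(frames), "skipped:", skipped)
```

Keep in mind that pandas' own parsing errors such as pandas.errors.ParserError also derive from ValueError, so malformed files would end up in the skipped list as well.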

Or catch the error, parse the missing column names out of the error message, and retry with the remaining subset. There is probably a cleaner way to do this.

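import ast
import re

import pandas as pd

usecols = ['name', 'hostname', 'application family']
for fullPath in listFilenamesPath:
    usecols_ = list(usecols)
    df = None  # stays None if none of the columns exist in this file
    while usecols_:
        try:
            df = pd.read_csv(fullPath, sep=";", usecols=usecols_)
            break
        except ValueError as e:
            # the message ends with the missing columns as a list, e.g. "...: ['D']"
            r = re.search(r"\[.+\]", str(e))
            if r is None:
                raise  # some other ValueError: re-raise instead of looping forever
            # literal_eval keeps names containing spaces ('application family') intact
            missing_cols = ast.literal_eval(r.group(0))
            usecols_ = [x for x in usecols_ if x not in missing_cols]

    """
        rest of your code
    """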

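One possibly cleaner variant avoids parsing the error message: read only the header row first, keep the wanted columns that are actually there, then read the file with that subset. The sketch below assumes a hypothetical helper read_available and a throwaway temporary file with made-up contents; nrows=0 is a regular read_csv parameter that yields just the column names.

```python
import os
import tempfile

import pandas as pd


def read_available(path, wanted, sep=";"):
    """Read only those columns from `wanted` that exist in the file at `path`."""
    # nrows=0 parses the header row only, so this first pass is cheap
    header = pd.read_csv(path, sep=sep, nrows=0).columns
    present = [c for c in wanted if c in header]
    return pd.read_csv(path, sep=sep, usecols=present)


with tempfile.TemporaryDirectory() as tmp:
    # Made-up file that lacks the 'application family' column
    path = os.path.join(tmp, "partial.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write("name;hostname;other\nsrv1;host-a;x\n")
    df = read_available(path, ['name', 'hostname', 'application family'])

print(df.columns.tolist())
```

The cost is that each file is opened twice, but the second read skips the unwanted columns entirely.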
A workaround could be to get column names that appear both in your usecols list (the list of columns you want to look for) as well as df.columns. You can then use this list of common column names to subset your df.

The code with necessary comments:

import pandas as pd

### the column names you want to look for in the dataframes
usecols = ['name', 'hostname', 'application family']
### counter for the processed files
nrFiles = 0

for fullPath in listFilenamesPath:
    ### read the entire dataframe without usecols
    df = pd.read_csv(fullPath, sep=";")
    ### get the column names that appear in both usecols list as well as df.columns
    final_list = list(set(usecols) & set(df.columns))
    ### subset it using the final_list
    df = df[final_list]
    ### write your df to csv and continue as usual
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
    print(nrFiles, "files converted")

Demo:

Here is a csv with the df:

I want to look for the columns:

usecols = ['A', 'D', 'B']

I read the entire CSV, take the columns that are common to the df and the list I am looking for (in this case A and B), and subset the df as follows:

df = pd.read_csv('test1.csv')
final_list = list(set(usecols) & set(df.columns))
df = df[final_list]
print(df)

Output:
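
The sample file is not reproduced here, so as a stand-in, here is a self-contained sketch of the same idea on made-up data (columns A, B and C, no D). It uses a list comprehension instead of a set intersection, which is an assumption on my part to keep the column order stable; a set does not guarantee any order.

```python
import io

import pandas as pd

# Made-up CSV content: has A, B and C, but no column D
csv_text = "A,B,C\n1,2,3\n4,5,6\n"
usecols = ['A', 'D', 'B']

df = pd.read_csv(io.StringIO(csv_text))
# keep the wanted columns that exist, in the order they were requested
final_list = [c for c in usecols if c in df.columns]
df = df[final_list]
print(df)
```

The resulting df holds only columns A and B; the missing D is ignored and C is dropped.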

This is a bit late, but the usecols parameter can be a callable function. To quote the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True.

check_cols = ['name', 'hostname', 'application family']
df = pd.read_csv(
    fullPath,
    sep=";",
    usecols=lambda x: x in check_cols
)
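
A quick way to see the callable form in action without touching real files is an in-memory CSV. The data below is made up and deliberately lacks the 'application family' column; the callable just returns False for everything not in check_cols, so no error is raised.

```python
import io

import pandas as pd

check_cols = ['name', 'hostname', 'application family']

# Made-up CSV content without an 'application family' column
csv_text = "name;hostname;other\nsrv1;host-a;x\nsrv2;host-b;y\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    usecols=lambda x: x in check_cols,
)
print(df.columns.tolist())
```

Only name and hostname come back: other is filtered out, and the absent 'application family' is simply never matched.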
