Skip column if it doesn't exist while creating df pandas python usecols
It appears that read_csv raises a ValueError when it can't find a column specified in the usecols parameter. You could either use a try/except block and skip the files that raise the error:
for fullPath in listFilenamesPath:
    try:
        df = pd.read_csv(fullPath, sep=";", usecols=['name', 'hostname', 'application family'])
    except ValueError:
        # at least one requested column is missing: skip this file
        continue
Or catch the error, parse the missing column names out of the message, and retry with the remaining subset. There is probably a cleaner way to do this:
import pandas as pd
import re

usecols = ['name', 'hostname', 'application family']
for fullPath in listFilenamesPath:
    usecols_ = usecols
    while usecols_:
        try:
            df = pd.read_csv(fullPath, sep=";", usecols=usecols_)
            break
        except ValueError as e:
            # the error message lists the missing columns inside square brackets
            r = re.search(r"\[(.+)\]", str(e))
            if r is None:
                raise  # a different ValueError, nothing to parse
            # strip the quotes but keep inner spaces (e.g. 'application family')
            missing_cols = [c.strip().strip("'\"") for c in r.group(1).split(",")]
            usecols_ = [x for x in usecols_ if x not in missing_cols]
    """
    rest of your code
    """
A workaround could be to get the column names that appear both in your usecols list (the list of columns you want to look for) and in df.columns. You can then use this list of common column names to subset your df.
The code with necessary comments:
import pandas as pd

### the column names you want to look for in the dataframes
usecols = ['name', 'hostname', 'application family']
nrFiles = 0
for fullPath in listFilenamesPath:
    ### read the entire dataframe without usecols
    df = pd.read_csv(fullPath, sep=";")
    ### get the column names that appear in both usecols list as well as df.columns
    final_list = list(set(usecols) & set(df.columns))
    ### subset it using the final_list
    df = df[final_list]
    ### write your df to csv and continue as usual
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
print(nrFiles, "files converted")
Demo:
Here is a CSV with the df:
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
I want to look for the columns:
usecols = ['A', 'D', 'B']
I read the entire CSV, take the columns common to the df and the list I am looking for (in this case A and B), and subset the df as follows:
df = pd.read_csv('test1.csv')
final_list = list(set(usecols) & set(df.columns))
df = df[final_list]
print(df)
Output (the column order may vary, since sets are unordered):
   B  A
0  4  1
1  5  2
2  6  3
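If the files are large, reading every column just to discard most of them can be wasteful. A possible variant of the same idea, sketched below, reads only the header row first (nrows=0) and then passes the columns that actually exist to usecols. The helper name read_existing_columns and the in-memory sample data are made up for illustration; they are not part of the original answer.

```python
import io
import pandas as pd


def read_existing_columns(source, wanted, **read_csv_kwargs):
    """Read only those columns from `wanted` that are present in the CSV.

    Hypothetical helper: `source` is a path or a seekable file-like object.
    """
    # nrows=0 parses just the header row, so this first pass is cheap.
    header = pd.read_csv(source, nrows=0, **read_csv_kwargs).columns
    # A list comprehension keeps the order of `wanted`, unlike a set intersection.
    present = [col for col in wanted if col in header]
    # Rewind in-memory buffers so the second read starts from the top.
    if hasattr(source, "seek"):
        source.seek(0)
    df = pd.read_csv(source, usecols=present, **read_csv_kwargs)
    # read_csv returns columns in file order; reindex to the requested order.
    return df[present]


# Same data as the demo above, with ';' as separator; column 'D' does not exist.
csv_file = io.StringIO("A;B;C\n1;4;7\n2;5;8\n3;6;9\n")
df = read_existing_columns(csv_file, ['A', 'D', 'B'], sep=";")
print(df)
```

With a real file you would pass the path instead of the StringIO object, e.g. read_existing_columns(fullPath, usecols, sep=";").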
This is a bit late, but the usecols parameter can be a callable. To quote the docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True.
check_cols = ['name', 'hostname', 'application family']
df = pd.read_csv(
    fullPath,
    sep=";",
    usecols=lambda x: x in check_cols
)
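To see the callable form in action without touching real files, here is a small self-contained sketch. The sample CSV text (with an 'ip' column and no 'application family' column) is invented for the demonstration; only check_cols comes from the snippet above.

```python
import io
import pandas as pd

check_cols = ['name', 'hostname', 'application family']

# Made-up sample data: 'application family' is absent, 'ip' is not wanted.
csv_text = "name;hostname;ip\nweb;srv01;10.0.0.1\ndb;srv02;10.0.0.2\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    # The callable is evaluated per column name; missing names never raise.
    usecols=lambda col: col in check_cols,
)

# Optionally report which requested columns were not found in this file.
missing = [col for col in check_cols if col not in df.columns]
print(list(df.columns))
print("missing:", missing)
```

Because the callable is only asked about columns that exist in the file, a requested column that is absent is silently skipped, so it can be worth checking the result afterwards as done with the missing list here.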