Indexing Pandas data frames: integer rows, named columns
Say df
is a pandas dataframe.
-
df.loc[]
only accepts names -
df.iloc[]
only accepts integers (actual placements) -
df.ix[]
accepts both names and integers:
When referencing rows, df.ix[row_idx, ]
only wants to be given names. e.g.
df = pd.DataFrame({'a' : ['one', 'two', 'three','four', 'five', 'six'],
'1' : np.arange(6)})
df = df.ix[2:6]
print(df)
1 a
2 2 three
3 3 four
4 4 five
5 5 six
df.ix[0, 'a']
throws an error, it doesn't give return 'two'.
When referencing columns, iloc is prefers integers, not names. e.g.
df.ix[2, 1]
returns 'three', not 2. (Although df.idx[2, '1']
does return 2
).
Oddly, I'd like the exact opposite functionality. Usually my column names are very meaningful, so in my code I reference them directly. But due to a lot of observation cleaning, the row names in my pandas data frames don't usually correspond to range(len(df))
.
I realize I can use:
df.iloc[0].loc['a'] # returns three
But it seems ugly! Does anyone know of a better way to do this, so that the code would look like this?
df.foo[0, 'a'] # returns three
In fact, is it possible to add on my own new method to pandas.core.frame.DataFrame
s, so e.g.
df.idx(rows, cols)
is in fact df.iloc[rows].loc[cols]
?
It's a late answer, but @unutbu's comment is still valid and a great solution to this problem.
To index a DataFrame with integer rows and named columns (labeled columns):
df.loc[df.index[#], 'NAME']
where #
is a valid integer index and NAME
is the name of the column.
The existing answers seem short-sighted to me.
Problematic Solutions
-
df.loc[df.index[0], 'a']
The strategy here is to get the row label of the 0th row and then use.loc
as normal. I see two issues.- If df has repeated row labels,
df.loc[df.index[0], 'a']
could return multiple rows. -
.loc
is slower than.iloc
so you're sacrificing speed here.
- If df has repeated row labels,
-
df.reset_index(drop=True).loc[0, 'a']
The strategy here is to reset the index so the row labels become 0, 1, 2, ... thus.loc[0]
gives the same result as.iloc[0]
. Still, the problem here is runtime, as.loc
is slower than.iloc
and you'll incur a cost for resetting the index.
Better Solution
I suggest following @Landmaster's comment:
df.iloc[0, df.columns.get_loc("a")]
Essentially, this is the same as df.iloc[0, 0]
except we get the column index dynamically using df.columns.get_loc("a")
.
To index multiple columns such as ['a', 'b', 'c']
, use:
df.iloc[0, [df.columns.get_loc(c) for c in ['a', 'b', 'c']]]
Update
This is discussed here as part of my course on Pandas.
we can reset the index and then use 0 based indexing like this
df.reset_index(drop=True).loc[0,'a']
edit: removed []
from col name index 'a'
so it just outputs the value
For getting or setting a single value in a DataFrame
by row/column labels, you better use DataFrame.at
instead of DataFrame.loc
, as it is ...
- faster
- you are more explicit about wanting to access only a single value.
How others have already shown, if you start out with an integer position for the row, you still have to find the row-label first with DataFrame.index
as DataFrame.at
only accepts labels:
df.at[df.index[0], 'a']
# Out: 'three'
Benchmark:
%timeit df.at[df.index[0], 'a']
# 7.57 µs ± 30.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit df.loc[df.index[0], 'a']
# 10.9 µs ± 53.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit df.iloc[0, df.columns.get_loc("a")]
# 13.3 µs ± 24 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For completeness:
DataFrame.iat
for accessing a single value for a row/column pair by integer position.
A very late answer but it amzed me that pandas still doesn't have such a function after all these years. If it irks you a lot, you can monkey-patch a custom indexer into the DataFrame:
class XLocIndexer:
def __init__(self, frame):
self.frame = frame
def __getitem__(self, key):
row, col = key
return self.frame.iloc[row][col]
pd.core.indexing.IndexingMixin.xloc = property(lambda frame: XLocIndexer(frame))
# Usage
df.xloc[0, 'a'] # one