pandas loc vs. iloc vs. at vs. iat?
Recently began branching out from my safe place (R) into Python and and am a bit confused by the cell localization/selection in Pandas
. I've read the documentation but I'm struggling to understand the practical implications of the various localization/selection options.
Is there a reason why I should ever use .loc
or .iloc
over at
, and iat
or vice versa? In what situations should I use which method?
Note: future readers be aware that this question is old and was written before pandas v0.20 when there used to exist a function called
.ix
. This method was later split into two -loc
andiloc
- to make the explicit distinction between positional and label based indexing. Please beware thatix
was discontinued due to inconsistent behavior and being hard to grok, and no longer exists in current versions of pandas (>= 1.0).
loc: only work on index
iloc: work on position
at: get scalar values. It's a very fast loc
iat: Get scalar values. It's a very fast iloc
Also,
at
andiat
are meant to access a scalar, that is, a single element in the dataframe, whileloc
andiloc
are ments to access several elements at the same time, potentially to perform vectorized operations.
http://pyciencia.blogspot.com/2015/05/obtener-y-filtrar-datos-de-un-dataframe.html
Updated for pandas
0.20
given that ix
is deprecated. This demonstrates not only how to use loc
, iloc
, at
, iat
, set_value
, but how to accomplish, mixed positional/label based indexing.
loc
- label based
Allows you to pass 1-D arrays as indexers. Arrays can be either slices (subsets) of the index or column, or they can be boolean arrays which are equal in length to the index or columns.
Special Note: when a scalar indexer is passed, loc
can assign a new index or column value that didn't exist before.
# label based, but we can use position values
# to get the labels from the index object
df.loc[df.index[2], 'ColName'] = 3
df.loc[df.index[1:3], 'ColName'] = 3
iloc
- position based
Similar to loc
except with positions rather that index values. However, you cannot assign new columns or indices.
# position based, but we can get the position
# from the columns object via the `get_loc` method
df.iloc[2, df.columns.get_loc('ColName')] = 3
df.iloc[2, 4] = 3
df.iloc[:3, 2:4] = 3
at
- label based
Works very similar to loc
for scalar indexers. Cannot operate on array indexers. Can! assign new indices and columns.
Advantage over loc
is that this is faster.
Disadvantage is that you can't use arrays for indexers.
# label based, but we can use position values
# to get the labels from the index object
df.at[df.index[2], 'ColName'] = 3
df.at['C', 'ColName'] = 3
iat
- position based
Works similarly to iloc
. Cannot work in array indexers. Cannot! assign new indices and columns.
Advantage over iloc
is that this is faster.
Disadvantage is that you can't use arrays for indexers.
# position based, but we can get the position
# from the columns object via the `get_loc` method
IBM.iat[2, IBM.columns.get_loc('PNL')] = 3
set_value
- label based
Works very similar to loc
for scalar indexers. Cannot operate on array indexers. Can! assign new indices and columns
Advantage Super fast, because there is very little overhead!
Disadvantage There is very little overhead because pandas
is not doing a bunch of safety checks. Use at your own risk. Also, this is not intended for public use.
# label based, but we can use position values
# to get the labels from the index object
df.set_value(df.index[2], 'ColName', 3)
set_value
with takable=True
- position based
Works similarly to iloc
. Cannot work in array indexers. Cannot! assign new indices and columns.
Advantage Super fast, because there is very little overhead!
Disadvantage There is very little overhead because pandas
is not doing a bunch of safety checks. Use at your own risk. Also, this is not intended for public use.
# position based, but we can get the position
# from the columns object via the `get_loc` method
df.set_value(2, df.columns.get_loc('ColName'), 3, takable=True)
There are two primary ways that pandas makes selections from a DataFrame.
- By Label
- By Integer Location
The documentation uses the term position for referring to integer location. I do not like this terminology as I feel it is confusing. Integer location is more descriptive and is exactly what .iloc
stands for. The key word here is INTEGER - you must use integers when selecting by integer location.
Before showing the summary let's all make sure that ...
.ix is deprecated and ambiguous and should never be used
There are three primary indexers for pandas. We have the indexing operator itself (the brackets []
), .loc
, and .iloc
. Let's summarize them:
-
[]
- Primarily selects subsets of columns, but can select rows as well. Cannot simultaneously select rows and columns. -
.loc
- selects subsets of rows and columns by label only -
.iloc
- selects subsets of rows and columns by integer location only
I almost never use .at
or .iat
as they add no additional functionality and with just a small performance increase. I would discourage their use unless you have a very time-sensitive application. Regardless, we have their summary:
-
.at
selects a single scalar value in the DataFrame by label only -
.iat
selects a single scalar value in the DataFrame by integer location only
In addition to selection by label and integer location, boolean selection also known as boolean indexing exists.
Examples explaining .loc
, .iloc
, boolean selection and .at
and .iat
are shown below
We will first focus on the differences between .loc
and .iloc
. Before we talk about the differences, it is important to understand that DataFrames have labels that help identify each column and each row. Let's take a look at a sample DataFrame:
df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
'height':[165, 70, 120, 80, 180, 172, 150],
'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
},
index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])
All the words in bold are the labels. The labels, age
, color
, food
, height
, score
and state
are used for the columns. The other labels, Jane
, Nick
, Aaron
, Penelope
, Dean
, Christina
, Cornelia
are used as labels for the rows. Collectively, these row labels are known as the index.
The primary ways to select particular rows in a DataFrame are with the .loc
and .iloc
indexers. Each of these indexers can also be used to simultaneously select columns but it is easier to just focus on rows for now. Also, each of the indexers use a set of brackets that immediately follow their name to make their selections.
.loc selects data only by labels
We will first talk about the .loc
indexer which only selects data by the index or column labels. In our sample DataFrame, we have provided meaningful names as values for the index. Many DataFrames will not have any meaningful names and will instead, default to just the integers from 0 to n-1, where n is the length(number of rows) of the DataFrame.
There are many different inputs you can use for .loc
three out of them are
- A string
- A list of strings
- Slice notation using strings as the start and stop values
Selecting a single row with .loc with a string
To select a single row of data, place the index label inside of the brackets following .loc
.
df.loc['Penelope']
This returns the row of data as a Series
age 4
color white
food Apple
height 80
score 3.3
state AL
Name: Penelope, dtype: object
Selecting multiple rows with .loc with a list of strings
df.loc[['Cornelia', 'Jane', 'Dean']]
This returns a DataFrame with the rows in the order specified in the list:
Selecting multiple rows with .loc with slice notation
Slice notation is defined by a start, stop and step values. When slicing by label, pandas includes the stop value in the return. The following slices from Aaron to Dean, inclusive. Its step size is not explicitly defined but defaulted to 1.
df.loc['Aaron':'Dean']
Complex slices can be taken in the same manner as Python lists.
.iloc selects data only by integer location
Let's now turn to .iloc
. Every row and column of data in a DataFrame has an integer location that defines it. This is in addition to the label that is visually displayed in the output. The integer location is simply the number of rows/columns from the top/left beginning at 0.
There are many different inputs you can use for .iloc
three out of them are
- An integer
- A list of integers
- Slice notation using integers as the start and stop values
Selecting a single row with .iloc with an integer
df.iloc[4]
This returns the 5th row (integer location 4) as a Series
age 32
color gray
food Cheese
height 180
score 1.8
state AK
Name: Dean, dtype: object
Selecting multiple rows with .iloc with a list of integers
df.iloc[[2, -2]]
This returns a DataFrame of the third and second to last rows:
Selecting multiple rows with .iloc with slice notation
df.iloc[:5:3]
Simultaneous selection of rows and columns with .loc and .iloc
One excellent ability of both .loc/.iloc
is their ability to select both rows and columns simultaneously. In the examples above, all the columns were returned from each selection. We can choose columns with the same types of inputs as we do for rows. We simply need to separate the row and column selection with a comma.
For example, we can select rows Jane, and Dean with just the columns height, score and state like this:
df.loc[['Jane', 'Dean'], 'height':]
This uses a list of labels for the rows and slice notation for the columns
We can naturally do similar operations with .iloc
using only integers.
df.iloc[[1,4], 2]
Nick Lamb
Dean Cheese
Name: food, dtype: object
Simultaneous selection with labels and integer location
.ix
was used to make selections simultaneously with labels and integer location which was useful but confusing and ambiguous at times and thankfully it has been deprecated. In the event that you need to make a selection with a mix of labels and integer locations, you will have to make both your selections labels or integer locations.
For instance, if we want to select rows Nick
and Cornelia
along with columns 2 and 4, we could use .loc
by converting the integers to labels with the following:
col_names = df.columns[[2, 4]]
df.loc[['Nick', 'Cornelia'], col_names]
Or alternatively, convert the index labels to integers with the get_loc
index method.
labels = ['Nick', 'Cornelia']
index_ints = [df.index.get_loc(label) for label in labels]
df.iloc[index_ints, [2, 4]]
Boolean Selection
The .loc indexer can also do boolean selection. For instance, if we are interested in finding all the rows where age is above 30 and return just the food
and score
columns we can do the following:
df.loc[df['age'] > 30, ['food', 'score']]
You can replicate this with .iloc
but you cannot pass it a boolean series. You must convert the boolean Series into a numpy array like this:
df.iloc[(df['age'] > 30).values, [2, 4]]
Selecting all rows
It is possible to use .loc/.iloc
for just column selection. You can select all the rows by using a colon like this:
df.loc[:, 'color':'score':2]
The indexing operator, []
, can slice can select rows and columns too but not simultaneously.
Most people are familiar with the primary purpose of the DataFrame indexing operator, which is to select columns. A string selects a single column as a Series and a list of strings selects multiple columns as a DataFrame.
df['food']
Jane Steak
Nick Lamb
Aaron Mango
Penelope Apple
Dean Cheese
Christina Melon
Cornelia Beans
Name: food, dtype: object
Using a list selects multiple columns
df[['food', 'score']]
What people are less familiar with, is that, when slice notation is used, then selection happens by row labels or by integer location. This is very confusing and something that I almost never use but it does work.
df['Penelope':'Christina'] # slice rows by label
df[2:6:2] # slice rows by integer location
The explicitness of .loc/.iloc
for selecting rows is highly preferred. The indexing operator alone is unable to select rows and columns simultaneously.
df[3:5, 'color']
TypeError: unhashable type: 'slice'
Selection by .at
and .iat
Selection with .at
is nearly identical to .loc
but it only selects a single 'cell' in your DataFrame. We usually refer to this cell as a scalar value. To use .at
, pass it both a row and column label separated by a comma.
df.at['Christina', 'color']
'black'
Selection with .iat
is nearly identical to .iloc
but it only selects a single scalar value. You must pass it an integer for both the row and column locations
df.iat[2, 5]
'FL'