Why isn't my Pandas 'apply' function referencing multiple columns working?
Solution 1:
It seems you forgot the quotes ('') around your strings: the column names must be quoted.
In [43]: df['Value'] = df.apply(lambda row: my_test(row['a'], row['c']), axis=1)
In [44]: df
Out[44]:
          a    b         c     Value
0 -1.674308  foo  0.343801  0.044698
1 -2.163236  bar -2.046438 -0.116798
2 -0.199115  foo -0.458050 -0.199115
3  0.918646  bar -0.007185 -0.001006
4  1.336830  foo  0.534292  0.268245
5  0.976844  bar -0.773630 -0.570417
BTW, in my opinion, the following way is more elegant:
In [53]: def my_test2(row):
   ....:     return row['a'] % row['c']
   ....:
In [54]: df['Value'] = df.apply(my_test2, axis=1)
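For reference, here is a minimal self-contained sketch of the row-wise pattern. The sample values and the helper name row_mod are made up for illustration (the original frame held random numbers), so treat it as one possible way to write it rather than the exact code from the question.

```python
import pandas as pd

# Hypothetical sample data; the original frame used random values.
df = pd.DataFrame({
    "a": [7.0, 8.0, 9.0],
    "b": ["foo", "bar", "foo"],
    "c": [2.0, 3.0, 4.0],
})

def row_mod(row):
    # With axis=1, each call receives one row as a Series,
    # so columns are looked up by their quoted names.
    return row["a"] % row["c"]

df["Value"] = df.apply(row_mod, axis=1)
print(df["Value"].tolist())
```

Writing row[a] instead of row["a"] makes Python look for a variable named a, which is what raises the error when no such variable exists.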
Solution 2:
If you just want to compute (column a) % (column c), you don't need apply; just do it directly:
In [7]: df['a'] % df['c']
Out[7]:
0 -1.132022
1 -0.939493
2 0.201931
3 0.511374
4 -0.694647
5 -0.023486
Name: a
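As a small sketch of the same idea with the result stored in a new column: the sample values below are assumptions picked so the output is easy to check by hand, not data from the question.

```python
import pandas as pd

# Hypothetical sample data chosen so the results are easy to verify.
df = pd.DataFrame({"a": [7.0, 8.0, 9.0], "c": [2.0, 3.0, 4.0]})

# The % operator works element-wise on whole columns, aligned by index.
df["Value"] = df["a"] % df["c"]

# Same values as the row-wise apply, without a Python-level loop over rows.
row_wise = df.apply(lambda row: row["a"] % row["c"], axis=1)
print(df["Value"].tolist() == row_wise.tolist())
```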
Solution 3:
Let's say we want to apply a function add5 to columns 'a' and 'b' of DataFrame df:
def add5(x):
    return x + 5
df[['a', 'b']].apply(add5)
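A runnable sketch of this, assuming both selected columns are numeric (the frame below is hypothetical; adding 5 to a string column such as 'foo'/'bar' would raise a TypeError). Note that apply returns a new DataFrame, so the result has to be assigned back if the original columns should change.

```python
import pandas as pd

# Hypothetical frame with two numeric columns and one string column.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": ["x", "y", "z"]})

def add5(x):
    # Without axis=1, apply passes each selected column as a Series.
    return x + 5

# apply returns a new frame; assign it back to update the columns.
df[["a", "b"]] = df[["a", "b"]].apply(add5)
print(df)
```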
Solution 4:
All of the suggestions above work, but if you want your computations to be more efficient, you should take advantage of numpy vector operations.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': np.random.randn(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': np.random.randn(6)})
Example 1: looping with pandas.apply():
%%timeit
def my_test2(row):
    return row['a'] % row['c']

df['Value'] = df.apply(my_test2, axis=1)
The slowest run took 7.49 times longer than the fastest. This could mean that an intermediate result is being cached. 1000 loops, best of 3: 481 µs per loop
Example 2: vectorize using pandas Series:
%%timeit
df['a'] % df['c']
The slowest run took 458.85 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 70.9 µs per loop
Example 3: vectorize using numpy arrays:
%%timeit
df['a'].values % df['c'].values
The slowest run took 7.98 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 6.39 µs per loop
So vectorizing using numpy arrays improved the speed by almost two orders of magnitude compared with apply (6.39 µs vs. 481 µs per loop, roughly 75 times faster).
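Outside IPython, a comparison along these lines can be sketched with the standard timeit module. The function names, the frame size of 1000 rows, and the repeat count are arbitrary choices for illustration, and the absolute timings will differ between machines and pandas versions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data: 1000 rows of random floats.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.standard_normal(1000),
                   "c": rng.standard_normal(1000)})

def by_apply():
    # Row-wise: calls the lambda once per row.
    return df.apply(lambda row: row["a"] % row["c"], axis=1)

def by_series():
    # Vectorized on pandas Series (includes index alignment overhead).
    return df["a"] % df["c"]

def by_numpy():
    # Vectorized on the underlying numpy arrays.
    return df["a"].to_numpy() % df["c"].to_numpy()

repeats = 20
for fn in (by_apply, by_series, by_numpy):
    seconds = timeit.timeit(fn, number=repeats)
    print(f"{fn.__name__}: {seconds / repeats * 1e6:.1f} microseconds per call")
```

All three functions compute the same values; only the amount of per-element Python overhead differs.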