Is this the fastest way to group in Pandas?
The following code works well. Just checking: am I using and timing Pandas correctly and is there any faster way? Thanks.
$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> import timeit
>>> pd.__version__
'0.14.1'
def randChar(f, numGrp, N) :
things = [f%x for x in range(numGrp)]
return [things[x] for x in np.random.choice(numGrp, N)]
def randFloat(numGrp, N) :
things = [round(100*np.random.random(),4) for x in range(numGrp)]
return [things[x] for x in np.random.choice(numGrp, N)]
N=int(1e8)
K=100
DF = pd.DataFrame({
'id1' : randChar("id%03d", K, N), # large groups (char)
'id2' : randChar("id%03d", K, N), # large groups (char)
'id3' : randChar("id%010d", N//K, N), # small groups (char)
'id4' : np.random.choice(K, N), # large groups (int)
'id5' : np.random.choice(K, N), # large groups (int)
'id6' : np.random.choice(N//K, N), # small groups (int)
'v1' : np.random.choice(5, N), # int in range [1,5]
'v2' : np.random.choice(5, N), # int in range [1,5]
'v3' : randFloat(100,N) # numeric e.g. 23.5749
})
Now time 5 different groupings, repeating each one twice to confirm the timing. [I realise timeit(2)
runs it twice, but then it reports the total. I'm interested in the time of the first and second run separately.] Python uses about 10G of RAM according to htop
during these tests.
>>> timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})" ,"from __main__ import DF").timeit(1)
5.604133386000285
>>> timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})" ,"from __main__ import DF").timeit(1)
5.505057081000359
>>> timeit.Timer("DF.groupby(['id1','id2']).agg({'v1':'sum'})" ,"from __main__ import DF").timeit(1)
14.232032927000091
>>> timeit.Timer("DF.groupby(['id1','id2']).agg({'v1':'sum'})" ,"from __main__ import DF").timeit(1)
14.242601240999647
>>> timeit.Timer("DF.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})" ,"from __main__ import DF").timeit(1)
22.87025260900009
>>> timeit.Timer("DF.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})" ,"from __main__ import DF").timeit(1)
22.393589012999655
>>> timeit.Timer("DF.groupby(['id4']).agg({'v1':'mean', 'v2':'mean', 'v3':'mean'})" ,"from __main__ import DF").timeit(1)
2.9725865330001398
>>> timeit.Timer("DF.groupby(['id4']).agg({'v1':'mean', 'v2':'mean', 'v3':'mean'})" ,"from __main__ import DF").timeit(1)
2.9683854739996605
>>> timeit.Timer("DF.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})" ,"from __main__ import DF").timeit(1)
12.776488024999708
>>> timeit.Timer("DF.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})" ,"from __main__ import DF").timeit(1)
13.558292575999076
Here is system info :
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Stepping: 4
CPU MHz: 2500.048
BogoMIPS: 5066.38
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
$ free -h
total used free shared buffers cached
Mem: 240G 74G 166G 372K 33M 550M
-/+ buffers/cache: 73G 166G
Swap: 0B 0B 0B
I don't believe it's relevant but just in case, the randChar
function above is a workaround for a memory error in mtrand.RandomState.choice
:
How to solve memory error in mtrand.RandomState.choice?
If you'd like to install the iPython shell, you can easily time your code using %timeit. After installing it, instead of typing python
to launch the python interpreter, you would type ipython
.
Then you can type your code exactly as you would type it in the normal interpreter (as you did above).
Then you can type, for example:
%timeit DF.groupby(['id1']).agg({'v1':'sum'})
This will accomplish exactly the same thing as what you've done, but if you're using python a lot I find that this will save you significant typing time :).
Ipython has a lot of other nice features (like %paste
, which I used to paste in your code and test this, or %run
to run a script you've saved in a file), tab completion, etc.
http://ipython.org/