Is this the fastest way to group in Pandas?

The following code works well. Just checking: am I using and timing Pandas correctly and is there any faster way? Thanks.

$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> import timeit
>>> pd.__version__
'0.14.1'

def randChar(f, numGrp, N) :
   things = [f%x for x in range(numGrp)]
   return [things[x] for x in np.random.choice(numGrp, N)]

def randFloat(numGrp, N) :
   things = [round(100*np.random.random(),4) for x in range(numGrp)]
   return [things[x] for x in np.random.choice(numGrp, N)]

N=int(1e8)
K=100
DF = pd.DataFrame({
  'id1' : randChar("id%03d", K, N),       # large groups (char)
  'id2' : randChar("id%03d", K, N),       # large groups (char)
  'id3' : randChar("id%010d", N//K, N),   # small groups (char)
  'id4' : np.random.choice(K, N),         # large groups (int)
  'id5' : np.random.choice(K, N),         # large groups (int)
  'id6' : np.random.choice(N//K, N),      # small groups (int)            
  'v1' :  np.random.choice(5, N),         # int in range [1,5]
  'v2' :  np.random.choice(5, N),         # int in range [1,5]
  'v3' :  randFloat(100,N)                # numeric e.g. 23.5749
})

Now time 5 different groupings, repeating each one twice to confirm the timing. [I realise timeit(2) runs it twice, but then it reports the total. I'm interested in the time of the first and second run separately.] Python uses about 10G of RAM according to htop during these tests.

>>> timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})"                            ,"from __main__ import DF").timeit(1)
5.604133386000285
>>> timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})"                            ,"from __main__ import DF").timeit(1)
5.505057081000359

>>> timeit.Timer("DF.groupby(['id1','id2']).agg({'v1':'sum'})"                      ,"from __main__ import DF").timeit(1)
14.232032927000091
>>> timeit.Timer("DF.groupby(['id1','id2']).agg({'v1':'sum'})"                      ,"from __main__ import DF").timeit(1)
14.242601240999647

>>> timeit.Timer("DF.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})"               ,"from __main__ import DF").timeit(1)
22.87025260900009
>>> timeit.Timer("DF.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})"               ,"from __main__ import DF").timeit(1)
22.393589012999655

>>> timeit.Timer("DF.groupby(['id4']).agg({'v1':'mean', 'v2':'mean', 'v3':'mean'})" ,"from __main__ import DF").timeit(1)
2.9725865330001398
>>> timeit.Timer("DF.groupby(['id4']).agg({'v1':'mean', 'v2':'mean', 'v3':'mean'})" ,"from __main__ import DF").timeit(1)
2.9683854739996605

>>> timeit.Timer("DF.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})"    ,"from __main__ import DF").timeit(1)
12.776488024999708
>>> timeit.Timer("DF.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})"    ,"from __main__ import DF").timeit(1)
13.558292575999076

Here is system info :

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2500.048
BogoMIPS:              5066.38
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

$ free -h
             total       used       free     shared    buffers     cached
Mem:          240G        74G       166G       372K        33M       550M
-/+ buffers/cache:        73G       166G
Swap:           0B         0B         0B

I don't believe it's relevant but just in case, the randChar function above is a workaround for a memory error in mtrand.RandomState.choice :

How to solve memory error in mtrand.RandomState.choice?


If you'd like to install the iPython shell, you can easily time your code using %timeit. After installing it, instead of typing python to launch the python interpreter, you would type ipython.

Then you can type your code exactly as you would type it in the normal interpreter (as you did above).

Then you can type, for example:

%timeit DF.groupby(['id1']).agg({'v1':'sum'})

This will accomplish exactly the same thing as what you've done, but if you're using python a lot I find that this will save you significant typing time :).

Ipython has a lot of other nice features (like %paste, which I used to paste in your code and test this, or %run to run a script you've saved in a file), tab completion, etc. http://ipython.org/