Python groupby behaviour?
>>> from itertools import groupby
>>> keyfunc = lambda x : x > 500
>>> obj = dict(groupby(range(1000), keyfunc))
>>> list(obj[True])
[999]
>>> list(obj[False])
[]
range(1000) is already sorted with respect to the condition (x > 500).
I was expecting the numbers from 0 to 999 to be grouped in a dict by the condition (x > 500), but the resulting dictionary contained only 999.
Where are the other numbers?
Can anyone explain what is happening here?
Solution 1:
From the docs:
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list[.]
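You are storing the group iterators in obj and only materializing them later, after groupby() has already advanced past them. Convert each group to a list while you iterate instead: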
In [21]: dict((k, list(g)) for k, g in groupby(range(10), lambda x : x > 5))
Out[21]: {False: [0, 1, 2, 3, 4, 5], True: [6, 7, 8, 9]}
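To see the quoted sentence in action, here is a minimal sketch that advances groupby() by hand. The variable names are only illustrative. It takes the first group without reading it, advances to the second group, and then checks what each group still yields.

```python
from itertools import groupby

groups = groupby(range(10), lambda x: x > 5)

# Take the first group, but do not read anything from it yet.
first_key, first_group = next(groups)

# Advancing groupby() skips the rest of the first group in the shared source.
second_key, second_group = next(groups)

# Read the current group first, then the stale one.
second_items = list(second_group)
first_items = list(first_group)

print(first_key, first_items)    # False []
print(second_key, second_items)  # True [6, 7, 8, 9]
```

The first group comes back empty because its items were consumed when groupby() moved on, which is why each group has to be turned into a list before the next one is requested.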
Solution 2:
The groupby iterator returns tuples of the outcome of the grouping function and a new iterator that is tied to the same "outer" iterator the groupby operator is working on. When you apply dict() to the iterator returned by groupby without consuming this "inner" iterator, groupby will have to advance the "outer" iterator for you. You have to realize that the groupby function does not act on a sequence; it turns any such sequence into an iterator for you.
Perhaps this is better explained with some metaphors and handwaving. Please follow along as we form a bucket line.
Imagine iterators as a person drawing water in buckets from a well. He has an unlimited number of buckets to use, but the well may be finite. Every time you ask this person for a bucket of water, he'll draw a new bucket from the well of water and pass it to you.
In the groupby case, you insert another person into your budding bucket chain. This person doesn't immediately pass buckets at all. Every time you ask him for a bucket, he passes you the outcome of the instructions you gave him plus another person, who will then pass buckets via the groupby person to whoever is asking, as long as they match the same outcome of the instructions. The groupby bucket passer will stop passing these buckets if the outcome of the instructions changes. So well gives buckets to groupby, who passes them to a per-group person, group A, group B, and so on.
In your example, the water is numbered, but there can only be 1000 buckets drawn from the well. Here is what happens when you then pass the groupby person to the dict() call:
1. Your dict() call asks groupby for a bucket. Now, groupby asks for one bucket from the person at the well, remembers the outcome of the instructions given, holding on to the bucket. To dict() he'll pass the outcome of the instructions (False) plus a new person, group A. The outcome is stored as the key, and the group A person, who wants to pull buckets, is stored as the value. This person is not yet asking for buckets however, because no-one is asking it to.
2. Your dict() call asks groupby for another bucket. groupby has these instructions, and goes looking for the next bucket where the outcome changes. It was still holding on to the first bucket, no-one asked for it, so it throws away this bucket. Instead, it asks for the next bucket from the well and uses his instructions. The outcome is the same as before, so it throws this new bucket away too! More water goes over the floor, and so go the next 499 buckets. Only when the bucket with number 501 is passed does the outcome change, so now groupby finds another person to give instructions to (person group B), together with the new outcome, True, passing these two on to dict().
3. Your dict() call stores True as a key, and person group B as the value. group B does nothing, no-one is asking it for water.
4. Your dict() asks for another bucket. groupby spills more water, until it holds the bucket with the number 999, and the person at the well shrugs his shoulders and states that now the well is empty. groupby tells dict() the well is empty, no more buckets are coming, could he please stop asking. It still holds the bucket with number 999, because it never has to make space for the next bucket from the well.
5. Now you come along, asking dict() for the thing associated with the key True, which is person group B. You pass group B to list(), which will therefore ask group B for all the buckets group B can get. group B goes back to groupby, who holds one bucket only, the bucket with number 999, and the outcome of the instructions for this bucket matches what group B is looking for. So this one bucket group B gives to list(), then shrugs his shoulders because there are no more buckets, because groupby told him so.
6. You then ask dict() for the person associated with the key False, which is person group A. By now, groupby has nothing to give any more, the well is dry and he's standing in a puddle of 999 buckets of water with numbers floating around. Your second list() gets nothing.
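The bucket line can be traced with a small sketch. The generator `well` and the list `drawn` are hypothetical names introduced here only to record which numbers leave the source. As far as I know, newer Python releases invalidate a group as soon as groupby is advanced, so the leftover 999 from the story may come back as an empty list there. The sketch therefore prints the leftovers without assuming which case applies.

```python
from itertools import groupby

drawn = []  # every bucket the "well" has handed out so far


def well(n):
    """Hypothetical source that records each number as it is drawn."""
    for number in range(n):
        drawn.append(number)
        yield number


obj = dict(groupby(well(1000), lambda x: x > 500))

# dict() alone has already emptied the well, before any group was read.
drawn_before_reading = len(drawn)

leftover_true = list(obj[True])    # at most the single held bucket, 999
leftover_false = list(obj[False])  # nothing is left for the first group

print(drawn_before_reading)  # 1000
print(leftover_true)
print(leftover_false)        # []
```

All 1000 numbers are drawn during the dict() call itself, so by the time list() is applied to the stored groups there is essentially nothing left to hand over.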
The moral of this story? Immediately ask for all buckets of water when talking to groupby, because he'll spill them all if you do not! Iterators are like the brooms in Fantasia, diligently moving water without understanding, and you had better hope you run out of water if you do not know how to control them.
Here is code that would do what you expect (with a little bit less water to prevent flooding):
>>> from itertools import groupby
>>> keyfunc = lambda x : x > 5
>>> obj = dict((k, list(v)) for k, v in groupby(range(10), keyfunc))
>>> obj[True]
[6, 7, 8, 9]
>>> obj[False]
[0, 1, 2, 3, 4, 5]
Solution 3:
What you are missing is that groupby iterates lazily over your given range(1000). By the time you read the stored groups, the whole range has already been consumed, and only the last value, in your case 999, is still available. What you have to do is iterate over the returned (key, group) pairs and save each group to your dictionary as a list:
from itertools import groupby

dictionary = {}
keyfunc = lambda x : x > 500
for k, g in groupby(range(1000), keyfunc):
    # materialize each group before groupby advances to the next one
    dictionary[k] = list(g)
You would then get the expected output:
{False: [0, 1, 2, ...], True: [501, 502, 503, ...]}
For more information, see the Python docs about itertools groupby.
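One caveat: groupby only groups consecutive items that share a key, which works here because range(1000) is already ordered with respect to the condition. If the input might not be ordered that way, a plain loop over a defaultdict is one possible alternative. The helper name `group_into_dict` below is hypothetical, not part of itertools.

```python
from collections import defaultdict


def group_into_dict(items, keyfunc):
    """Hypothetical helper: collect items into lists keyed by keyfunc(item)."""
    grouped = defaultdict(list)
    for item in items:
        grouped[keyfunc(item)].append(item)
    return dict(grouped)


# The input is deliberately unordered with respect to the condition.
obj = group_into_dict([700, 3, 999, 42, 501], lambda x: x > 500)

print(obj[True])   # [700, 999, 501]
print(obj[False])  # [3, 42]
```

This builds every list eagerly, so there are no shared iterators to go stale, at the cost of holding all items in memory at once.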