Unexpected behaviour with a conditional generator expression [duplicate]
Solution 1:
Python's generator expressions are late binding (see PEP 289 -- Generator Expressions) (what the other answers call "lazy"):
Early Binding versus Late Binding
After much discussion, it was decided that the first (outermost) for-expression [of the generator expression] should be evaluated immediately and that the remaining expressions be evaluated when the generator is executed.
[...] Python takes a late binding approach to lambda expressions and has no precedent for automatic, early binding. It was felt that introducing a new paradigm would unnecessarily introduce complexity.
After exploring many possibilities, a consensus emerged that binding issues were hard to understand and that users should be strongly encouraged to use generator expressions inside functions that consume their arguments immediately. For more complex applications, full generator definitions are always superior in terms of being obvious about scope, lifetime, and binding.
That means it only evaluates the outermost for
when creating the generator expression. So it actually binds the value with the name array
in the "subexpression" in array
(in fact it's binding the equivalent to iter(array)
at this point). But when you iterate over the generator the if array.count
call actually refers to what is currently named array
.
Since it's actually a list
not an array
I changed the variable names in the rest of the answer to be more accurate.
In your first case the list
you iterate over and the list
you count in will be different. It's as if you used:
list1 = [1, 2, 2, 4, 5]
list2 = [5, 6, 1, 2, 9]
f = (x for x in list1 if list2.count(x) == 2)
So you check for each element in list1
if its count in list2
is two.
You can easily verify this by modifying the second list:
>>> lst = [1, 2, 2]
>>> f = (x for x in lst if lst.count(x) == 2)
>>> lst = [1, 1, 2]
>>> list(f)
[1]
If it iterated over the first list and counted in the first list it would've returned [2, 2]
(because the first list contains two 2
). If it iterated over and counted in the second list the output should be [1, 1]
. But since it iterates over the first list (containing one 1
) but checks the second list (which contains two 1
s) the output is just a single 1
.
Solution using a generator function
There are several possible solutions, I generally prefer not to use "generator expressions" if they aren't iterated over immediately. A simple generator function will suffice to make it work correctly:
def keep_only_duplicated_items(lst):
for item in lst:
if lst.count(item) == 2:
yield item
And then use it like this:
lst = [1, 2, 2, 4, 5]
f = keep_only_duplicated_items(lst)
lst = [5, 6, 1, 2, 9]
>>> list(f)
[2, 2]
Note that the PEP (see the link above) also states that for anything more complicated a full generator definition is preferrable.
A better solution using a generator function with a Counter
A better solution (avoiding the quadratic runtime behavior because you iterate over the whole array for each element in the array) would be to count (collections.Counter
) the elements once and then do the lookup in constant time (resulting in linear time):
from collections import Counter
def keep_only_duplicated_items(lst):
cnts = Counter(lst)
for item in lst:
if cnts[item] == 2:
yield item
Appendix: Using a subclass to "visualize" what happens and when it happens
It's quite easy to create a list
subclass that prints when specific methods are called, so one can verify that it really works like that.
In this case I just override the methods __iter__
and count
because I'm interested over which list the generator expression iterates and in which list it counts. The method bodies actually just delegate to the superclass and print something (since it uses super
without arguments and f-strings it requires Python 3.6 but it should be easy to adapt for other Python versions):
class MyList(list):
def __iter__(self):
print(f'__iter__() called on {self!r}')
return super().__iter__()
def count(self, item):
cnt = super().count(item)
print(f'count({item!r}) called on {self!r}, result: {cnt}')
return cnt
This is a simple subclass just printing when the __iter__
and count
method are called:
>>> lst = MyList([1, 2, 2, 4, 5])
>>> f = (x for x in lst if lst.count(x) == 2)
__iter__() called on [1, 2, 2, 4, 5]
>>> lst = MyList([5, 6, 1, 2, 9])
>>> print(list(f))
count(1) called on [5, 6, 1, 2, 9], result: 1
count(2) called on [5, 6, 1, 2, 9], result: 1
count(2) called on [5, 6, 1, 2, 9], result: 1
count(4) called on [5, 6, 1, 2, 9], result: 0
count(5) called on [5, 6, 1, 2, 9], result: 1
[]
Solution 2:
As others have mentioned Python generators are lazy. When this line is run:
f = (x for x in array if array.count(x) == 2) # Filters original
nothing actually happens yet. You've just declared how the generator function f will work. Array is not looked at yet. Then, you create a new array that replaces the first one, and finally when you call
print(list(f)) # Outputs filtered
the generator now needs the actual values and starts pulling them from the generator f. But at this point, array already refers to the second one, so you get an empty list.
If you need to reassign the list, and can't use a different variable to hold it, consider creating the list instead of a generator in the second line:
f = [x for x in array if array.count(x) == 2] # Filters original
...
print(f)
Solution 3:
Others have already explained the root cause of the issue - the generator is binding to the name of the array
local variable, rather than its value.
The most pythonic solution is definitely the list comprehension:
f = [x for x in array if array.count(x) == 2]
However, if there is some reason that you don't want to create a list, you can also force a scope close over array
:
f = (lambda array=array: (x for x in array if array.count(x) == 2))()
What's happening here is that the lambda
captures the reference to array
at the time the line is run, ensuring that the generator sees the variable you expect, even if the variable is later redefined.
Note that this still binds to the variable (reference), not the value, so, for example, the following will print [2, 2, 4, 4]
:
array = [1, 2, 2, 4, 5] # Original array
f = (lambda array=array: (x for x in array if array.count(x) == 2))() # Close over array
array.append(4) # This *will* be captured
array = [5, 6, 1, 2, 9] # Updates original to something else
print(list(f)) # Outputs [2, 2, 4, 4]
This is a common pattern in some languages, but it's not very pythonic, so only really makes sense if there's a very good reason for not using the list comprehension (e.g., if array
is very long, or is being used in a nested generator comprehension, and you're concerned about memory).
Solution 4:
You are not using a generator correctly if this is the primary use of this code. Use a list comprehension instead of a generator comprehension. Just replace the parentheses with brackets. It evaluates to a list if you don't know.
array = [1, 2, 2, 4, 5]
f = [x for x in array if array.count(x) == 2]
array = [5, 6, 1, 2, 9]
print(f)
#[2, 2]
You are getting this response because of the nature of a generator. You're calling the generator when it't contents will evaluate to []
Solution 5:
Generators are lazy, they won't be evaluated until you iterate through them. In this case that's at the point you create the list
with the generator as input, at the print
.