Django Count and Sum annotations interfere with each other
While constructing a complexe QuerySet
with several annotations, I ran into an issue that I could reproduce with the following simple setup.
Here are the models:
class Player(models.Model):
name = models.CharField(max_length=200)
class Unit(models.Model):
player = models.ForeignKey(Player, on_delete=models.CASCADE,
related_name='unit_set')
rarity = models.IntegerField()
class Weapon(models.Model):
unit = models.ForeignKey(Unit, on_delete=models.CASCADE,
related_name='weapon_set')
With my test database, I obtain the following (correct) results:
Player.objects.annotate(weapon_count=Count('unit_set__weapon_set'))
[{'id': 1, 'name': 'James', 'weapon_count': 23},
{'id': 2, 'name': 'Max', 'weapon_count': 41},
{'id': 3, 'name': 'Bob', 'weapon_count': 26}]
Player.objects.annotate(rarity_sum=Sum('unit_set__rarity'))
[{'id': 1, 'name': 'James', 'rarity_sum': 42},
{'id': 2, 'name': 'Max', 'rarity_sum': 89},
{'id': 3, 'name': 'Bob', 'rarity_sum': 67}]
If I now combine both annotations in the same QuerySet
, I obtain a different (inaccurate) results:
Player.objects.annotate(
weapon_count=Count('unit_set__weapon_set', distinct=True),
rarity_sum=Sum('unit_set__rarity'))
[{'id': 1, 'name': 'James', 'weapon_count': 23, 'rarity_sum': 99},
{'id': 2, 'name': 'Max', 'weapon_count': 41, 'rarity_sum': 183},
{'id': 3, 'name': 'Bob', 'weapon_count': 26, 'rarity_sum': 113}]
Notice how rarity_sum
have now different values than before. Removing distinct=True
does not affect the result. I also tried to use the DistinctSum
function from this answer, in which case all rarity_sum
are set to 18
(also inaccurate).
Why is this? How can I combine both annotations in the same QuerySet
?
Edit: here is the sqlite query generated by the combined QuerySet:
SELECT "sandbox_player"."id",
"sandbox_player"."name",
COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count",
SUM("sandbox_unit"."rarity") AS "rarity_sum"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"
The data used for the results above is available here.
Solution 1:
This isn't the problem with Django ORM, this is just the way relational databases work. When you're constructing simple querysets like
Player.objects.annotate(weapon_count=Count('unit_set__weapon_set'))
or
Player.objects.annotate(rarity_sum=Sum('unit_set__rarity'))
ORM does exactly what you expect it to do - join Player
with Weapon
SELECT "sandbox_player"."id", "sandbox_player"."name", COUNT("sandbox_weapon"."id") AS "weapon_count"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit"
ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
LEFT OUTER JOIN "sandbox_weapon"
ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"
or Player
with Unit
SELECT "sandbox_player"."id", "sandbox_player"."name", SUM("sandbox_unit"."rarity") AS "rarity_sum"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"
and perform either COUNT
or SUM
aggregation on them.
Note that although the first query has two joins between three tables, the intermediate table Unit
is neither in columns referenced in SELECT
, nor in the GROUP BY
clause. The only role that Unit
plays here is to join Player
with Weapon
.
Now if you look at your third queryset, things get more complicated. Again, as in the first query the joins are between three tables, but now Unit
is referenced in SELECT
as there is SUM
aggregation for Unit.rarity
:
SELECT "sandbox_player"."id",
"sandbox_player"."name",
COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count",
SUM("sandbox_unit"."rarity") AS "rarity_sum"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"
And this is the crucial difference between the second and the third queries. In the second query, you're joining Player
to Unit
, so a single Unit
will be listed once for each player that it references.
But in the third query you're joining Player
to Unit
and then Unit
to Weapon
, so not only a single Unit
will be listed once for each player that it references, but also for each weapon that references Unit
.
Let's take a look at the simple example:
insert into sandbox_player values (1, "player_1");
insert into sandbox_unit values(1, 10, 1);
insert into sandbox_weapon values (1, 1), (2, 1);
One player, one unit and two weapons that reference the same unit.
Confirm that the problem exists:
>>> from sandbox.models import Player
>>> from django.db.models import Count, Sum
>>> Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2}]>
>>> Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'rarity_sum': 10}]>
>>> Player.objects.annotate(
... weapon_count=Count('unit_set__weapon_set', distinct=True),
... rarity_sum=Sum('unit_set__rarity')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 20}]>
From this example it's easy to see that the problem is that in the combined query the unit will be listed twice, one time for each of the weapons referencing it:
sqlite> SELECT "sandbox_player"."id",
...> "sandbox_player"."name",
...> "sandbox_weapon"."id",
...> "sandbox_unit"."rarity"
...> FROM "sandbox_player"
...> LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
...> LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id");
id name id rarity
---------- ---------- ---------- ----------
1 player_1 1 10
1 player_1 2 10
What should you do?
As @ivissani mentioned, one of the easiest solutions would be to write subqueries for each of the aggregations:
>>> from django.db.models import Count, IntegerField, OuterRef, Subquery, Sum
>>> weapon_count = Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).filter(pk=OuterRef('pk'))
>>> rarity_sum = Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).filter(pk=OuterRef('pk'))
>>> qs = Player.objects.annotate(
... weapon_count=Subquery(weapon_count.values('weapon_count'), output_field=IntegerField()),
... rarity_sum=Subquery(rarity_sum.values('rarity_sum'), output_field=IntegerField())
... )
>>> qs.values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 10}]>
which produces the following SQL
SELECT "sandbox_player"."id", "sandbox_player"."name",
(
SELECT COUNT(U2."id") AS "weapon_count"
FROM "sandbox_player" U0
LEFT OUTER JOIN "sandbox_unit" U1
ON (U0."id" = U1."player_id")
LEFT OUTER JOIN "sandbox_weapon" U2
ON (U1."id" = U2."unit_id")
WHERE U0."id" = ("sandbox_player"."id")
GROUP BY U0."id", U0."name"
) AS "weapon_count",
(
SELECT SUM(U1."rarity") AS "rarity_sum"
FROM "sandbox_player" U0
LEFT OUTER JOIN "sandbox_unit" U1
ON (U0."id" = U1."player_id")
WHERE U0."id" = ("sandbox_player"."id")
GROUP BY U0."id", U0."name") AS "rarity_sum"
FROM "sandbox_player"
Solution 2:
A few notes to complement rktavi's excellent answer:
1) This issues has apparently been considered a bug for 10 years already. It is even referred to in the official documentation.
2) While converting my actual project's QuerySets to subqueries (as per rktavi's answer), I noticed that combining bare-bone annotations (for the distinct=True
counts that always worked correctly) with a Subquery
(for the sums) yields extremely long processing (35 sec vs. 100 ms) and incorrect results for the sum. This is true in my actual setup (11 filtered counts on various nested relations and 1 filtered sum on a multiply-nested relation, SQLite3) but cannot be reproduced with the simple models above. This issue can be tricky because another part of your code could add an annotation to your QuerySet (e.g a Table.order_FOO()
function), leading to the issue.
3) With the same setup, I have anecdotical evidence that subquery-type QuerySets are faster compared to bare-bone annotation QuerySets (in cases where you have only distinct=True
counts, of course). I could observe this both with local SQLite3 (83 ms vs 260 ms) and hosted PostgreSQL (320 ms vs 540 ms).
As a result of the above, I will completely avoid using bare-bone annotations in favour of subqueries.
Solution 3:
Based on the excellent answer from @rktavi, I created two helpers classes that simplify the Subquery
/Count
and Subquery
/Sum
patterns:
class SubqueryCount(Subquery):
template = "(SELECT count(*) FROM (%(subquery)s) _count)"
output_field = PositiveIntegerField()
class SubquerySum(Subquery):
template = '(SELECT sum(_sum."%(column)s") FROM (%(subquery)s) _sum)'
def __init__(self, queryset, column, output_field=None, **extra):
if output_field is None:
output_field = queryset.model._meta.get_field(column)
super().__init__(queryset, output_field, column=column, **extra)
One can use these helpers like so:
from django.db.models import OuterRef
weapons = Weapon.objects.filter(unit__player_id=OuterRef('id'))
units = Unit.objects.filter(player_id=OuterRef('id'))
qs = Player.objects.annotate(weapon_count=SubqueryCount(weapons),
rarity_sum=SubquerySum(units, 'rarity'))
Solution 4:
Thanks @rktavi for your amazing answer!!
Here's my use case:
Using Django DRF.
I needed to get Sum and Count from different FK's inside the annotate so that it would all be part of one queryset in order to add these fields to the ordering_fields in DRF.
The Sum and Count were clashing and returning wrong results. Your answer really helped me put it all together.
The annotate was occasionally returning the dates as strings
, so I needed to Cast it to DateTimeField.
donation_filter = Q(payments__status='donated') & ~Q(payments__payment_type__payment_type='coupon')
total_donated_SQ = User.objects.annotate(total_donated=Sum('payments__sum', filter=donation_filter )).filter(pk=OuterRef('pk'))
message_count_SQ = User.objects.annotate(message_count=Count('events__id', filter=Q(events__event_id=6))).filter(pk=OuterRef('pk'))
queryset = User.objects.annotate(
total_donated=Subquery(total_donated_SQ.values('total_donated'), output_field=IntegerField()),
last_donation_date=Cast(Max('payments__updated', filter=donation_filter ), output_field=DateTimeField()),
message_count=Subquery(message_count_SQ.values('message_count'), output_field=IntegerField()),
last_message_date=Cast(Max('events__updated', filter=Q(events__event_id=6)), output_field=DateTimeField())
)