Multiple array_agg() calls in a single query
I'm trying to accomplish something with my query but it's not really working. My application used to have a mongo db so the application is used to get arrays in a field, now we had to change to Postgres and I don't want to change my applications code to keep v1 working.
In order to get arrays in 1 field within Postgres I used array_agg()
function. And this worked fine so far. However, I'm at a point where I need another array in a field from another different table.
For example:
I have my employees. employees have multiple address and have multiple workdays.
SELECT name, age, array_agg(ad.street) FROM employees e
JOIN address ad ON e.id = ad.employeeid
GROUP BY name, age
Now this worked fine for me, this would result in for example:
| name | age| array_agg(ad.street)
| peter | 25 | {1st street, 2nd street}|
Now I want to join another table for working days so I do:
SELECT name, age, array_agg(ad.street), arrag_agg(wd.day) FROM employees e
JOIN address ad ON e.id = ad.employeeid
JOIN workingdays wd ON e.id = wd.employeeid
GROUP BY name, age
This results in:
| peter | 25 | {1st street, 1st street, 1st street, 1st street, 1st street, 2nd street, 2nd street, 2nd street, 2nd street, 2nd street}| "{Monday,Tuesday,Wednesday,Thursday,Friday,Monday,Tuesday,Wednesday,Thursday,Friday}
But I need it to result:
| peter | 25 | {1st street, 2nd street}| {Monday,Tuesday,Wednesday,Thursday,Friday}
I understand it has to do with my joins, because of the multiple joins the rows multiple but I don't know how to accomplish this, can anyone give me the correct tip?
Solution 1:
DISTINCT
is often applied to repair queries that are rotten from the inside, and that's often slow and / or incorrect. Don't multiply rows to begin with, then you don't have to sort out unwanted duplicates at the end.
Joining to multiple n-tables ("has many") at once multiplies rows in the result set. That's like a CROSS JOIN
or Cartesian product by proxy:
- Two SQL LEFT JOINS produce incorrect result
There are various ways to avoid this mistake.
Aggregate first, join later
Technically, the query works as long as you join to one table with multiple rows at a time before you aggregate:
SELECT e.id, e.name, e.age, e.streets, arrag_agg(wd.day) AS days
FROM (
SELECT e.id, e.name, e.age, array_agg(ad.street) AS streets
FROM employees e
JOIN address ad ON ad.employeeid = e.id
GROUP BY e.id -- id enough if it is defined PK
) e
JOIN workingdays wd ON wd.employeeid = e.id
GROUP BY e.id, e.name, e.age;
It's also best to include the primary key id
and GROUP BY
it, because name
and age
are not necessarily unique. You could merge two employees by mistake.
But you can aggregate in a subquery before you join, that's superior unless you have selective WHERE
conditions on employees
:
SELECT e.id, e.name, e.age, ad.streets, arrag_agg(wd.day) AS days
FROM employees e
JOIN (
SELECT employeeid, array_agg(ad.street) AS streets
FROM address
GROUP BY 1
) ad ON ad.employeeid = e.id
JOIN workingdays wd ON e.id = wd.employeeid
GROUP BY e.id, e.name, e.age, ad.streets;
Or aggregate both:
SELECT name, age, ad.streets, wd.days
FROM employees e
JOIN (
SELECT employeeid, array_agg(ad.street) AS streets
FROM address
GROUP BY 1
) ad ON ad.employeeid = e.id
JOIN (
SELECT employeeid, arrag_agg(wd.day) AS days
FROM workingdays
GROUP BY 1
) wd ON wd.employeeid = e.id;
The last one is typically faster if you retrieve all or most of the rows in the base tables.
Note that using JOIN
and not LEFT JOIN
removes employees from the result who have no address or no workingdays. That may or may not be intended. Switch to LEFT JOIN
to retain all employees in the result.
Correlated subqueries / LATERAL join
For a small selection, I would consider correlated subqueries instead:
SELECT name, age
, (SELECT array_agg(street) FROM address WHERE employeeid = e.id) AS streets
, (SELECT arrag_agg(day) FROM workingdays WHERE employeeid = e.id) AS days
FROM employees e
WHERE e.namer = 'peter'; -- very selective
Or, with Postgres 9.3 or later, you can use LATERAL
joins for that:
SELECT e.name, e.age, a.streets, w.days
FROM employees e
LEFT JOIN LATERAL (
SELECT array_agg(street) AS streets
FROM address
WHERE employeeid = e.id
GROUP BY 1
) a ON true
LEFT JOIN LATERAL (
SELECT array_agg(day) AS days
FROM workingdays
WHERE employeeid = e.id
GROUP BY 1
) w ON true
WHERE e.name = 'peter'; -- very selective
- What is the difference between LATERAL and a subquery in PostgreSQL?
Either query retains all employees in the result.
Solution 2:
Whenever you need values that aren't repeated, use DISTINCT, like so:
SELECT name, age, array_agg(DISTINCT ad.street), array_agg(DISTINCT wd.day) FROM employees e
JOIN address ad ON e.id = ad.employeeid
JOIN workingdays wd ON e.id = wd.employeeid
GROUP BY name, age