GROUP BY and aggregate sequential numeric values
Solution 1:
Identifying non-consecutive values is always a bit tricky and involves several nested sub-queries (at least I cannot come up with a better solution).
The first step is to identify non-consecutive values for the year:
Step 1) Identify non-consecutive values
select company,
profession,
year,
case
when row_number() over (partition by company, profession order by year) = 1 or
year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
else 0
end as group_cnt
from qualification
This returns the following result:
company | profession | year | group_cnt ---------+------------+------+----------- Google | Programmer | 2000 | 1 Google | Sales | 2000 | 1 Google | Sales | 2001 | 0 Google | Sales | 2002 | 0 Google | Sales | 2004 | 1 Mozilla | Sales | 2002 | 1
Now with the group_cnt value we can create "group IDs" for each group that has consecutive years:
Step 2) Define group IDs
select company,
profession,
year,
sum(group_cnt) over (order by company, profession, year) as group_nr
from (
select company,
profession,
year,
case
when row_number() over (partition by company, profession order by year) = 1 or
year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
else 0
end as group_cnt
from qualification
) t1
This returns the following result:
company | profession | year | group_nr ---------+------------+------+---------- Google | Programmer | 2000 | 1 Google | Sales | 2000 | 2 Google | Sales | 2001 | 2 Google | Sales | 2002 | 2 Google | Sales | 2004 | 3 Mozilla | Sales | 2002 | 4 (6 rows)
As you can see each "group" got its own group_nr and this we can finally use to aggregate over by adding yet another derived table:
Step 3) Final query
select company,
profession,
array_agg(year) as years
from (
select company,
profession,
year,
sum(group_cnt) over (order by company, profession, year) as group_nr
from (
select company,
profession,
year,
case
when row_number() over (partition by company, profession order by year) = 1 or
year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
else 0
end as group_cnt
from qualification
) t1
) t2
group by company, profession, group_nr
order by company, profession, group_nr
This returns the following result:
company | profession | years ---------+------------+------------------ Google | Programmer | {2000} Google | Sales | {2000,2001,2002} Google | Sales | {2004} Mozilla | Sales | {2002} (4 rows)
Which is exactly what you wanted, if I'm not mistaken.
Solution 2:
There's much value to @a_horse_with_no_name's answer, both as a correct solution and, like I already said in a comment, as a good material for learning how to use different kinds of window functions in PostgreSQL.
And yet I cannot help feeling that the approach taken in that answer is a bit too much of an effort for a problem like this one. Basically, what you need is an additional criterion for grouping before you go on aggregating years in arrays. You've already got company
and profession
, now you only need something to distinguish years that belong to different sequences.
That is just what the above mentioned answer provides and that is precisely what I think can be done in a simpler way. Here's how:
WITH MarkedForGrouping AS (
SELECT
company,
profession,
year,
year - ROW_NUMBER() OVER (
PARTITION BY company, profession
ORDER BY year
) AS seqID
FROM atable
)
SELECT
company,
profession,
array_agg(year) AS years
FROM MarkedForGrouping
GROUP BY
company,
profession,
seqID
Solution 3:
Procedural solution with PL/pgSQL
The problem is rather unwieldy for plain SQL with aggregate / windows functions. While looping is typically slower than set-based solutions with plain SQL, a procedural solution with plpgsql can make do with a single sequential scan over the table (implicit cursor of a FOR
loop) and should be substantially faster in this particular case:
Test table:
CREATE TEMP TABLE tbl (company text, profession text, year int);
INSERT INTO tbl VALUES
('Google', 'Programmer', 2000)
,('Google', 'Sales', 2000)
,('Google', 'Sales', 2001)
,('Google', 'Sales', 2002)
,('Google', 'Sales', 2004)
,('Mozilla', 'Sales', 2002);
Function:
CREATE OR REPLACE FUNCTION f_periods()
RETURNS TABLE (company text, profession text, years int[]) AS
$func$
DECLARE
r tbl; -- use table type as row variable
r0 tbl;
BEGIN
FOR r IN
SELECT * FROM tbl t ORDER BY t.company, t.profession, t.year
LOOP
IF ( r.company, r.profession, r.year)
<> (r0.company, r0.profession, r0.year + 1) THEN -- not true for first row
RETURN QUERY
SELECT r0.company, r0.profession, years; -- output row
years := ARRAY[r.year]; -- start new array
ELSE
years := years || r.year; -- add to array - year can be NULL, too
END IF;
r0 := r; -- remember last row
END LOOP;
RETURN QUERY -- output last iteration
SELECT r0.company, r0.profession, years;
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_periods();
Produces the requested result.