Merge overlapping date intervals

Is there a better way of merging overlapping date intervals?
The solution I came up with is so simple that now I wonder if someone else has a better idea of how this could be done.

/***** DATA EXAMPLE *****/
DECLARE @T TABLE (d1 DATETIME, d2 DATETIME)
INSERT INTO @T (d1, d2)
        SELECT '2010-01-01','2010-03-31' UNION SELECT '2010-04-01','2010-05-31' 
  UNION SELECT '2010-06-15','2010-06-25' UNION SELECT '2010-06-26','2010-07-10' 
  UNION SELECT '2010-08-01','2010-08-05' UNION SELECT '2010-08-01','2010-08-09' 
  UNION SELECT '2010-08-02','2010-08-07' UNION SELECT '2010-08-08','2010-08-08' 
  UNION SELECT '2010-08-09','2010-08-12' UNION SELECT '2010-07-04','2010-08-16' 
  UNION SELECT '2010-11-01','2010-12-31' UNION SELECT '2010-03-01','2010-06-13' 

/***** INTERVAL ANALYSIS *****/
WHILE (1=1)  BEGIN
  UPDATE t1 SET t1.d2 = t2.d2
  FROM @T AS t1 INNER JOIN @T AS t2 ON 
            DATEADD(day, 1, t1.d2) BETWEEN t2.d1 AND t2.d2 
  IF @@ROWCOUNT = 0 BREAK
END

/***** RESULT *****/
SELECT StartDate = MIN(d1) , EndDate = d2
FROM @T
GROUP BY d2
ORDER BY StartDate, EndDate

/***** OUTPUT *****/
/*****
StartDate   EndDate
2010-01-01  2010-06-13 
2010-06-15  2010-08-16 
2010-11-01  2010-12-31 
*****/

I was looking for the same solution and came across this post on Combine overlapping datetime to return single overlapping range record.

There is another thread on Packing Date Intervals.

I tested this with various date ranges, including the ones listed here, and it works correctly every time.

SELECT 
       s1.StartDate,
       --t1.EndDate 
       MIN(t1.EndDate) AS EndDate
FROM @T s1 
INNER JOIN @T t1 ON s1.StartDate <= t1.EndDate
  AND NOT EXISTS(SELECT * FROM @T t2 
                 WHERE t1.EndDate >= t2.StartDate AND t1.EndDate < t2.EndDate) 
WHERE NOT EXISTS(SELECT * FROM @T s2 
                 WHERE s1.StartDate > s2.StartDate AND s1.StartDate <= s2.EndDate) 
GROUP BY s1.StartDate 
ORDER BY s1.StartDate

The result is:

StartDate  | EndDate
2010-01-01 | 2010-06-13
2010-06-15 | 2010-06-25
2010-06-26 | 2010-08-16
2010-11-01 | 2010-12-31

You asked this back in 2010 but don't specify any particular version.

An answer for people on SQL Server 2012+

WITH T1
     AS (SELECT *,
                MAX(d2) OVER (ORDER BY d1) AS max_d2_so_far
         FROM   @T),
     T2
     AS (SELECT *,
                CASE
                  WHEN d1 <= DATEADD(DAY, 1, LAG(max_d2_so_far) OVER (ORDER BY d1))
                    THEN 0
                  ELSE 1
                END AS range_start
         FROM   T1),
     T3
     AS (SELECT *,
                SUM(range_start) OVER (ORDER BY d1) AS range_group
         FROM   T2)
SELECT range_group,
       MIN(d1) AS d1,
       MAX(d2) AS d2
FROM   T3
GROUP  BY range_group

Which returns

+-------------+------------+------------+
| range_group |     d1     |     d2     |
+-------------+------------+------------+
|           1 | 2010-01-01 | 2010-06-13 |
|           2 | 2010-06-15 | 2010-08-16 |
|           3 | 2010-11-01 | 2010-12-31 |
+-------------+------------+------------+

DATEADD(DAY, 1 is used because your desired results show you want a period ending on 2010-06-25 to be collapsed into one starting 2010-06-26. For other use cases this may need adjusting.

Here is a solution with just three simple scans. No CTEs, no recursion, no joins, no table updates in a loop, no "group by" — as a result, this solution should scale the best (I think). I think number of scans can be reduced to two, if min and max dates are known in advance; the logic itself just needs two scans — find gaps, applied twice.

declare @datefrom datetime, @datethru datetime

DECLARE @T TABLE (d1 DATETIME, d2 DATETIME)

INSERT INTO @T (d1, d2)

SELECT '2010-01-01','2010-03-31' 
UNION SELECT '2010-03-01','2010-06-13' 
UNION SELECT '2010-04-01','2010-05-31' 
UNION SELECT '2010-06-15','2010-06-25' 
UNION SELECT '2010-06-26','2010-07-10' 
UNION SELECT '2010-08-01','2010-08-05' 
UNION SELECT '2010-08-01','2010-08-09' 
UNION SELECT '2010-08-02','2010-08-07' 
UNION SELECT '2010-08-08','2010-08-08' 
UNION SELECT '2010-08-09','2010-08-12' 
UNION SELECT '2010-07-04','2010-08-16' 
UNION SELECT '2010-11-01','2010-12-31' 

select @datefrom = min(d1) - 1, @datethru = max(d2) + 1 from @t

SELECT 
StartDate, EndDate
FROM
(
    SELECT 
    MAX(EndDate) OVER (ORDER BY StartDate) + 1 StartDate,
    LEAD(StartDate ) OVER (ORDER BY StartDate) - 1 EndDate
    FROM
    (
        SELECT 
        StartDate, EndDate
        FROM
        (
            SELECT 
            MAX(EndDate) OVER (ORDER BY StartDate) + 1 StartDate,
            LEAD(StartDate) OVER (ORDER BY StartDate) - 1 EndDate 
            FROM 
            (
                SELECT d1 StartDate, d2 EndDate from @T 
                UNION ALL 
                SELECT @datefrom StartDate, @datefrom EndDate 
                UNION ALL 
                SELECT @datethru StartDate, @datethru EndDate
            ) T
        ) T
        WHERE StartDate <= EndDate
        UNION ALL 
        SELECT @datefrom StartDate, @datefrom EndDate 
        UNION ALL 
        SELECT @datethru StartDate, @datethru EndDate
    ) T
) T
WHERE StartDate <= EndDate

The result is:

StartDate   EndDate
2010-01-01  2010-06-13
2010-06-15  2010-08-16
2010-11-01  2010-12-31

The idea is to simulate the scanning algorithm for merging intervals. My solution makes sure it works across a wide range of SQL implementations. I've tested it on MySQL, Postgres, SQL-Server 2017, SQLite and even Hive.

Assuming the table schema is the following.

CREATE TABLE t (
  a DATETIME,
  b DATETIME
);

We also assume the interval is half-open like [a,b).

When (a,i,j) is in the table, it shows that there are j intervals covering a, and there are i intervals covering the previous point.

CREATE VIEW r AS 
SELECT a,
       Sum(d) OVER (ORDER BY a ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS i,
       Sum(d) OVER (ORDER BY a ROWS UNBOUNDED PRECEDING) AS j
FROM  (SELECT a, Sum(d) AS d
       FROM   (SELECT a,  1 AS d FROM t
               UNION ALL
               SELECT b, -1 AS d FROM t) e
       GROUP  BY a) f;

We produce all the endpoints in the union of the intervals and pair up adjacent ones. Finally, we produce the set of intervals by only picking the odd-numbered rows.

SELECT a, b
FROM (SELECT a,
             Lead(a)      OVER (ORDER BY a) AS b,
             Row_number() OVER (ORDER BY a) AS n
      FROM   r
      WHERE  j=0 OR i=0 OR i is null) e
WHERE  n%2 = 1;

I've created a sample DB-fiddle and SQL-fiddle. I also wrote a blog post on union intervals in SQL.

A Geometric Approach

Here and elsewhere I've noticed that date packing questions don't provide a geometric approach to this problem. After all, any range, date-ranges included, can be interpreted as a line. So why not convert them to a sql geometry type and utilize geometry::UnionAggregate to merge the ranges.

Why?

This has the advantage of handling all types of overlaps, including fully nested ranges. It also works like any other aggregate query, so it's a little more intuitive in that respect. You also get the bonus of a visual representation of your results if you care to use it. Finally, it is the approach I use for simultaneous range packing (you work with rectangles instead of lines in that case, and there are many more considerations). I just couldn't get the existing approaches to work in that scenario.

This has the disadvantage of requiring more recent versions of SQL Server. It also requires a numbers table and it's annoying to extract the individually produced lines from the aggregated shape. But hopefully in the future Microsoft adds a TVF that allows you to do this easily without a numbers table (or you can just build one yourself). Also, geometrical objects work with floats, so you have conversion annoyances and precision concerns to keep in mind.

Performance-wise I don't know how it compares, but I've done a few things (not shown here) to make it work for me even with large datasets.

Code Description

In 'numbers':

I build a table representing a sequence
Swap it out with your favorite way to make a numbers table.
For a union operation, you won't ever need more rows than in your original table, so I just use it as the base to build it.

In 'mergeLines':

I convert the dates to floats and use those floats to create geometrical points.
In this problem, we're working in 'integer space,' meaning there are no time considerations, and so an begin date in one range that is one day apart from an end date in another should be merged with that other. In order to make that merge happen, we need to convert to 'real space.', so we add 1 to the tail of all ranges (we undo this later).
I then connect these points via STUnion and STEnvelope.
Finally, I merge all these lines via UnionAggregate. The resulting 'lines' geometry object might contain multiple lines, but if they overlap, they turn into one line.

In the outer query:

I use the numbers CTE to extract the individual lines inside 'lines'.
I envelope the lines which here ensures that the lines are stored only as its two endpoints.
I read the endpoint x values and convert them back to their time representations, ensuring to put them back into 'integer space'.

The Code

with 

    numbers as (

        select  row_number() over (order by (select null)) i 
        from    @t

    ),

    mergeLines as (

        select      lines = geometry::UnionAggregate(line)
        from        @t
        cross apply (select line = 
                        geometry::Point(convert(float, d1), 0, 0).STUnion(
                            geometry::Point(convert(float, d2) + 1, 0, 0)
                        ).STEnvelope()
                    ) l

    )

    select      ap.StartDate,
                ap.EndDate
    from        mergeLines ml
    join        numbers n on n.i between 1 and ml.lines.STNumGeometries()
    cross apply (select line = ml.lines.STGeometryN(i).STEnvelope()) l
    cross apply (select 
                    StartDate = convert(datetime,l.line.STPointN(1).STX),
                    EndDate = convert(datetime,l.line.STPointN(3).STX) - 1
                ) ap
    order by    ap.StartDate;

Merge overlapping date intervals

A Geometric Approach

Why?

Code Description

The Code

Related

Recent Posts