Replacement for materialized view on PostgreSQL

I have a table with three columns: creationTime, number, id. It gets a new record every 15 seconds or so. I have been using a materialized view to track duplicates, like so:

 SELECT number, id, count(*) AS dups_count
 FROM my_table
 GROUP BY number, id HAVING count(*) > 1;
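
Roughly, the view definition and refresh look like this (the view name is illustrative):

 CREATE MATERIALIZED VIEW my_table_dups AS
 SELECT number, id, count(*) AS dups_count
 FROM my_table
 GROUP BY number, id HAVING count(*) > 1;

 REFRESH MATERIALIZED VIEW my_table_dups;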

The table contains thousands of records spanning the last 1.5 years. Refreshing this materialized view now takes about 2 minutes, and I would like a better solution. There are no quick-refresh (incremental) materialized views available for PostgreSQL.

At first I thought a trigger on the table that refreshes the materialized view could be a solution, but if records come in every 15 seconds and the view takes over 2 minutes to recalculate, that would not work. In any case, I don't like the idea of recalculating the same data over and over again.

Is there a better solution to it?


A trigger that increments a duplicate count might be a solution:

create table duplicates
(
  number int, 
  id int, 
  dups_count int,
  primary key (number, id)
);

The primary key will allow an efficient "UPSERT" that increments the dups_count in case of duplicates.

Then create a trigger that updates that table each time a row is inserted into the base table:

create function increment_dupes()
  returns trigger
as
$$
begin
   insert into duplicates (number, id, dups_count)
   values (new.number, new.id, 1)
   on conflict (number,id)
   do update 
     set dups_count = duplicates.dups_count + 1;
   return new;
end
$$
language plpgsql;

create trigger update_dups_count
  after insert on my_table
  for each row
  execute function increment_dupes();

Each time you insert into my_table, either a new row will be created in duplicates or the existing dups_count will be incremented. If you delete or update rows in my_table, you will also need triggers for that. Note, however, that adjusting the count for UPDATEs or DELETEs is not entirely safe under concurrent operations, whereas the INSERT ... ON CONFLICT is.
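
For deletes, a companion trigger might look roughly like this (a sketch; the trigger and function names are made up, and as noted, the plain decrement is not safe if the same (number, id) is deleted concurrently):

create function decrement_dupes()
  returns trigger
as
$$
begin
   -- assumes a matching row already exists in duplicates
   update duplicates
      set dups_count = dups_count - 1
    where number = old.number
      and id = old.id;
   return old;
end
$$
language plpgsql;

create trigger update_dups_count_on_delete
  after delete on my_table
  for each row
  execute function decrement_dupes();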

A trigger does add some performance overhead to every insert, so you will have to test whether it is acceptable for your requirements.
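
A rough way to compare, on a test copy of the table, is to time a batch of inserts with the trigger disabled and then enabled (the generated data below is arbitrary):

\timing on

-- inserts without the trigger firing
ALTER TABLE my_table DISABLE TRIGGER update_dups_count;
INSERT INTO my_table (creationtime, number, id)
SELECT now(), g % 100, g % 50
FROM   generate_series(1, 10000) AS g;

-- inserts with the trigger firing
ALTER TABLE my_table ENABLE TRIGGER update_dups_count;
INSERT INTO my_table (creationtime, number, id)
SELECT now(), g % 100, g % 50
FROM   generate_series(1, 10000) AS g;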


  • Whenever there is scope for growth, the best way to scale is to find a way to repeat the process on incremental data only.
  • To explain this, we will call the table from the question 'Tab', with columns Number, ID, CreationTime and an index on the creationtime column.
  • The key to the incremental method is having a monotonically increasing value; here, 'creationtime' serves that purpose.

(a) Create another table, Tab_duplicate, with an additional column 'last_compute_timestamp', say:

Tab_duplicate (Number, ID, Duplicate_count, last_compute_timestamp)

(b) Create an index on column 'last_compute_timestamp'.

(c) Run the INSERT that finds the duplicate records and writes them into Tab_duplicate along with the last_compute_timestamp.

(d) For repeated execution:

  1. Install the pg_cron extension (if it is not already installed) and automate the execution of this INSERT with it (a sketch follows this list). https://github.com/citusdata/pg_cron

https://fatdba.com/2021/07/30/pg_cron-probably-the-best-way-to-schedule-jobs-within-postgresql-database/

or 2. Use a shell or Python script to execute it on the DB through the OS crontab.
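
For option 1, a minimal sketch might look like this, assuming pg_cron is installed and configured, and using the table names from the demonstration below (the job name and the 5-minute schedule are arbitrary choices; the scheduled command is the same incremental capture INSERT shown in Step 4):

CREATE EXTENSION IF NOT EXISTS pg_cron;

SELECT cron.schedule(
  'capture_duplicates',
  '*/5 * * * *',
  $$
    INSERT INTO tab_duplicate
    SELECT a.id,
           a.number,
           a.duplicate_count,
           b.last_compute_timestamp
    FROM   (SELECT id,
                   number,
                   Count(*) duplicate_count
            FROM   tab,
                   (SELECT Max(last_compute_timestamp) lct
                    FROM   tab_duplicate) max_date
            WHERE  creationtime > max_date.lct
            GROUP  BY id,
                      number) a,
           (SELECT Max(creationtime) last_compute_timestamp
            FROM   tab,
                   (SELECT Max(last_compute_timestamp) lct
                    FROM   tab_duplicate) max_date
            WHERE  creationtime > max_date.lct) b;
  $$
);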

Because last_compute_timestamp is recorded in every iteration and reused in the next, each run works only on new data and stays fast.

DEMONSTRATION:

Step 1: Production table

create table tab
(
 id int,
 number int,
 creationtime timestamp
 );
create index tab_id on tab(creationtime);

Step 2: Duplicate capture table, with a one-time priming record (it can be removed after the first execution)

create table tab_duplicate
(
 id int,
 number int,
 duplicate_count int,
 last_compute_timestamp timestamp
);
create index tab_duplicate_idx on tab_duplicate(last_compute_timestamp);
insert into tab_duplicate values(0,0,0,current_timestamp);

Step 3: Insert some duplicate entries into the production table

insert into tab values(1,10,current_timestamp);
 select pg_sleep(1);
 insert into tab values(1,10,current_timestamp);
 insert into tab values(1,10,current_timestamp);
 select pg_sleep(1);
 insert into tab values(2,20,current_timestamp);
 select pg_sleep(1);
 insert into tab values(2,20,current_timestamp);
 select pg_sleep(1);
 insert into tab values(3,30,current_timestamp);
 insert into tab values(3,30,current_timestamp);
 select pg_sleep(1);
 insert into tab values(4,40,current_timestamp);

Verify records:

postgres=# select * from tab;
 id | number |        creationtime
----+--------+----------------------------
  1 |     10 | 2022-01-23 19:00:37.238865
  1 |     10 | 2022-01-23 19:00:38.248574
  1 |     10 | 2022-01-23 19:00:38.252622
  2 |     20 | 2022-01-23 19:00:39.259584
  2 |     20 | 2022-01-23 19:00:40.26655
  3 |     30 | 2022-01-23 19:00:41.274673
  3 |     30 | 2022-01-23 19:00:41.279298
  4 |     40 | 2022-01-23 19:00:52.697257
(8 rows)

Step 4: Duplicates captured and verified.

INSERT INTO tab_duplicate
SELECT a.id,
       a.number,
       a.duplicate_count,
       b.last_compute_timestamp
FROM   -- a: count occurrences of each (id, number) among the rows newer than
       -- the previous high-water mark stored in tab_duplicate
       (SELECT id,
               number,
               Count(*) duplicate_count
        FROM   tab,
               (SELECT Max(last_compute_timestamp) lct
                FROM   tab_duplicate) max_date
        WHERE  creationtime > max_date.lct
        GROUP  BY id,
                  number) a,
       -- b: the new high-water mark, i.e. the latest creationtime among
       -- the rows processed in this run
       (SELECT Max(creationtime) last_compute_timestamp
        FROM   tab,
               (SELECT Max(last_compute_timestamp) lct
                FROM   tab_duplicate) max_date
        WHERE  creationtime > max_date.lct) b; 

Execute:

postgres=# INSERT INTO tab_duplicate
postgres-# SELECT a.id,
postgres-#        a.number,
postgres-#        a.duplicate_count,
postgres-#        b.last_compute_timestamp
postgres-# FROM   (SELECT id,
postgres(#                number,
postgres(#                Count(*) duplicate_count
postgres(#         FROM   tab,
postgres(#                (SELECT Max(last_compute_timestamp) lct
postgres(#                 FROM   tab_duplicate) max_date
postgres(#         WHERE  creationtime > max_date.lct
postgres(#         GROUP  BY id,
postgres(#                   number) a,
postgres-#        (SELECT Max(creationtime) last_compute_timestamp
postgres(#         FROM   tab,
postgres(#                (SELECT Max(last_compute_timestamp) lct
postgres(#                 FROM   tab_duplicate) max_date
postgres(#         WHERE  creationtime > max_date.lct) b;
INSERT 0 4
postgres=#

Verify:

postgres=# select * from tab_duplicate;
 id | number | duplicate_count |   last_compute_timestamp
----+--------+-----------------+----------------------------
  0 |      0 |               0 | 2022-01-23 19:00:25.779671
  3 |     30 |               2 | 2022-01-23 19:00:52.697257
  1 |     10 |               3 | 2022-01-23 19:00:52.697257
  4 |     40 |               1 | 2022-01-23 19:00:52.697257
  2 |     20 |               2 | 2022-01-23 19:00:52.697257
(5 rows)


Step 5: Insert some more duplicates into the production table

insert into tab values(5,50,current_timestamp);
 select pg_sleep(1);
 insert into tab values(5,50,current_timestamp);
 select pg_sleep(1);
 insert into tab values(5,50,current_timestamp);
 select pg_sleep(1);
 insert into tab values(6,60,current_timestamp);
 select pg_sleep(1);
 insert into tab values(6,60,current_timestamp);
 select pg_sleep(1);

Step 6: Executing the same duplicate-capture SQL again captures ONLY the incremental records in the production table.


INSERT INTO tab_duplicate
SELECT a.id,
       a.number,
       a.duplicate_count,
       b.last_compute_timestamp
FROM   (SELECT id,
               number,
               Count(*) duplicate_count
        FROM   tab,
               (SELECT Max(last_compute_timestamp) lct
                FROM   tab_duplicate) max_date
        WHERE  creationtime > max_date.lct
        GROUP  BY id,
                  number) a,
       (SELECT Max(creationtime) last_compute_timestamp
        FROM   tab,
               (SELECT Max(last_compute_timestamp) lct
                FROM   tab_duplicate) max_date
        WHERE  creationtime > max_date.lct) b; 


Execute:

postgres=# INSERT INTO tab_duplicate
postgres-# SELECT a.id,
postgres-#        a.number,
postgres-#        a.duplicate_count,
postgres-#        b.last_compute_timestamp
postgres-# FROM   (SELECT id,
postgres(#                number,
postgres(#                Count(*) duplicate_count
postgres(#         FROM   tab,
postgres(#                (SELECT Max(last_compute_timestamp) lct
postgres(#                 FROM   tab_duplicate) max_date
postgres(#         WHERE  creationtime > max_date.lct
postgres(#         GROUP  BY id,
postgres(#                   number) a,
postgres-#        (SELECT Max(creationtime) last_compute_timestamp
postgres(#         FROM   tab,
postgres(#                (SELECT Max(last_compute_timestamp) lct
postgres(#                 FROM   tab_duplicate) max_date
postgres(#         WHERE  creationtime > max_date.lct) b;
INSERT 0 2

Verify:

postgres=# select * from tab_duplicate;
 id | number | duplicate_count |   last_compute_timestamp
----+--------+-----------------+----------------------------
  0 |      0 |               0 | 2022-01-23 19:00:25.779671
  3 |     30 |               2 | 2022-01-23 19:00:52.697257
  1 |     10 |               3 | 2022-01-23 19:00:52.697257
  4 |     40 |               1 | 2022-01-23 19:00:52.697257
  2 |     20 |               2 | 2022-01-23 19:00:52.697257
  5 |     50 |               3 | 2022-01-23 19:02:37.884417
  6 |     60 |               2 | 2022-01-23 19:02:37.884417
(7 rows)

This duplicate capture will always be fast, for two reasons:

  1. It works only on the incremental data that arrived since the last scheduled run.

  2. Scanning the table to find the maximum timestamp happens on a single-column index (an index-only scan).

From execution plan:

->  Index Only Scan Backward using tab_duplicate_idx on tab_duplicate tab_duplicate_2  (cost=0.15..77.76 rows=1692 width=8)

CAVEAT: If duplicates for the same (id, number) show up across multiple runs, tab_duplicate will accumulate several rows for them. You can dedupe tab_duplicate periodically, say at the end of the day. That will always be fast, because tab_duplicate is a small, aggregated table that is offline to your application, whereas tab is your production table with the huge accumulated data.
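
A minimal consolidation sketch (run it while the scheduled capture job is not running; the temp-table name is illustrative):

BEGIN;

CREATE TEMP TABLE tab_duplicate_rollup AS
SELECT id,
       number,
       Sum(duplicate_count)        AS duplicate_count,
       -- keeping the Max preserves the high-water mark used by the incremental capture
       Max(last_compute_timestamp) AS last_compute_timestamp
FROM   tab_duplicate
GROUP  BY id,
          number;

TRUNCATE tab_duplicate;

INSERT INTO tab_duplicate
SELECT id,
       number,
       duplicate_count,
       last_compute_timestamp
FROM   tab_duplicate_rollup;

COMMIT;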

Also, a trigger on the production table is a viable solution, but it adds overhead to transactions on the production table, since trigger execution has a cost for every insert.


Two approaches come to mind:

  1. Create a secondary table with (number, id) columns. Add a trigger so that whenever a duplicate row is about to be inserted into my_table, it is also inserted into this secondary table. That way you'll have the data you need in the secondary table as soon as it comes in, and it won't take up much space unless you have a ton of these duplicates. (A sketch of this approach follows the list.)

  2. Add a new column to my_table, perhaps a timestamp, to differentiate the duplicates. Add a partial unique index on my_table over the (number, id) columns where the new column is null. Then you can change your insert to include an ON CONFLICT clause, so that when a duplicate is being inserted you set the conflicting row's timestamp to now(). When you want to search for duplicates, you can then just query using the new column.
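
A minimal sketch of option 1 (the table, function, and trigger names are illustrative, not from the question):

create table my_table_duplicates
(
  number int,
  id int
);

create function log_duplicate()
  returns trigger
as
$$
begin
   -- if a row with the same (number, id) already exists, record the duplicate
   if exists (select 1
              from my_table
              where number = new.number
                and id = new.id) then
      insert into my_table_duplicates (number, id)
      values (new.number, new.id);
   end if;
   return new;
end
$$
language plpgsql;

create trigger log_duplicate_trigger
  before insert on my_table
  for each row
  execute function log_duplicate();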