Bad performance for a large join query in PostgreSQL 8.4.4, despite enough memory to cache the tables entirely

Question

How can I make the query described in this post faster, in particular by getting PostgreSQL to use the available RAM? Note that I have tried to configure effective_cache_size and shared_buffers appropriately; see below.

Background

I regularly have to join a ~260 million row table (coreg_master) with new data as it comes in. I've partitioned the table so that each partition fits in RAM, and of course I have appropriate indexes set up too. However, when joining the partitions separately against the other (much smaller) tables, PostgreSQL does fully random I/O on the disk, due to a nested loop with an index scan on the large table. That is really slow, as we don't have a great disk setup.
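For reference, the partitions use the usual 8.4 inheritance-plus-check-constraint scheme (there is no declarative partitioning in 8.4). The exact DDL below is a reconstruction, but the constraint and index match the \d output further down:

CREATE TABLE coreg_a (
    CHECK (first_name >= 'a' AND first_name < 'b')
) INHERITS (coreg_master);

-- expression index matching the join condition used in the query below
CREATE INDEX coreg_a_name ON coreg_a (lower(first_name), lower(last_name));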

I'd expect it to use all available RAM to cache the large partitioned table instead, which, as I understand it, the Linux kernel/filesystem cache should do by itself. But it still doesn't load the table into RAM, even though it would fit. My guess is that the access pattern isn't sequential and thus doesn't trigger read-ahead/caching, but I don't really know. The query plan and configuration parameters are below.
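A full sequential read does pull a partition's heap through the OS page cache, for example:

-- 8.4 has no index-only scans, so count(*) is a seq scan over the heap,
-- and the kernel caches the pages read as a side effect
SELECT count(*) FROM coreg_a;

but that does nothing for the expression index, and there is no pg_prewarm in 8.4 to load it explicitly.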

Table Structure

This is one of the partitions of coreg_master, my large table. The partitioned tables are named coreg_a, coreg_b, etc.

\d coreg_a
                                   Table "public.coreg_a"
   Column    |       Type        |                         Modifiers                         
-------------+-------------------+-----------------------------------------------------------
 id          | integer           | not null default nextval('coreg_master_id_seq'::regclass)
 first_name  | character varying | 
 last_name   | character varying | 
 phone       | character varying | 
 city        | character varying | 
 zip         | integer           | 
 address     | character varying | 
 dob         | date              | 
 ip          | character varying | 
 source      | character varying | 
 gender      | character varying | 
 state       | character varying | 
 record_date | date              | 
 email       | character varying | 
Indexes:
    "coreg_a_name" btree (lower(first_name::text), lower(last_name::text))
Check constraints:
    "coreg_a_first_name_check" CHECK (first_name::text >= 'a'::text AND first_name::text < 'b'::text)
Inherits: coreg_master

The following is a partition of appendable_24, an example of a table being joined with coreg_master. It is partitioned in the same way as coreg_master, so in practice coreg_a is joined with appendable_24_a, and so on, one partition pair at a time.

\d appendable_24_a
       Table "public.appendable_24_a"
   Column   |       Type        | Modifiers 
------------+-------------------+-----------
 line_num   | integer           | not null
 first_name | character varying | 
 last_name  | character varying | 
 address    | character varying | 
 state      | character varying | 
 zip        | integer           | 
Indexes:
    "appendable_24_a_name_index" btree (lower(first_name::text), lower(last_name::text))
Check constraints:
    "appendable_24_a_first_name_check" CHECK (first_name::text >= 'a'::text AND first_name::text < 'b'::text)
Inherits: appendable_24

Query & EXPLAIN ANALYZE

Below is the EXPLAIN ANALYZE output for the smallest of the joins (the tables are partitioned according to the first letter of the first_name column), since it doesn't take ages to run. The query plan is the same for the joins on every partition, though, so it should be representative of the larger joins as well. (Note: I did ANALYZE the tables first. The total time was actually 20 seconds on the first run; it is faster here because the data was already cached.)

explain analyze SELECT
          coreg_x.phone,
          coreg_x.email,
          coreg_x.record_date,
          appendable_24_x.line_num
        FROM appendable_24_x INNER JOIN coreg_x ON
          lower(appendable_24_x.first_name) = lower(coreg_x.first_name) AND
          lower(appendable_24_x.last_name) = lower(coreg_x.last_name) AND
          (coreg_x.phone IS NOT NULL OR coreg_x.email IS NOT NULL) AND
          similarity(lower(appendable_24_x.address), lower(coreg_x.address)) > 0.7
      ; 
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.01..640.49 rows=875 width=39) (actual time=9.990..53.839 rows=29 loops=1)
   Join Filter: (similarity(lower((appendable_24_x.address)::text), lower((coreg_x.address)::text)) > 0.7::double precision)
   ->  Seq Scan on appendable_24_x  (cost=0.00..1.80 rows=80 width=34) (actual time=0.009..0.111 rows=80 loops=1)
   ->  Index Scan using coreg_x_name on coreg_x  (cost=0.01..7.95 rows=1 width=64) (actual time=0.024..0.137 rows=44 loops=80)
         Index Cond: ((lower((coreg_x.first_name)::text) = lower((appendable_24_x.first_name)::text)) AND (lower((coreg_x.last_name)::text) = lower((appendable_24_x.last_name)::text)))
         Filter: ((coreg_x.phone IS NOT NULL) OR (coreg_x.email IS NOT NULL))
 Total runtime: 53.950 ms
(7 rows)

Some stats, configuration parameters and other data

  • PostgreSQL Version: 8.4.4
  • OS: CentOS release 5.5 (Final)
  • Filesystem: ext3
  • Total available RAM: 8GB
  • shared_buffers = 2GB
  • effective_cache_size = 7200MB
  • full runtime configuration through show all: http://pastie.org/1159746
  • largest partitioned coreg table (coreg_j) size: ~4900MB (see the size query after this list)
  • corresponding number of rows: ~32 million
  • corresponding (first_name, last_name) index size: ~1000MB
  • second partitioned table (appendable_24_j) size: ~1800kB
  • rows in appendable_24_j: ~25,000
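
The sizes above can be measured with the standard size functions; for example (the index name coreg_j_name is an assumption, following the coreg_a_name pattern):

SELECT pg_size_pretty(pg_relation_size('coreg_j'))      AS table_size,
       pg_size_pretty(pg_relation_size('coreg_j_name')) AS index_size,
       (SELECT reltuples::bigint FROM pg_class
         WHERE relname = 'coreg_j')                     AS estimated_rows;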

Solution 1:

There's only enough RAM to cache a fraction of your DB, and, as the query plan you posted shows, the previously accessed data and portions of the index in question are indeed cached. (PostgreSQL does not cache query results.)
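If the pg_buffercache contrib module is installed (an assumption; it doesn't ship enabled by default), you can check what is actually held in shared_buffers; note that this shows nothing about the OS page cache:

-- relations holding the most 8kB buffers in shared_buffers
SELECT c.relname, count(*) AS buffers
FROM pg_class c
JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
JOIN pg_database d ON b.reldatabase = d.oid
                  AND d.datname = current_database()
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;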

A 53 ms runtime isn't too shabby, and I'm not sure that 20 seconds on uncached data means PG picked a bad query plan either; after all, the index in question alone is ~1GB. It would be interesting to see EXPLAIN ANALYZE output for a slow, uncached run, though.

You could try adjusting the planner cost settings to see if that has any effect on worst-case performance, if that's your issue.
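For example (a sketch with illustrative values, not a tested recommendation): the nested-loop index scan does random I/O, so raising random_page_cost (the 8.4 default is 4.0) tells the planner about the slow disks and may push it toward a different join strategy:

SET random_page_cost = 8.0;  -- higher penalizes random-I/O plans such as index scans
EXPLAIN ANALYZE
SELECT coreg_x.phone, coreg_x.email, coreg_x.record_date,
       appendable_24_x.line_num
FROM appendable_24_x INNER JOIN coreg_x ON
      lower(appendable_24_x.first_name) = lower(coreg_x.first_name)
  AND lower(appendable_24_x.last_name) = lower(coreg_x.last_name)
  AND (coreg_x.phone IS NOT NULL OR coreg_x.email IS NOT NULL)
  AND similarity(lower(appendable_24_x.address), lower(coreg_x.address)) > 0.7;
RESET random_page_cost;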

You might also want to increase your maintenance_work_mem a bit, even though that's not related to this query.