Finding similar strings with PostgreSQL quickly
Solution 1:
The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted. Scales terribly.
Use SET pg_trgm.similarity_threshold
and the %
operator instead. Both are provided by the pg_trgm
module. This way, a trigram GiST index can be used to great effect.
The configuration parameter pg_trgm.similarity_threshold
replaced the functions set_limit()
and show_limit()
in Postgres 9.6. The deprecated functions still work (as of Postgres 13). Also, performance of GIN and GiST indexes improved in many ways since Postgres 9.1.
Try instead:
SET pg_trgm.similarity_threshold = 0.8; -- Postgres 9.6 or later
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM names n1
JOIN names n2 ON n1.name <> n2.name
AND n1.name % n2.name
ORDER BY sim DESC;
Faster by orders of magnitude, but still slow.
pg_trgm.similarity_threshold
is a "customized" option, which can be handled like any other option. See:
- Query a parameter (postgresql.conf setting) like "max_connections"
You may want to restrict the number of possible pairs by adding preconditions (like matching first letters) before cross joining (and support that with a matching functional index). The performance of a cross join deteriorates with O(N²).
This does not work because you cannot refer to output columns in WHERE
or HAVING
clauses:
WHERE ... sim > 0.8
That's according to the SQL standard (which is handled rather loosely by certain other RDBMS). On the other hand:
ORDER BY sim DESC
Works because output columns can be used in GROUP BY
and ORDER BY
. See:
- PostgreSQL reusing computation result in select query
Test case
I ran a quick test on my old test server to verify my claims.
PostgreSQL 9.1.4. Times taken with EXPLAIN ANALYZE
(best of 5).
CREATE TEMP table t AS
SELECT some_col AS name FROM some_table LIMIT 1000; -- real life test strings
First round of tests with GIN index:
CREATE INDEX t_gin ON t USING gin(name gin_trgm_ops); -- round1: with GIN index
Second round of tests with GIST index:
DROP INDEX t_gin;
CREATE INDEX t_gist ON t USING gist(name gist_trgm_ops);
New query:
SELECT set_limit(0.8);
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM t n1
JOIN t n2 ON n1.name <> n2.name
AND n1.name % n2.name
ORDER BY sim DESC;
GIN index used, 64 hits: total runtime: 484.022 ms
GIST index used, 64 hits: total runtime: 248.772 ms
Old query:
SELECT (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
FROM t n1, t n2
WHERE n1.name != n2.name
AND similarity(n1.name, n2.name) > 0.8
ORDER BY sim DESC;
GIN index not used, 64 hits: total runtime: 6345.833 ms
GIST index not used, 64 hits: total runtime: 6335.975 ms
Otherwise identical results. Advice is good. And this is for just 1000 rows!
GIN or GiST?
GIN often provides superior read performance:
- Difference between GiST and GIN index
But not in this particular case!
This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.
- Multicolumn index on 3 fields with heterogenous data types