MySQL - SELECT WHERE field IN (subquery) - Extremely slow why?
I've got a couple of duplicates in a database that I want to inspect, so what I did to see which are duplicates, I did this:
SELECT relevant_field
FROM some_table
GROUP BY relevant_field
HAVING COUNT(*) > 1
This way, I will get all rows with relevant_field occuring more than once. This query takes milliseconds to execute.
Now, I wanted to inspect each of the duplicates, so I thought I could SELECT each row in some_table with a relevant_field in the above query, so I did like this:
SELECT *
FROM some_table
WHERE relevant_field IN
(
SELECT relevant_field
FROM some_table
GROUP BY relevant_field
HAVING COUNT(*) > 1
)
This turns out to be extreeeemely slow for some reason (it takes minutes). What exactly is going on here to make it that slow? relevant_field is indexed.
Eventually I tried creating a view "temp_view" from the first query (SELECT relevant_field FROM some_table GROUP BY relevant_field HAVING COUNT(*) > 1)
, and then making my second query like this instead:
SELECT *
FROM some_table
WHERE relevant_field IN
(
SELECT relevant_field
FROM temp_view
)
And that works just fine. MySQL does this in some milliseconds.
Any SQL experts here who can explain what's going on?
The subquery is being run for each row because it is a correlated query. One can make a correlated query into a non-correlated query by selecting everything from the subquery, like so:
SELECT * FROM
(
SELECT relevant_field
FROM some_table
GROUP BY relevant_field
HAVING COUNT(*) > 1
) AS subquery
The final query would look like this:
SELECT *
FROM some_table
WHERE relevant_field IN
(
SELECT * FROM
(
SELECT relevant_field
FROM some_table
GROUP BY relevant_field
HAVING COUNT(*) > 1
) AS subquery
)
Rewrite the query into this
SELECT st1.*, st2.relevant_field FROM sometable st1
INNER JOIN sometable st2 ON (st1.relevant_field = st2.relevant_field)
GROUP BY st1.id /* list a unique sometable field here*/
HAVING COUNT(*) > 1
I think st2.relevant_field
must be in the select, because otherwise the having
clause will give an error, but I'm not 100% sure
Never use IN
with a subquery; this is notoriously slow.
Only ever use IN
with a fixed list of values.
More tips
- If you want to make queries faster,
don't do a
SELECT *
only select the fields that you really need. - Make sure you have an index on
relevant_field
to speed up the equi-join. - Make sure to
group by
on the primary key. - If you are on InnoDB and you only select indexed fields (and things are not too complex) than MySQL will resolve your query using only the indexes, speeding things way up.
General solution for 90% of your IN (select
queries
Use this code
SELECT * FROM sometable a WHERE EXISTS (
SELECT 1 FROM sometable b
WHERE a.relevant_field = b.relevant_field
GROUP BY b.relevant_field
HAVING count(*) > 1)
SELECT st1.*
FROM some_table st1
inner join
(
SELECT relevant_field
FROM some_table
GROUP BY relevant_field
HAVING COUNT(*) > 1
)st2 on st2.relevant_field = st1.relevant_field;
I've tried your query on one of my databases, and also tried it rewritten as a join to a sub-query.
This worked a lot faster, try it!
I have reformatted your slow sql query with www.prettysql.net
SELECT *
FROM some_table
WHERE
relevant_field in
(
SELECT relevant_field
FROM some_table
GROUP BY relevant_field
HAVING COUNT ( * ) > 1
);
When using a table in both the query and the subquery, you should always alias both, like this:
SELECT *
FROM some_table as t1
WHERE
t1.relevant_field in
(
SELECT t2.relevant_field
FROM some_table as t2
GROUP BY t2.relevant_field
HAVING COUNT ( t2.relevant_field ) > 1
);
Does that help?