Speeding up row counting in MySQL

Suppose, for illustrative purposes, you are running a library using a simple MySQL "books" table with three columns:

(id, title, status)

  • id is the primary key
  • title is the title of the book
  • status could be an enum describing the book's current state (e.g. AVAILABLE, CHECKEDOUT, PROCESSING, MISSING)

A simple query to report how many books fall into each state is:

SELECT status, COUNT(*) FROM books GROUP BY status

or to specifically find how many books are available:

SELECT COUNT(*) FROM books WHERE status = "AVAILABLE"

However, once the table grows to millions of rows, these queries take several seconds to complete. Adding an index to the "status" column doesn't appear to make a difference in my experience.

Aside from periodically caching the results or explicitly updating summary info in a separate table each time a book changes state (via triggers or some other mechanism), are there any techniques for speeding up these kinds of queries? It seems that the COUNT queries end up looking at every row, and (without knowing more details) I'm a bit surprised that this information can't somehow be determined from the index.

UPDATE

Using the sample table (with an indexed "status" column) with 2 million rows, I benchmarked the GROUP BY query. Using the InnoDB storage engine, the query takes 3.0 - 3.2 seconds on my machine. Using MyISAM, the query takes 0.9 - 1.1 seconds. There was no significant difference between count(*), count(status), or count(1) in either case.

MyISAM is admittedly a bit faster, but I was curious to see if there was a way to make an equivalent query run much faster (e.g. 10-50 ms -- fast enough to be called on every webpage request for a low-traffic site) without the mental overhead of caching and triggers. It sounds like the answer is "there's no way to run the direct query quickly" which is what I expected - I just wanted to make sure I wasn't missing an easy alternative.


So the question is

are there any techniques for speeding up these kinds of queries?

Well, not really. A column-based storage engine would probably be faster with those SELECT COUNT(*) queries but it would be less performant for pretty much any other query.

Your best bet is to maintain a summary table via triggers. It doesn't have much overhead and the SELECT part will be instantaneous no matter how big the table. Here's some boilerplate code:

DELIMITER //

CREATE TRIGGER ai_books AFTER INSERT ON books
FOR EACH ROW UPDATE books_cnt SET total = total + 1 WHERE status = NEW.status
//
CREATE TRIGGER ad_books AFTER DELETE ON books
FOR EACH ROW UPDATE books_cnt SET total = total - 1 WHERE status = OLD.status;
//
CREATE TRIGGER au_books AFTER UPDATE ON books
FOR EACH ROW
BEGIN
    IF (OLD.status <> NEW.status)
    THEN
        UPDATE books_cnt SET total = total + IF(status = NEW.status, 1, -1) WHERE status IN (OLD.status, NEW.status);
    END IF;
END
//

MyISAM is actually pretty fast with count(*) the downside is that the MyISAM storage is not that reliable and best avoided where data integrity is critical.

InnoDB can be very slow to perform count(*) type queries, cause it is designed to allow for multiple concurrent views of the same data. So at any point in time, its not enough to go to the index to get the count.

From: http://www.mail-archive.com/[email protected]/msg120320.html

The database starts with 1000 records in it I start a transaction You start a transaction I delete 50 records You add 50 records I do a COUNT() and see 950 records. You do a COUNT() and see 1050 records. I commit my transaction - database now has 950 records to everyone but you. You commit your transaction - database has 1000 records again.

How InnoDB keeps up with which records are "visible" or "modifiable" with respect to any transaction is through row-level locking, transaction isolation levels, and multi-versioning. http://dev.mysql.com/doc/refman/4.1/en/innodb-transaction-model.html http://dev.mysql.com/doc/refman/4.1/en/innodb-multi-versioning.html

That is what makes counting how many records each person can see is not so straight-forward.

So bottom line is you will need to look at caching the counts somehow as opposed to going to the table if you need to get at this information frequently and fast.


from: http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html

InnoDB does not keep an internal count of rows in a table. (In practice, this would be somewhat complicated due to multi-versioning.) To process a SELECT COUNT(*) FROM t statement, InnoDB must scan an index of the table, which takes some time if the index is not entirely in the buffer pool.

The solution suggested is:

To get a fast count, you have to use a counter table you create yourself and let your application update it according to the inserts and deletes it does. SHOW TABLE STATUS also can be used if an approximate row count is sufficient.

In short: count(*) (on innoDB) will take a long time time for tables containing a large number of rows. This is by design and can't be helped.

Write your own workaround.


There was no significant difference between count(*), count(status), or count(1)

count(column) returns the number of lines where column is NOT NULL. Since 1 is NOT NULL, and status is also, presumably, NOT NULL, the database will optimize away the test and convert them all to count(*). Which, ironically, does not mean "count lines where all columns are not null" (or any other combination), it just means "count lines"...

Now, back to your question, you can't have your cake and eat it...

  • If you want an "exact" count to be available at all times, then you have to increment and decrement in real time, via triggers, which slows down your writes

  • Or you can use count(*), but this will be slow

  • Or you can settle for a rough estimate, or an out-of-date value, and use caching or other probabilistic approaches.

Generally, on values above about "a few", NO-ONE is interested in an exact real-time count. It is a red herring anyway, as by the time you read it, the value will most likely have changed.