How to select one row randomly taking into account a weight?
I have a table which looks like that:
id: primary key
content: varchar
weight: int
What I want to do is randomly select one row from this table, but taking into account the weight. For example, if I have 3 rows:
id, content, weight
1, "some content", 60
2, "other content", 40
3, "something", 100
The first row has 30% chance of being selected, the second row has 20% chance of being selected, and the third row has 50% chance of being selected.
Is there a way to do that? If I have to execute 2 or 3 queries it's not a problem.
Solution 1:
I think the simplest is actually to use the weighted reservoir sampling:
SELECT
id,
-LOG(RAND()) / weight AS priority
FROM
your_table
ORDER BY priority
LIMIT 1;
It's a great method that lets you choose M out of N elements where the probability to be chosen for each element is proportional to its weight. It works just as well when you happen to only want one element. The method is described in this article. Note that they choose the biggest values of POW(RAND(), 1/weight), which is equivalent to choosing the smallest values of -LOG(RAND()) / weight.
Solution 2:
This works in MSSQL and I am sure that it should be possible to change couple of keywords to make it work in MySQL as well (maybe even nicer):
SELECT TOP 1 t.*
FROM @Table t
INNER JOIN (SELECT t.id, sum(tt.weight) AS cum_weight
FROM @Table t
INNER JOIN @Table tt ON tt.id <= t.id
GROUP BY t.id) tc
ON tc.id = t.id,
(SELECT SUM(weight) AS total_weight FROM @Table) tt,
(SELECT RAND() AS rnd) r
WHERE r.rnd * tt.total_weight <= tc.cum_weight
ORDER BY t.id ASC
The idea is to have a cumulative weight for each row (subselect-1), then find the position of the spanned RAND() in this cumulative range.
Solution 3:
I have tried van's solution and, although it works, it is not quick.
My Solution
The way that I am solving this problem is by maintaining a separate, linked table for the weighting. The basic table structure is similar to this:
CREATE TABLE `table1` (
`id` int(11) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
`name` varchar(100),
`weight` tinyint(4) NOT NULL DEFAULT '1',
);
CREATE TABLE `table1_weight` (
`id` bigint(20) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
`table1_id` int(11) NOT NULL
);
If I have a record in table1
with a weight of 3, then I create 3 records in table1_weight
, linked to table1
via the table1_id
field. Whatever the value of weight
is in table1
, that's how many linked records I create in table1_weight
.
Testing
On a dataset with 976 records in table1
with a total weight of 2031 and hence 2031 records in table1_weight
, I ran the following two SQLs:
-
A version of van's solution
SELECT t.* FROM table1 t INNER JOIN ( SELECT t.id, SUM(tt.weight) AS cum_weight FROM table1 t INNER JOIN table1 tt ON tt.id <= t.id GROUP BY t.id) tc ON tc.id = t.id, ( SELECT SUM(weight) AS total_weight FROM table1) tt, ( SELECT RAND() AS rnd) r WHERE r.rnd * tt.total_weight <= tc.cum_weight ORDER BY t.id ASC LIMIT 1
Joining to a secondary table for the weighting
SELECT t.*
FROM table1 t
INNER JOIN table1_weight w
ON w.table1_id = t.id
ORDER BY RAND()
LIMIT 1
SQL 1 takes consistently 0.4 seconds.
SQL 2 takes between 0.01 and 0.02 seconds.
Conclusion
If the speed of selection of a random, weighted record is not an issue, then the single table SQL suggested by van is fine and doesn't have the overhead of maintaining a separate table.
If, as in my case, a short selection time is critical, then I would recommend the two table method.