Select random row from a PostgreSQL table with weighted row probabilities
Example input:
SELECT * FROM test;
 id | percent
----+---------
  1 |      50
  2 |      35
  3 |      15
(3 rows)
How would you write a query that, on average, returns the row with id=1 50% of the time, the row with id=2 35% of the time, and the row with id=3 15% of the time?
I tried something like SELECT id FROM test ORDER BY percent * random() DESC LIMIT 1, but it gives wrong results. After 10,000 runs I get a distribution like {1=6293, 2=3302, 3=405}, but I expected something close to {1=5000, 2=3500, 3=1500}.
Any ideas?
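The skew in the ORDER BY attempt can be reproduced outside the database. The sketch below is a Python simulation, not PostgreSQL code; the function name and the seed are made up for illustration. Each row draws its own uniform random number scaled by its weight, and the largest product wins. The row with weight 15 can never produce a value above 15, while the row with weight 50 exceeds 15 about 70% of the time, so the heavy row wins far more often than its weight suggests.

```python
import random
from collections import Counter

# Weights copied from the example table: id -> percent.
WEIGHTS = {1: 50, 2: 35, 3: 15}

def pick_by_scaled_random(weights, rng):
    """Mimic ORDER BY weight * random() DESC LIMIT 1 (hypothetical helper)."""
    # Every row gets an independent uniform draw scaled by its weight;
    # the row with the largest product is returned.
    return max(weights, key=lambda row_id: weights[row_id] * rng.random())

rng = random.Random(42)  # arbitrary fixed seed so repeated runs agree
runs = 10_000
counts = Counter(pick_by_scaled_random(WEIGHTS, rng) for _ in range(runs))
shares = {row_id: counts[row_id] / runs for row_id in sorted(WEIGHTS)}
print(shares)  # shares come out far from 0.50 / 0.35 / 0.15
```

Working out the integrals for these three weights gives winning probabilities of roughly 0.63, 0.33 and 0.04, which is in line with the counts reported in the question.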
This should do the trick:
WITH CTE AS (
    SELECT random() * (SELECT SUM(percent) FROM YOUR_TABLE) R
)
SELECT *
FROM (
    SELECT id, SUM(percent) OVER (ORDER BY id) S, R
    FROM YOUR_TABLE CROSS JOIN CTE
) Q
WHERE S >= R
ORDER BY id
LIMIT 1;
The sub-query Q computes a running total S of the weights, ordered by id (the R column is omitted here):
 id |  s
----+-----
  1 |  50
  2 |  85
  3 | 100
We then simply generate a random number R in the range [0, 100) and pick the first row whose running total is at or beyond that number (the WHERE clause). We use a common table expression (WITH) to ensure the random number is calculated only once.
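The same running-total logic can be sketched in Python to check the resulting distribution. This is an illustrative analogue of the query rather than a replacement for it; the function name, the row list and the seed are assumptions made for the example.

```python
import random
from collections import Counter
from itertools import accumulate

# (id, percent) pairs from the example table, already ordered by id.
ROWS = [(1, 50), (2, 35), (3, 15)]

def pick_weighted(rows, rng):
    """Return the first id whose running total S is >= one random R."""
    running_totals = list(accumulate(weight for _, weight in rows))
    # One random number for the whole pick, scaled to [0, total),
    # playing the role of R from the common table expression.
    r = rng.random() * running_totals[-1]
    for (row_id, _), s in zip(rows, running_totals):
        if s >= r:  # same test as WHERE S >= R ... LIMIT 1
            return row_id

rng = random.Random(7)  # arbitrary fixed seed so repeated runs agree
runs = 10_000
counts = Counter(pick_weighted(ROWS, rng) for _ in range(runs))
shares = {row_id: counts[row_id] / runs for row_id, _ in ROWS}
print(shares)  # should land near 0.50 / 0.35 / 0.15
```

Drawing R once per pick is the important part: if each row compared its running total against a fresh random number, the rows would no longer partition a single range.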
BTW, the SELECT SUM(percent) FROM YOUR_TABLE allows you to have any weights in percent: they don't strictly need to be percentages (i.e. add up to 100).
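To illustrate the point about arbitrary weights, the exact selection probabilities can be computed from the running totals without any sampling. The sketch below is Python, with a hypothetical helper name and a made-up second weight set (10, 7, 3) that sums to 20 instead of 100.

```python
from fractions import Fraction
from itertools import accumulate

def selection_probabilities(rows):
    """Exact chance of each id under the running-total scheme (hypothetical helper)."""
    ids = [row_id for row_id, _ in rows]
    upper_bounds = [Fraction(t) for t in accumulate(w for _, w in rows)]
    total = upper_bounds[-1]
    lower_bounds = [Fraction(0)] + upper_bounds[:-1]
    # A row is picked when R lands in its slice of [0, total),
    # so its probability is the slice width divided by the total.
    return {
        row_id: (upper - lower) / total
        for row_id, lower, upper in zip(ids, lower_bounds, upper_bounds)
    }

percent_rows = [(1, 50), (2, 35), (3, 15)]  # weights from the example table
scaled_rows = [(1, 10), (2, 7), (3, 3)]     # made-up weights summing to 20

print(selection_probabilities(percent_rows))
print(selection_probabilities(scaled_rows))
```

Both calls yield 1/2, 7/20 and 3/20, because only the ratio of each weight to the total matters.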