Best way to select random rows PostgreSQL
Question
I want a random selection of rows in PostgreSQL. I tried this:
select * from table where random() < 0.01;
But some others recommend this:
select * from table order by random() limit 1000;
I have a very large table with 500 million rows, and I want it to be fast.
Which approach is better? What are the differences? What is the best way to select random rows?
Answer
Given your specifications (plus additional info in the comments):

- You have a numeric ID column (integer numbers) with only few (or moderately few) gaps.
- Obviously no or only few write operations.
- Your ID column has to be indexed! A primary key serves nicely.
The query below does not need a sequential scan of the big table, only an index scan.
First, get estimates for the main query:
SELECT count(*) AS ct -- optional
, min(id) AS min_id
, max(id) AS max_id
, max(id) - min(id) AS id_span
FROM big;
The only possibly expensive part is the count(*) (for huge tables). Given the above specifications, you don't need it. An estimate will do just fine, available at almost no cost (detailed explanation here):
SELECT reltuples AS ct FROM pg_class WHERE oid = 'schema_name.big'::regclass;
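Note that reltuples is a planner estimate, maintained by VACUUM and ANALYZE rather than kept exact. If the table has changed substantially since statistics were last gathered, you can refresh the estimate first. A minimal sketch, assuming the table is named big as above:

```sql
-- Refresh planner statistics so reltuples is reasonably current
ANALYZE big;

-- Read the (now updated) estimate; cast to bigint for a round number
SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  oid = 'big'::regclass;
```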
As long as ct isn't much smaller than id_span, the query will outperform other approaches.
WITH params AS (
SELECT 1 AS min_id -- minimum id <= current min id
, 5100000 AS id_span -- rounded up. (max_id - min_id + buffer)
)
SELECT *
FROM (
SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
FROM params p
,generate_series(1, 1100) g -- 1000 + buffer
GROUP BY 1 -- trim duplicates
) r
JOIN big USING (id)
LIMIT 1000; -- trim surplus
- Generate random numbers in the id space. You have "few gaps", so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.
- Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).
- Join the ids to the big table. This should be very fast with the index in place.
- Finally, trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.

You can simplify this query. The CTE in the query above is just for educational purposes:
SELECT *
FROM  (
   SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
   FROM   generate_series(1, 1100) g
   ) r
JOIN   big USING (id)
LIMIT  1000;
Refine with rCTE
Especially if you are not so sure about gaps and estimates.
WITH RECURSIVE random_pick AS (
   SELECT *
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   generate_series(1, 1030)  -- 1000 + few percent - adapt to your needs
      LIMIT  1030                      -- hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss

   UNION                               -- eliminate dupe
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   random_pick r             -- plus 3 percent - adapt to your needs
      LIMIT  999                       -- less than 1000, hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss
   )
SELECT *
FROM   random_pick
LIMIT  1000;                           -- actual limit
We can work with a smaller surplus in the base query. If there are too many gaps, so that we don't find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space, or the recursion may run dry before the limit is reached - or we have to start with a buffer so large that it defies the purpose of optimizing performance.
- Duplicates are eliminated by the UNION in the rCTE.
- The outer LIMIT makes the CTE stop as soon as we have enough rows.

This query is carefully drafted to use the available index, generate actually random rows, and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.
Wrap into function
For repeated use with varying parameters:
CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
  RETURNS SETOF big AS
$func$
DECLARE
   _surplus  int := _limit * _gaps;
   _estimate int := (            -- get current estimate from system
      SELECT c.reltuples * _gaps
      FROM   pg_class c
      WHERE  c.oid = 'big'::regclass);
BEGIN
   RETURN QUERY
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   generate_series(1, _surplus) g
         LIMIT  _surplus          -- hint for query planner
         ) r (id)
      JOIN   big USING (id)       -- eliminate misses

      UNION                       -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   random_pick       -- just to make it recursive
         LIMIT  _limit            -- hint for query planner
         ) r (id)
      JOIN   big USING (id)       -- eliminate misses
      )
   SELECT *
   FROM   random_pick
   LIMIT  _limit;
END
$func$  LANGUAGE plpgsql VOLATILE ROWS 1000;
Call:
SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);
You could even make this generic to work for any table: take the name of the PK column and the table as polymorphic type and use EXECUTE ... But that's beyond the scope of this question.
IF your requirements allow identical sets for repeated calls (and we are talking about repeated calls), I would consider a materialized view. Execute the above query once and write the result to a table. Users get a quasi-random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
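A minimal sketch of that idea, reusing the simplified query from above (the view name mv_random_pick is made up):

```sql
-- Materialize one random sample: cheap to read, refreshed on demand
CREATE MATERIALIZED VIEW mv_random_pick AS
SELECT *
FROM  (
   SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
   FROM   generate_series(1, 1100) g
   ) r
JOIN   big USING (id)
LIMIT  1000;

-- Re-roll the sample at intervals or events of your choosing
REFRESH MATERIALIZED VIEW mv_random_pick;
```

Between refreshes, every caller sees the same 1000 rows, which is exactly the trade-off this approach makes for speed.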
Postgres 9.5+
Postgres 9.5 introduces the sampling method TABLESAMPLE SYSTEM (n), where n is a percentage. The manual:

The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression.

Bold emphasis mine. It's very fast, but the result is not exactly random. The manual again:

The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.

The number of rows returned can vary wildly. For our example, to get roughly 1000 rows:
SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);
Or install the additional module tsm_system_rows to get the number of requested rows exactly (if there are enough) and allow for the more convenient syntax:
SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);
See Evan's answer for details.
But that's still not exactly random.