Best way to select random rows in PostgreSQL


Question

I want a random selection of rows in PostgreSQL. I tried this:

select * from table where random() < 0.01;

But others recommend this instead:

select * from table order by random() limit 1000;

I have a very large table with 500 million rows, and I want the query to be fast.

Which approach is better? What are the differences? What is the best way to select random rows?

Recommended answer

Given your specifications (plus additional info in the comments):

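  • You have a numeric ID column (integer numbers) with only few (or moderately few) gaps.
  • Obviously there are no or few write operations.
  • Your ID column has to be indexed! A primary key serves nicely.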

The query below does not need a sequential scan of the big table, only an index scan.

First, get estimates for the main query:

SELECT count(*) AS ct              -- optional
     , min(id)  AS min_id
     , max(id)  AS max_id
     , max(id) - min(id) AS id_span
FROM   big;

The only possibly expensive part is the count(*) (for huge tables). Given the specifications above, you don't need it. An estimate will do just fine and is available at almost no cost:

SELECT reltuples AS ct FROM pg_class WHERE oid = 'schema_name.big'::regclass;
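
To gauge how densely the ID space is populated, the row estimate can be set against the ID span. The following is only a sketch: the column aliases ct_estimate and density are illustrative names, not part of the original answer, and the result is only as fresh as the table's last VACUUM or ANALYZE.

```sql
-- Sketch: compare the planner's row estimate with the span of ids.
-- A density close to 1 means few gaps; a small value means many gaps.
-- Note: reltuples is an estimate and can be stale (or -1 on newer
-- PostgreSQL versions if the table has never been analyzed).
SELECT c.reltuples::bigint AS ct_estimate
     , s.id_span
     , round(c.reltuples::numeric / NULLIF(s.id_span, 0), 3) AS density
FROM   pg_class c
CROSS  JOIN (SELECT max(id) - min(id) AS id_span FROM big) s
WHERE  c.oid = 'schema_name.big'::regclass;
```

The min(id) and max(id) lookups can normally be answered from the index on id, so this check should stay cheap even on a very large table.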

As long as ct isn't much smaller than id_span, the query will outperform other approaches.

WITH params AS (
    SELECT 1       AS min_id           -- minimum id <= current min id
         , 5100000 AS id_span          -- rounded up. (max_id - min_id + buffer)
    )
SELECT *
FROM  (
    SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
    FROM   params p
          ,generate_series(1, 1100) g  -- 1000 + buffer
    GROUP  BY 1                        -- trim duplicates
    ) r
JOIN   big USING (id)
LIMIT  1000;                           -- trim surplus
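
The hard-coded parameters above could also be derived from the table at query time. This is a minimal sketch under the same assumptions as the answer (an indexed integer id column with few gaps), not the answer's own formulation. The 1100 figure is the same 10 % buffer used above and may need to be larger if the table has more gaps.

```sql
-- Sketch: compute min_id and id_span from the table instead of hard-coding them.
WITH params AS (
    SELECT min(id)               AS min_id
         , max(id) - min(id) + 1 AS id_span   -- +1 so that max(id) itself can be drawn
    FROM   big
    )
SELECT *
FROM  (
    SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
    FROM   params p
    CROSS  JOIN generate_series(1, 1100) g    -- 1000 + buffer for gaps and duplicates
    GROUP  BY 1                               -- trim duplicates
    ) r
JOIN   big USING (id)                         -- ids that fall into gaps drop out here
LIMIT  1000;                                  -- trim surplus
```

If the buffer turns out to be too small, the query simply returns fewer than 1000 rows, so it is worth checking the row count of the result.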
