Best way to select random rows PostgreSQL

Question

I want a random selection of rows in PostgreSQL. I tried this:

select * from table where random() < 0.01;

But others recommend this:

select * from table order by random() limit 1000;

I have a very large table with 500 million rows, and I want it to be fast.

Which approach is better? What are the differences? What is the best way to select random rows?

Recommended answer

Given your specifications (plus additional info in the comments):

  • You have a numeric ID column (integer numbers) with only few (or moderately few) gaps.
  • Obviously no or few write operations.
  • Your ID column has to be indexed! A primary key serves nicely.
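For concreteness, here is a hypothetical table definition matching these assumptions; the table name big and the column id are what all examples below assume (the payload column is purely illustrative, not from the original question):

-- Hypothetical table matching the assumptions above;
-- the PRIMARY KEY provides the required index automatically.
CREATE TABLE big (
   id      integer PRIMARY KEY
 , payload text      -- stand-in for the real columns
);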

The query below does not need a sequential scan of the big table, only an index scan.

First, get estimates for the main query:

SELECT count(*) AS ct              -- optional
     , min(id)  AS min_id
     , max(id)  AS max_id
     , max(id) - min(id) AS id_span
FROM   big;

The only possibly expensive part is the count(*) (for huge tables). Given the specifications above, you don't need it. An estimate will do just fine and is available at almost no cost:

SELECT reltuples AS ct FROM pg_class
WHERE oid = 'schema_name.big'::regclass;

As long as ct isn't much smaller than id_span, the query will outperform other approaches. (For example, with roughly 5 million rows spread over an id span of 5.1 million, only about 2 % of generated ids miss a row, so a 10 % surplus of generated numbers covers the gaps comfortably.)

WITH params AS (
   SELECT 1       AS min_id           -- minimum id <= current min id
        , 5100000 AS id_span          -- rounded up. (max_id - min_id + buffer)
    )
SELECT *
FROM  (
   SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
   FROM   params p
         ,generate_series(1, 1100) g  -- 1000 + buffer
   GROUP  BY 1                        -- trim duplicates
) r
JOIN   big USING (id)
LIMIT  1000;                          -- trim surplus

  • Generate random numbers in the id space. You have "few gaps", so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.
  • Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).
  • Join the ids to the big table. This should be very fast with the index in place.
  • Finally trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.

You can simplify this query. The CTE in the query above is just for educational purposes:

SELECT *
FROM  (
   SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
   FROM   generate_series(1, 1100) g
   ) r
JOIN   big USING (id)
LIMIT  1000;

Refine with rCTE

Especially if you are not so sure about gaps and estimates.

WITH RECURSIVE random_pick AS (
   SELECT *
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   generate_series(1, 1030)  -- 1000 + few percent - adapt to your needs
      LIMIT  1030                      -- hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss

   UNION                               -- eliminate dupe
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   random_pick r             -- plus 3 percent - adapt to your needs
      LIMIT  999                       -- less than 1000, hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss
   )
TABLE  random_pick
LIMIT  1000;  -- actual limit

We can work with a smaller surplus in the base query. If there are too many gaps so we don't find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space or the recursion may run dry before the limit is reached - or we have to start with a large enough buffer which defies the purpose of optimizing performance.

Duplicates are eliminated by the UNION in the rCTE.

The outer LIMIT makes the CTE stop as soon as we have enough rows.

This query is carefully drafted to use the available index, generate actually random rows and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.

Wrap it into a function to reuse with different parameters:

CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
  RETURNS SETOF big
  LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
   _surplus  int := _limit * _gaps;
   _estimate int := (           -- get current estimate from system
      SELECT c.reltuples * _gaps
      FROM   pg_class c
      WHERE  c.oid = 'big'::regclass);
BEGIN
   RETURN QUERY
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   generate_series(1, _surplus) g
         LIMIT  _surplus           -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses

      UNION                        -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   random_pick        -- just to make it recursive
         LIMIT  _limit             -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses
   )
   TABLE  random_pick
   LIMIT  _limit;
END
$func$;

Call:

SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);

You could even make this generic to work for any table: take the name of the PK column and the table as polymorphic type and use EXECUTE ... But that's beyond the scope of this question.

If your requirements allow identical sets for repeated calls (and we are talking about repeated calls), I would consider a materialized view. Execute the above query once and write the result to a table. Users get a quasi random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
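A minimal sketch of that idea, assuming the f_random_sample() function from above (the view name is made up):

-- Materialize one quasi-random pick; reads are then trivially fast.
CREATE MATERIALIZED VIEW mv_random_pick AS
SELECT * FROM f_random_sample(1000);

-- Later, whenever a fresh quasi-random set is wanted:
REFRESH MATERIALIZED VIEW mv_random_pick;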

A possible alternative for Postgres 9.5+ is TABLESAMPLE SYSTEM (n), where n is a percentage. The manual:

The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression.

It's very fast, but the result is not exactly random. The manual again:

The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.
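As a quick illustration of the difference, both of the following sample roughly 1 % of big (the percentage here is arbitrary):

SELECT * FROM big TABLESAMPLE BERNOULLI (1);  -- inspects every block, picks individual rows
SELECT * FROM big TABLESAMPLE SYSTEM (1);     -- picks whole blocks: faster, but clustered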

The number of rows returned can vary wildly. For our example, to get roughly 1000 rows:

SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);
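Relatedly, if you need the same sample back on repeated calls, the standard REPEATABLE clause fixes the seed; a small sketch (the seed value 42 is arbitrary):

SELECT * FROM big TABLESAMPLE SYSTEM (0.02) REPEATABLE (42);  -- same rows on every call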

Or install the additional module tsm_system_rows to get the number of requested rows exactly (if there are enough) and allow for the more convenient syntax:

SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);
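tsm_system_rows ships with the standard Postgres contrib distribution, so enabling it should be a one-liner (assuming the contrib package is installed):

CREATE EXTENSION tsm_system_rows;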

See Evan's answer for details.

But that's still not exactly random.
