POSTGRES select n equally distributed rows by time over millions of records


Question

I have a table with columns id,filter1,filter2,time,value which contains millions of records. I want to fetch n equally distributed rows between two timestamps. If the number of records between timestamps is less than n I want to fetch all the records.

Let's say n = 200:

SELECT s.* FROM (
    SELECT t.time, t.value,
           ROW_NUMBER() OVER (ORDER BY t.time) AS rnk,
           COUNT(*) OVER () AS total_cnt
    FROM table_name t
    WHERE t.filter1 = 'filter_value'
    AND   t.filter2 = 'another_value'
    AND   t.time BETWEEN '2020-04-18' AND '2020-04-19') s
WHERE MOD(s.rnk, (total_cnt/200)) = 0;

I have an index on 'filter1,filter2,time'. Still this query is extremely slow when there is around 10 million records.

I have also tried TABLESAMPLE but I couldn't come up with an appropriate condition for percentage which is fast enough and also returns all rows when the number of rows is less.
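
To see why a single TABLESAMPLE percentage is hard to get right, consider the arithmetic (a plain-Python sketch with hypothetical numbers; `sample_percent` is an illustration, not a library function):

```python
def sample_percent(target_n, est_table_rows):
    """Percentage for TABLESAMPLE so that roughly target_n physical rows
    are sampled. Note: TABLESAMPLE samples rows *before* the WHERE clause
    filters them, which is the root of the problem."""
    return 100.0 * target_n / est_table_rows

# To sample ~200 rows out of a 10-million-row table:
p = sample_percent(200, 10_000_000)  # 0.002 (percent)

# But if only 150 rows actually match the filters, a 0.002 % sample of the
# whole table is expected to hit almost none of them, instead of returning
# all 150 as required:
expected_matches = 150 * p / 100     # ~ 0.003 matching rows expected
```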

Answer

If ...

  • ... you have no additional meta information about logical or physical data distribution
  • ... and need the selection equally distributed over time

... then your original query is basically as good as it gets. You have the index on (filter1,filter2,time) like Gordon suggested. It helps (a lot) if less than a few percent pass the filters. We then have to count and number all qualifying rows (the expensive part for many qualifying rows) to get a strictly uniform distribution in the sample.

Some minor suggestions:

SELECT s.*
FROM  (
   SELECT t.time, t.value
        , row_number() OVER (ORDER BY t.time) AS rn  -- ①
        , count(*) OVER() AS total_cnt
   FROM   table_name t
   WHERE  t.filter1 = 'filter_value' 
   AND    t.filter2 = 'another_value' 
   AND    t.time >= '2020-04-18'  -- assuming data type timestamp!
   AND    t.time <  '2020-04-20'  -- ②
   ) s
WHERE  mod(s.rn, total_cnt/n) = total_cnt/n/2 + 1;  -- ③ (n = desired sample size, e.g. 200)

① Use the column alias rn (or whatever) for row_number(); rnk would hint at rank().

② Assuming the column "time" is data type timestamp as neither date nor time would make sense. ("time" seems misleading.) So this predicate is most likely wrong:

t.time between '2020-04-18' AND '2020-04-19'

The given date literals are coerced to timestamps 2020-04-18 0:0 / 2020-04-19 0:0. Since BETWEEN includes lower and upper bound the filter effectively selects all of 2020-04-18 plus the first instant of 2020-04-19. Hardly ever makes sense. My suggested fix includes all of 2020-04-18 and 2020-04-19.

If the column "time" is data type timestamptz, then the above basically applies as well. Plus, you add a dependency on the time zone setting of the database session into the mix. Don't!

③ Your original condition MOD(s.rnk,(total_cnt/n)) = 0 picks every total_cnt/n-th row, always skipping the first total_cnt/n - 1 rows, which creates a bias for later rows. To illustrate:

ooooXooooXooooXooooX

My alternative shifts the selection towards the center, which seems more reasonable:

ooXooooXooooXooooXoo

Integer division might produce 0. Adding 1 (total_cnt/n/2 + 1) prevents that from happening. Plus it's more in the "center" in any case.
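
Both selection patterns can be reproduced with a few lines of Python (a sketch; `rn` is 1-based like `row_number()`, and `pick` is a hypothetical helper):

```python
def pick(total_cnt, n, centered):
    """Render the o/X pattern of mod-based sampling over total_cnt rows."""
    k = total_cnt // n                       # integer division, as in SQL
    target = k // 2 + 1 if centered else 0   # centered variant vs. original
    return ''.join('X' if rn % k == target else 'o'
                   for rn in range(1, total_cnt + 1))

print(pick(20, 4, centered=False))  # ooooXooooXooooXooooX  (bias toward later rows)
print(pick(20, 4, centered=True))   # ooXooooXooooXooooXoo  (shifted to the center)
```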

Finally, it should be mentioned that the result for equal values in time is arbitrary. You might want to define a tiebreaker if that matters ...

That said, we might be able to use any meta information about data distribution to our advantage. Or if we can relax requirements for strictly uniform distribution in the sample (to what degree?).


If we can assume uniform data distribution over time for all (or some) combinations of (filter1, filter2) we can just partition the time interval and get away with n very cheap index (only) scans. (Or if we don't care about uniform data distribution too much, we might do it anyway.) To illustrate:

WITH input (f1    , f2    , lo                    , hi                    , n) AS (
   VALUES  ('val2', 'val2', timestamp '2020-04-18', timestamp '2020-04-20', 200)
   )
SELECT g.lo, s.*
FROM   (SELECT *, (hi - lo) / n AS span FROM input) i
CROSS  JOIN generate_series(lo, hi - span, span) g(lo)
LEFT   JOIN LATERAL (   
   SELECT t.time, t.value
   FROM   table_name t
   WHERE  t.filter1 = i.f1
   AND    t.filter2 = i.f2
   AND    t.time >= g.lo
   AND    t.time <  g.lo + span
   ORDER  BY time
   LIMIT  1
   ) s ON true;

This is just a proof of concept, which can be tweaked in a hundred and one ways. There is a lot going on in this query and not enough information about the case to streamline.

The main objective is to avoid processing all rows, and only fetch the ones to return.

The query starts at the lower bound, producing a selection pattern like:

XooooXooooXooooXoooo

The LEFT JOIN keeps empty time slices in the result, which indicate non-uniform data distribution.
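
The slicing logic can be sketched in plain Python over integer "timestamps" (hypothetical data; `first_per_slice` is an illustration, while the SQL version does one cheap index probe per slice instead):

```python
def first_per_slice(times, lo, hi, n):
    """Pick the first value in each of n equal slices of [lo, hi).
    None marks an empty slice, like the LEFT JOIN rows in the query."""
    span = (hi - lo) / n
    ordered = sorted(times)
    return [next((t for t in ordered
                  if lo + i * span <= t < lo + (i + 1) * span), None)
            for i in range(n)]

print(first_per_slice(list(range(20)), 0, 20, 4))  # [0, 5, 10, 15]
print(first_per_slice([1, 2, 3, 16], 0, 20, 4))    # [1, None, None, 16] - empty slices show up
```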

Any kind of meta information on table design, data distribution, write patterns etc. might be used to optimize further. Indexing might be optimized: index-only scans, partial indexes, ...
