连续重复/重复的有序计数 [英] Ordered count of consecutive repeats / duplicates

查看:86
本文介绍了连续重复/重复的有序计数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我非常怀疑我是否以最有效的方式执行此操作,这就是为什么我在此处标记了plpgsql的原因.对于千个测量系统,我需要在 20亿行上运行.

I highly doubt I'm doing this in the most efficient manner, which is why I tagged plpgsql on here. I need to run this on 2 billion rows for a thousand measurement systems.

您拥有一些测量系统,当它们失去连接时,它们通常会报告以前的值,并且有时会(但有时很长)会突然断开连接.您需要汇总,但是这样做时,您需要查看其重复的时间,并根据该信息进行各种筛选.假设您正在测量汽车的mpg,但它在20 mpg的情况下停留一个小时,然后移动到20.1,依此类推.您将需要评估卡住时的准确性.您还可以放置一些替代规则,以查找汽车何时在高速公路上,并使用窗口功能来生成汽车的状态"并进行分组.事不宜迟:

You have measurement systems that often report the previous value when they lose connectivity, and they lose connectivity for spurts often but sometimes for a long time. You need to aggregate but when you do so, you need to look at how long it was repeating and make various filters based on that information. Say you are measuring mpg on a car but it's stuck at 20 mpg for an hour than moves around to 20.1 and so on. You'll want to evaluate the accuracy when it's stuck. You could also place some alternative rules that look for when the car is on the highway, and with window functions you can generate the 'state' of the car and have something to group on. Without further ado:

--here's my data, you have different systems, the time of measurement, and the actual measurement
--as well, the raw data has whether or not it's a repeat (hense the included window function
select * into temporary table cumulative_repeat_calculator_data
FROM
    (
    select 
    system_measured, time_of_measurement, measurement, 
    case when 
     measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc) 
     then 1 else 0 end as repeat
    FROM
    (
    SELECT 5 as measurement, 1 as time_of_measurement, 1 as system_measured
    UNION
    SELECT 150 as measurement, 2 as time_of_measurement, 1 as system_measured
    UNION
    SELECT 5 as measurement, 3 as time_of_measurement, 1 as system_measured
    UNION
    SELECT 5 as measurement, 4 as time_of_measurement, 1 as system_measured
    UNION
    SELECT 5 as measurement, 1 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 2 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 3 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 4 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 150 as measurement, 5 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 6 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 7 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 8 as time_of_measurement, 2 as system_measured
    ) as data
) as data;

--unfortunately you can't have window functions within window functions, so I had to break it down into subquery
--what we need is something to partion on, the 'state' of the system if you will, so I ran a running total of the nonrepeats
--this creates a row that stays the same when your data is repeating - aka something you can partition/group on
select * into temporary table cumulative_repeat_calculator_step_1
FROM
    (
    select 
    *,
    sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system
    from cumulative_repeat_calculator_data
    order by system_measured, time_of_measurement
) as data;

--finally, the query. I didn't bother showing my desired output, because this (finally) got it
--I wanted a sequential count of repeats that restarts when it stops repeating, and starts with the first repeat
--what you can do now is take the average measurement under some condition based on how long it was repeating, for example  
select *, 
case when repeat = 0 then 0
else
row_number() over (partition by cumlative_sum_of_nonrepeats_by_system, system_measured order by time_of_measurement) - 1
end as ordered_repeat
from cumulative_repeat_calculator_step_1
order by system_measured, time_of_measurement

那么,为了在一个巨大的表上运行它,您将采取什么不同的操作,或者您将使用哪些替代工具?我在考虑plpgsql,因为我怀疑这需要在数据库中完成,也可以在数据插入过程中完成,尽管我通常在加载数据后使用它.有什么方法可以一目了然地实现,而无需诉诸子查询?

So, what would you do differently in order to run this on a huge table, or what alternative tools would you use? I'm thinking plpgsql because I suspect this needs to done in-database, or during the data insertion process, although I generally work with the data after it's loaded. Is there any way to get this in one sweep without resorting to sub-queries?

我已经测试了一种替代方法,但是它仍然依赖于子查询,并且我认为这样做更快.对于该方法,您将使用start_timestamp,end_timestamp,system创建一个开始和停止"表.然后,您连接到较大的表,并且如果时间戳介于两者之间,则将其分类为处于该状态,这实际上是cumlative_sum_of_nonrepeats_by_system的替代方法.但是,当您执行此操作时,成千上万个设备和成千上万个事件"将以1 = 1的比例加入.您认为这是更好的方法吗?

I have tested one alternative method, but it still relies on a sub-query and I think this is faster. For that method you create a "starts and stops" table with start_timestamp, end_timestamp, system. Then you join to the larger table and if the timestamp is between those, you classify it as being in that state, which is essentially an alternative to cumlative_sum_of_nonrepeats_by_system. But when you do this, you join on 1=1 for thousands of devices and thousands or millions of 'events'. Do you think that's a better way to go?

推荐答案

测试用例

首先,在 sqlfiddle ,可以使用:

Test case

First, a more useful way to present your data - or even better, in an sqlfiddle, ready to play with:

CREATE TEMP TABLE data(
   system_measured int
 , time_of_measurement int
 , measurement int
);

INSERT INTO data VALUES
 (1, 1, 5)
,(1, 2, 150)
,(1, 3, 5)
,(1, 4, 5)
,(2, 1, 5)
,(2, 2, 5)
,(2, 3, 5)
,(2, 4, 5)
,(2, 5, 150)
,(2, 6, 5)
,(2, 7, 5)
,(2, 8, 5);

简化查询

由于尚不清楚,因此我仅假设上述内容.
接下来,我简化了查询,以得出:

Simplified query

Since it remains unclear, I am assuming only the above as given.
Next, I simplified your query to arrive at:

WITH x AS (
   SELECT *, CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
                               ORDER BY time_of_measurement) = measurement
                  THEN 0 ELSE 1 END AS step
   FROM   data
   )
   , y AS (
   SELECT *, sum(step) OVER(PARTITION BY system_measured
                            ORDER BY time_of_measurement) AS grp
   FROM   x
   )
SELECT * ,row_number() OVER (PARTITION BY system_measured, grp
                             ORDER BY time_of_measurement) - 1 AS repeat_ct
FROM   y
ORDER  BY system_measured, time_of_measurement;

现在,虽然使用纯SQL很好,但使用plpgsql函数会更快 ,因为它可以在单表扫描中完成,而该表查询至少需要此查询三遍扫描.

Now, while it is all nice and shiny to use pure SQL, this will be much faster with a plpgsql function, because it can do it in a single table scan where this query needs at least three scans.

CREATE OR REPLACE FUNCTION x.f_repeat_ct()
  RETURNS TABLE (
    system_measured int
  , time_of_measurement int
  , measurement int, repeat_ct int
  )  LANGUAGE plpgsql AS
$func$
DECLARE
   r    data;     -- table name serves as record type
   r0   data;
BEGIN

-- SET LOCAL work_mem = '1000 MB';  -- uncomment an adapt if needed, see below!

repeat_ct := 0;   -- init

FOR r IN
   SELECT * FROM data d ORDER BY d.system_measured, d.time_of_measurement
LOOP
   IF  r.system_measured = r0.system_measured
       AND r.measurement = r0.measurement THEN
      repeat_ct := repeat_ct + 1;   -- start new array
   ELSE
      repeat_ct := 0;               -- start new count
   END IF;

   RETURN QUERY SELECT r.*, repeat_ct;

   r0 := r;                         -- remember last row
END LOOP;

END
$func$;

致电:

SELECT * FROM x.f_repeat_ct();

请务必在此类plpgsql函数中始终对表名进行表限定,因为我们使用与输出参数相同的名称,如果不限定则使用优先级.

Be sure to table-qualify your column names at all times in this kind of plpgsql function, because we use the same names as output parameters which would take precedence if not qualified.

如果您有 十亿个行,则可能需要拆分此操作.我在此处

注:RETURN NEXTRETURN QUERY的当前实现 在从函数返回之前存储整个结果集,如下所示: 上面讨论过.这意味着如果PL/pgSQL函数产生一个 结果集非常大,性能可能很差:将写入数据 到磁盘以避免内存耗尽,但是函数本身不会 返回,直到生成了整个结果集.未来 PL/pgSQL版本可能允许用户定义返回集 没有此限制的功能.目前,重点是 哪些数据开始写入磁盘由

Note: The current implementation of RETURN NEXT and RETURN QUERY stores the entire result set before returning from the function, as discussed above. That means that if a PL/pgSQL function produces a very large result set, performance might be poor: data will be written to disk to avoid memory exhaustion, but the function itself will not return until the entire result set has been generated. A future version of PL/pgSQL might allow users to define set-returning functions that do not have this limitation. Currently, the point at which data begins being written to disk is controlled by the work_mem configuration variable. Administrators who have sufficient memory to store larger result sets in memory should consider increasing this parameter.

考虑一次为一个系统计算行,或者为work_mem设置足够高的值以应付负载.请点击报价单中提供的有关work_mem的更多链接.

Consider computing rows for one system at a time or set a high enough value for work_mem to cope with the load. Follow the link provided in the quote on more about work_mem.

一种方法是使用

One way would be to set a very high value for work_mem with SET LOCAL in your function, which is only effective for for the current transaction. I added a commented line in the function. Do not set it very high globally, as this could nuke your server. Read the manual.

这篇关于连续重复/重复的有序计数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆