Detect and delete gaps in time series
Question
I have daily time series for different companies in my dataset and work with PostgreSQL. My goal is to exclude companies whose time series are too incomplete. Therefore I want to exclude all companies which have 3 or more consecutive missing values. Furthermore, I want to exclude all companies which have more than 50% missing values between their first and final date in the dataset.
We can work with the following example data:
date company value
2012-01-01 A 5
2012-01-01 B 2
2012-01-02 A NULL
2012-01-02 B 2
2012-01-02 C 4
2012-01-03 A NULL
2012-01-03 B NULL
2012-01-03 C NULL
2012-01-04 A NULL
2012-01-04 B NULL
2012-01-04 C NULL
2012-01-05 A 8
2012-01-05 B 9
2012-01-05 C 3
2012-01-06 A 8
2012-01-06 B 9
2012-01-06 C NULL
So A has to be excluded because it has a gap of three consecutive missing values, and C because more than 50% of its values are missing between its first and final date.
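To reproduce the example, the sample data above could be loaded roughly as follows. This is only a sketch: the column types are assumptions, since the post does not show the table definition.

```sql
-- Column types are assumptions; adjust them to the real schema.
CREATE TABLE mytable (
    date    date NOT NULL,
    company text NOT NULL,
    value   integer            -- NULL marks a missing observation
);

INSERT INTO mytable (date, company, value) VALUES
    ('2012-01-01', 'A', 5),
    ('2012-01-01', 'B', 2),
    ('2012-01-02', 'A', NULL),
    ('2012-01-02', 'B', 2),
    ('2012-01-02', 'C', 4),
    ('2012-01-03', 'A', NULL),
    ('2012-01-03', 'B', NULL),
    ('2012-01-03', 'C', NULL),
    ('2012-01-04', 'A', NULL),
    ('2012-01-04', 'B', NULL),
    ('2012-01-04', 'C', NULL),
    ('2012-01-05', 'A', 8),
    ('2012-01-05', 'B', 9),
    ('2012-01-05', 'C', 3),
    ('2012-01-06', 'A', 8),
    ('2012-01-06', 'B', 9),
    ('2012-01-06', 'C', NULL);
```

With this data, C has 3 missing values out of 5 rows (60%), while A has 3 out of 6 (exactly 50%) and is only caught by the consecutive-gap rule.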
Combining other answers in this forum, I came up with the following code:
Add an autoincrement primary key to identify each row
CREATE TABLE test AS SELECT * FROM mytable ORDER BY company, date;
CREATE SEQUENCE id_seq; ALTER TABLE test ADD id INT UNIQUE;
ALTER TABLE test ALTER COLUMN id SET DEFAULT NEXTVAL('id_seq');
UPDATE test SET id = NEXTVAL('id_seq');
ALTER TABLE test ADD PRIMARY KEY (id);
Detect the gaps in the time series
CREATE TABLE to_del AS WITH count3 AS
( SELECT *,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY id
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
AS cnt FROM test)
SELECT company, id FROM count3 WHERE cnt >= 3;
Delete the gaps from mytable
DELETE FROM mytable WHERE company in (SELECT DISTINCT company FROM to_del);
This seems to detect and delete companies with gaps of 3 or more consecutive missing values. But the approach is very cumbersome, and I can't figure out how to additionally exclude all companies with more than 50% missing values.
Can you think of a more effective solution than mine (I am just learning to work with PostgreSQL) that also manages to exclude companies with more than 50% missing values?
Recommended answer
I would create just one query:
DELETE FROM mytable
WHERE company in (
SELECT Company
FROM (
SELECT Company,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY id
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company)
/
COUNT(*)
OVER (PARTITION BY company) As p50
) alias
WHERE cnt >= 3 OR p50 > 0.5
)
A composite index on the (company, value) columns may help speed up this query.
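As a minimal sketch of that suggestion, the index could be created like this. The index name is hypothetical, and whether the planner actually uses the index depends on the data, so it is worth checking the plan with EXPLAIN.

```sql
-- Index name is hypothetical; the column pair is the one suggested above.
CREATE INDEX mytable_company_value_idx ON mytable (company, value);
```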
EDIT
The above query doesn't work.
I've corrected it slightly; here is a demo: http://sqlfiddle.com/#!15/c9bfe/7
Two things have been changed:
- PARTITION BY company ORDER BY date instead of ORDER BY id
- an explicit cast to numeric (because the integer division was truncated to 0):
OVER (PARTITION BY company)::numeric
SELECT company, cnt, p50
FROM (
SELECT company,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY date
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY company)::numeric
/
COUNT(*)
OVER (PARTITION BY company) As p50
FROM mytable
) alias
-- WHERE cnt >= 3 OR p50 > 0.5
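The reason for the numeric cast can be seen in isolation. In PostgreSQL, dividing one integer by another discards the fractional part, so a ratio such as 3 missing values out of 5 rows comes out as 0 unless one operand is cast first. A small illustration, using the counts for company C from the example data:

```sql
-- Integer / integer truncates; casting one operand keeps the fraction.
SELECT 3 / 5          AS integer_division,  -- 0
       3::numeric / 5 AS numeric_division;  -- 0.6 (shown with trailing zeros)
```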
Now the DELETE query should work:
DELETE FROM mytable
WHERE company in (
SELECT company
FROM (
SELECT company,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY date
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY company)::numeric
/
COUNT(*)
OVER (PARTITION BY company) As p50
FROM mytable
) alias
WHERE cnt >= 3 OR p50 > 0.5
)
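Before running a DELETE like this, it can be worth previewing which companies it would remove. The following is a sketch that reuses the same two window expressions and collapses them to one row per company. The subquery alias and output column names are made up for illustration, and the cast is wrapped in parentheses only for readability.

```sql
-- Preview: one row per company that meets either exclusion rule.
SELECT company,
       MAX(cnt) AS max_nulls_in_window,  -- most NULLs in any 3-row window
       MAX(p50) AS null_share            -- share of NULL values overall
FROM (
    SELECT company,
           COUNT(CASE WHEN value IS NULL THEN 1 END)
               OVER (PARTITION BY company ORDER BY date
                     ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS cnt,
           (SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END)
               OVER (PARTITION BY company))::numeric
           / COUNT(*) OVER (PARTITION BY company) AS p50
    FROM mytable
) per_row
GROUP BY company
HAVING MAX(cnt) >= 3 OR MAX(p50) > 0.5
ORDER BY company;
```

On the example data this should list A (three NULLs in a row) and C (3 of 5 values missing), and leave out B. Note that the window counts rows, not calendar days, so it assumes every date has a row for the company, as in the example.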