Fill the table with data for missing dates (PostgreSQL, Redshift)
Question
I'm trying to fill in daily data for missing dates and cannot find an answer. Please help.
Example of my daily_table:
url | timestamp_gmt | visitors | hits | other..
-------------------+---------------+----------+-------+-------
www.domain.com/1 | 2016-04-12 | 1231 | 23423 |
www.domain.com/1 | 2016-04-13 | 1374 | 26482 |
www.domain.com/1 | 2016-04-17 | 1262 | 21493 |
www.domain.com/2 | 2016-05-09 | 2345 | 35471 |
Expected result: I want to fill this table with data for every domain and every day, simply copying the data from the previous available date:
url | timestamp_gmt | visitors | hits | other..
-------------------+---------------+----------+-------+-------
www.domain.com/1 | 2016-04-12 | 1231 | 23423 |
www.domain.com/1 | 2016-04-13 | 1374 | 26482 |
www.domain.com/1 | 2016-04-14 | 1374 | 26482 | <-added
www.domain.com/1 | 2016-04-15 | 1374 | 26482 | <-added
www.domain.com/1 | 2016-04-16 | 1374 | 26482 | <-added
www.domain.com/1 | 2016-04-17 | 1262 | 21493 |
www.domain.com/2 | 2016-05-09 | 2345 | 35471 |
I could move part of the logic into PHP, but that is undesirable because my table has billions of missing dates.
Summary:
Over the last few days I found out that:
- Amazon Redshift is based on PostgreSQL version 8, which is why it does not support such a nice command as JOIN LATERAL
- Redshift also does not support generate_series and CTEs
- But it does support a simple WITH (thank you @systemjack); WITH RECURSIVE it does not
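For comparison, here is a minimal sketch of how the same gap filling might look on plain PostgreSQL 9.3 or later, where generate_series and JOIN LATERAL are available. It is not expected to run on Redshift. It assumes that timestamp_gmt is of type DATE and only carries the visitors and hits columns; the aliases r, d and src are made up for illustration.

```sql
-- Plain PostgreSQL 9.3+ only; assumes daily_table.timestamp_gmt is a DATE.
SELECT r.url,
       d.day::date AS timestamp_gmt,
       src.visitors,
       src.hits
FROM (
    -- first and last known date per URL
    SELECT url,
           MIN(timestamp_gmt) AS min_date,
           MAX(timestamp_gmt) AS max_date
    FROM daily_table
    GROUP BY url
) AS r
-- one row per calendar day between the first and last known date
CROSS JOIN LATERAL generate_series(r.min_date::timestamp,
                                   r.max_date::timestamp,
                                   interval '1 day') AS d(day)
-- latest existing row on or before the generated day
CROSS JOIN LATERAL (
    SELECT t.visitors, t.hits
    FROM daily_table AS t
    WHERE t.url = r.url
      AND t.timestamp_gmt <= d.day::date
    ORDER BY t.timestamp_gmt DESC
    LIMIT 1
) AS src
ORDER BY r.url, timestamp_gmt;
```

With the example rows from the question, this should return the existing rows plus the three days 2016-04-14 to 2016-04-16 for www.domain.com/1, each carrying the values of 2016-04-13.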
Recommended answer
Finally, I finished my task and I want to share some useful things.
Instead of generate_series I used this trick:
WITH date_range AS (
SELECT trunc(current_date - (row_number() OVER ())) AS date
FROM any_table -- any of your table which has enough data
LIMIT 365
) SELECT * FROM date_range;
To get the list of URLs that I have to fill with data, I used this:
WITH url_list AS (
SELECT
url AS gapsed_url,
MIN(timestamp_gmt) AS min_date,
MAX(timestamp_gmt) AS max_date
FROM daily_table
WHERE url IN (
SELECT url FROM daily_table GROUP BY url
HAVING count(url) < (MAX(timestamp_gmt) - MIN(timestamp_gmt) + 1)
)
GROUP BY url
) SELECT * FROM url_list;
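To see why the HAVING condition picks out only URLs with gaps, it may help to look at both sides of the comparison per URL. The following is a small sketch, assuming timestamp_gmt is a DATE so that subtracting two dates yields a number of days; the column aliases days_present and days_expected are made up for illustration.

```sql
-- A URL has gaps when it has fewer rows than calendar days
-- between its first and last date (inclusive).
SELECT url,
       COUNT(url)                                  AS days_present,
       MAX(timestamp_gmt) - MIN(timestamp_gmt) + 1 AS days_expected
FROM daily_table
GROUP BY url;
```

For the example table, www.domain.com/1 has 3 rows over the 6 days from 2016-04-12 to 2016-04-17, so it qualifies; www.domain.com/2 has 1 row over 1 day, so it does not.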
Then I combined the data from the two queries above; let's call the result url_mapping:
SELECT t1.*, t2.gapsed_url FROM date_range AS t1 CROSS JOIN url_list AS t2
WHERE t1.date <= t2.max_date AND t1.date >= t2.min_date;
And to get the data from the closest earlier date, I did the following:
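-- um.date replaces the source row's own date so that each filled day gets its own row
SELECT sd.url, um.date AS timestamp_gmt, sd.visitors, sd.hits -- plus any other columns of daily_table
FROM url_mapping AS um JOIN daily_table AS sd
ON um.gapsed_url = sd.url AND (
-- take the latest existing row on or before the date being filled
sd.timestamp_gmt = (SELECT max(timestamp_gmt) FROM daily_table WHERE url = sd.url AND timestamp_gmt <= um.date)
);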
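The fragments above refer to date_range, url_list and url_mapping as if they already existed. One way to chain them into a single statement is sketched below, using only a non-recursive WITH. This is an untested sketch: it reuses the expressions from the fragments as they are, assumes timestamp_gmt is a DATE, lists only the visitors and hits columns, and keeps any_table as a placeholder for a table with at least 365 rows. The alias d2 is made up for illustration.

```sql
WITH date_range AS (
    -- the 365 days before today, built without generate_series
    SELECT trunc(current_date - (row_number() OVER ())) AS date
    FROM any_table          -- placeholder: any table with enough rows
    LIMIT 365
),
url_list AS (
    -- URLs that have fewer rows than days between their first and last date
    SELECT url AS gapsed_url,
           MIN(timestamp_gmt) AS min_date,
           MAX(timestamp_gmt) AS max_date
    FROM daily_table
    GROUP BY url
    HAVING COUNT(url) < (MAX(timestamp_gmt) - MIN(timestamp_gmt) + 1)
),
url_mapping AS (
    -- every (URL, day) pair inside that URL's own date span
    SELECT t1.date, t2.gapsed_url
    FROM date_range AS t1
    CROSS JOIN url_list AS t2
    WHERE t1.date BETWEEN t2.min_date AND t2.max_date
)
SELECT sd.url,
       um.date AS timestamp_gmt,   -- the day being filled
       sd.visitors,
       sd.hits
FROM url_mapping AS um
JOIN daily_table AS sd
  ON um.gapsed_url = sd.url
 AND sd.timestamp_gmt = (
        -- latest existing row on or before the day being filled
        SELECT MAX(d2.timestamp_gmt)
        FROM daily_table AS d2
        WHERE d2.url = um.gapsed_url
          AND d2.timestamp_gmt <= um.date
     );
```

Note that the result also contains the days that already exist in daily_table, so those would need to be filtered out before inserting the rows back. The 365-row limit also means that only gaps within the last year are covered.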
I hope it will help someone.