Fill the table with data for missing dates (PostgreSQL, Redshift)


Problem description

I'm trying to fill in daily data for missing dates and can't find an answer. Please help.

Example of my daily_table:

      url          | timestamp_gmt | visitors | hits  | other.. 
-------------------+---------------+----------+-------+-------
 www.domain.com/1  | 2016-04-12    |   1231   | 23423 |
 www.domain.com/1  | 2016-04-13    |   1374   | 26482 |
 www.domain.com/1  | 2016-04-17    |   1262   | 21493 |
 www.domain.com/2  | 2016-05-09    |   2345   | 35471 |          

Expected result: I want to fill this table with a row for every URL and every day, where each added row simply copies the data from the previous available date:

      url          | timestamp_gmt | visitors | hits  | other.. 
-------------------+---------------+----------+-------+-------
 www.domain.com/1  | 2016-04-12    |   1231   | 23423 |
 www.domain.com/1  | 2016-04-13    |   1374   | 26482 |
 www.domain.com/1  | 2016-04-14    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-15    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-16    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-17    |   1262   | 21493 |
 www.domain.com/2  | 2016-05-09    |   2345   | 35471 |          

I could move part of the logic into PHP, but that is undesirable because my table has billions of missing dates.

Summary:

Over the last few days I found out that:

  1. Amazon Redshift is based on version 8 of PostgreSQL, which is why it does not support such a beautiful command as JOIN LATERAL.
  2. Redshift also does not support generate_series.
  3. It does support a simple WITH (thank you @systemjack), but not WITH RECURSIVE.
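
For comparison, on plain PostgreSQL 9.3 or later, where both generate_series and JOIN LATERAL are available, the fill could be sketched roughly as follows. This is an illustrative sketch and not part of the original post: it assumes timestamp_gmt is a DATE column and carries over only the visitors and hits columns from the example table. It is not expected to run on Redshift, given the limitations listed above.

```sql
-- PostgreSQL 9.3+ sketch (assumption: timestamp_gmt is of type DATE).
SELECT
    u.url,
    d.day::date AS timestamp_gmt,   -- every calendar day in the URL's range
    p.visitors,
    p.hits
FROM (
    -- First and last known date per URL.
    SELECT url,
           MIN(timestamp_gmt) AS min_date,
           MAX(timestamp_gmt) AS max_date
    FROM daily_table
    GROUP BY url
) AS u
-- One row per day between the first and last known date.
CROSS JOIN LATERAL generate_series(
    u.min_date::timestamp,
    u.max_date::timestamp,
    interval '1 day'
) AS d(day)
-- The latest existing row on or before that day.
CROSS JOIN LATERAL (
    SELECT t.visitors, t.hits
    FROM daily_table AS t
    WHERE t.url = u.url
      AND t.timestamp_gmt <= d.day
    ORDER BY t.timestamp_gmt DESC
    LIMIT 1
) AS p
ORDER BY u.url, d.day;
```

Because every URL has a row on its own min_date, the inner lateral subquery always finds a match, so no day in the range is dropped.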


Recommended answer

I finally finished my task, and I want to share a few things that turned out to be useful.

Instead of generate_series I used this workaround:

WITH date_range AS (
  -- row_number() numbers the rows from 1; each row becomes one past day
  SELECT trunc(current_date - (row_number() OVER ())) AS date
  FROM any_table  -- any table of yours with at least 365 rows
  LIMIT 365
) SELECT * FROM date_range;

To get the list of URLs that have gaps to fill, I used this:

WITH url_list AS (
  SELECT
    url AS gapsed_url,
    MIN(timestamp_gmt) AS min_date,
    MAX(timestamp_gmt) AS max_date
  FROM daily_table
  WHERE url IN (
    SELECT url FROM daily_table GROUP BY url
    HAVING count(url) < (MAX(timestamp_gmt) - MIN(timestamp_gmt) + 1)
  )
  GROUP BY url
) SELECT * FROM url_list;

Then I combined the two result sets; let's call the result url_mapping:

SELECT t1.*, t2.gapsed_url FROM date_range AS t1 CROSS JOIN url_list AS t2
WHERE t1.date <= t2.max_date AND t1.date >= t2.min_date;

And to get the data from the closest earlier date, I did the following:

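-- um.date is the day being filled; sd.* is the latest existing row on or before that day
SELECT um.date AS filled_date, sd.*
FROM url_mapping AS um JOIN daily_table AS sd
ON um.gapsed_url = sd.url AND (
  sd.timestamp_gmt = (SELECT max(timestamp_gmt) FROM daily_table WHERE url = sd.url AND timestamp_gmt <= um.date)
);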
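
The snippets above are shown separately, but the later ones refer to date_range, url_list and url_mapping by name, so they presumably ran as one statement. A possible way to stitch them together is sketched below. This is an untested assembly of the answer's own pieces, with these assumptions: url_mapping is written as a third CTE, the IN subquery in url_list is folded into a single GROUP BY ... HAVING, the inner subquery gets the alias prev for readability, and the output uses um.date as timestamp_gmt so that each filled row carries the day it fills. any_table remains a placeholder.

```sql
WITH date_range AS (
  -- generate_series substitute from the answer: one row per past day
  SELECT trunc(current_date - (row_number() OVER ())) AS date
  FROM any_table  -- placeholder: any table with at least 365 rows
  LIMIT 365
),
url_list AS (
  -- URLs with fewer rows than days in their date span, i.e. URLs with gaps
  SELECT
    url AS gapsed_url,
    MIN(timestamp_gmt) AS min_date,
    MAX(timestamp_gmt) AS max_date
  FROM daily_table
  GROUP BY url
  HAVING count(url) < (MAX(timestamp_gmt) - MIN(timestamp_gmt) + 1)
),
url_mapping AS (
  -- every (day, URL) pair inside that URL's known date span
  SELECT t1.date, t2.gapsed_url
  FROM date_range AS t1 CROSS JOIN url_list AS t2
  WHERE t1.date <= t2.max_date AND t1.date >= t2.min_date
)
SELECT
  um.gapsed_url AS url,
  um.date       AS timestamp_gmt,  -- the day being filled (assumed intent)
  sd.visitors,
  sd.hits
FROM url_mapping AS um
JOIN daily_table AS sd
  ON um.gapsed_url = sd.url
 AND sd.timestamp_gmt = (
       -- latest existing row on or before the day being filled
       SELECT max(prev.timestamp_gmt)
       FROM daily_table AS prev
       WHERE prev.url = sd.url AND prev.timestamp_gmt <= um.date
     );
```

Note that date_range only reaches back 365 days from today, so the LIMIT would need to grow to cover older data such as the 2016 rows in the question. The result also includes the days that already exist in daily_table, so those would have to be filtered out before inserting the rows back.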

I hope it will help someone.
