SQL - Vertica:如何使用大多数以前的日期数据生成每日行 [英] SQL - Vertica: How to generate daily rows with most previous date data

查看:41
本文介绍了SQL - Vertica:如何使用大多数以前的日期数据生成每日行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个如下所示的基表:

I have a base table like below:

score_upd (Upd_dt,Url,Score) AS (
          SELECT DATE '2019-07-26','A','x'
UNION ALL SELECT DATE '2019-07-26','B','alpha'
UNION ALL SELECT DATE '2019-08-01','A','y'
UNION ALL SELECT DATE '2019-08-01','B','beta'
UNION ALL SELECT DATE '2019-08-03','A','z'
UNION ALL SELECT DATE '2019-08-03','B','gamma'
)

   Upd_dt       URL    Score
 2019-07-26      A       x
 2019-07-26      B      alpha 
 2019-08-01      A       y
 2019-08-01      B      beta
 2019-08-03      A       z
 2019-08-03      B      gamma

我想在每日 url 级别创建一个表,使用新行的大多数前一个日期的值,结果应如下所示:

And I want to create a table in daily-url level, using most previous date's value for the new rows, result should look like below:

score_upd (Upd_dt,Url,Score) AS (
          SELECT DATE '2019-07-26','A','x'
UNION ALL SELECT DATE '2019-07-26','B','alpha'
UNION ALL SELECT DATE '2019-07-27','A','x'
UNION ALL SELECT DATE '2019-07-27','B','alpha'
UNION ALL SELECT DATE '2019-07-28','A','x'
UNION ALL SELECT DATE '2019-07-28','B','alpha'
UNION ALL SELECT DATE '2019-07-29','A','x'
UNION ALL SELECT DATE '2019-07-29','B','alpha'
UNION ALL SELECT DATE '2019-07-30','A','x'
UNION ALL SELECT DATE '2019-07-30','B','alpha'
UNION ALL SELECT DATE '2019-07-31','A','x'
UNION ALL SELECT DATE '2019-07-31','B','alpha'
UNION ALL SELECT DATE '2019-08-01','A','y'
UNION ALL SELECT DATE '2019-08-01','B','beta'
UNION ALL SELECT DATE '2019-08-02','A','y'
UNION ALL SELECT DATE '2019-08-02','B','beta'
UNION ALL SELECT DATE '2019-08-03','A','z'
UNION ALL SELECT DATE '2019-08-03','B','gamma'
UNION ALL SELECT DATE '2019-08-04','A','z'
UNION ALL SELECT DATE '2019-08-04','B','gamma'
UNION ALL SELECT DATE '2019-08-05','A','z'
UNION ALL SELECT DATE '2019-08-05','B','gamma'
) 

看起来像:

   Upd_dt       URL    Score 
 2019-07-26      A       x
 2019-07-26      B      alpha 
 2019-07-27      A       x
 2019-07-27      B      alpha 
 2019-07-28      A       x
 2019-07-28      B      alpha 
 2019-07-29      A       x
 2019-07-29      B      alpha 
 2019-07-30      A       x
 2019-07-30      B      alpha 
 2019-07-31      A       x
 2019-07-31      B      alpha 
 2019-08-01      A       y
 2019-08-01      B      beta
 2019-08-02      A       y
 2019-08-02      B      beta
 2019-08-03      A       z
 2019-08-03      B      gamma
 2019-08-04      A       z
 2019-08-04      B      gamma
 2019-08-05      A       z
 2019-08-05      B      gamma
.
.
.

目前的流程是:从 2019 年 7 月 26 日至今,我通过以下方式构建了每日维度表:

Current process is: I built a daily dimension table since 7/26/2019 till today by:

/*SELECT CAST(slice_time AS DATE) 日期从 testcalendar mtcTIMESERIES slice_time 为1 天"结束 (ORDER BY CAST(mtc.dates as TIMESTAMP));*/

/* SELECT CAST(slice_time AS DATE) dates FROM testcalendar mtc TIMESERIES slice_time as '1 day' OVER (ORDER BY CAST(mtc.dates as TIMESTAMP)); */

所以我得到:

日期

2019-07-26

2019-07-26

2019-07-27

2019-07-27

2019-07-28

2019-07-28

2019-07-29

2019-07-29

.

.

.

2019-10-12(今天)

2019-10-12 (today)

我在考虑是否可以使用诸如插入先前值"之类的函数来按日期加入我的第一个表,通过使用来自大多数先前日期数据的值来生成缺失的天数,但它失败了.

I'm thinking if I can use function such as "interpolate previous value" to join my first table by dates, to generate missing days by using values from most previous date data, while it failed.

结果没有生成缺失天数的行.

The result didn't generate rows for missing days.

如果有人对此有更好的想法,请告诉我.

Please let me know if anyone has any better idea on this.

谢谢!

推荐答案

作为一个开始警告:只在真正需要时才存储每日照片".在我过去,我曾经每年有 364 行太多,因为这些值每年只改变一次.在 Vertica 中,加入和分组需要许可证、CPU 和时钟时间......

As a starting warning : only store a "daily photograph" when it really, really is necessary. In my past, I once ended up having 364 rows too many per year, as the values only changed once a year. In Vertica, that costs license, and CPU and clock time for joining and grouping ...

但是,其余的 - 良好的开端.

But, for the rest - Good start.

但是您可以应用 TIMESERIES 而无需构建日历.

But you could apply the TIMESERIES without having to build a calendar.

诀窍是手动外推"您可以自动INTERPOLATE的内容.

The trick is to "extrapolate" manually what you can INTERPOLATE automatically.

添加一个内嵌的填充"表,其中包含每个 URL 的最新值,但给它 CURRENT_DATE 而不是最新的实际日期 - 使用 Vertica 特有的分析限制子句 LIMIT 1 OVER(PARTITION BY url ORDER BY upd_dt DESC) .

Add an in-line 'padding' table, which contains the newest value per URL, but give it CURRENT_DATE instead of the newest actual date - using Vertica's peculiar analytic limit clause LIMIT 1 OVER(PARTITION BY url ORDER BY upd_dt DESC) .

UNION SELECT 使用您的输入填充表,并将 TIMESERIES 子句应用于该 UNION SELECT.

UNION SELECT that padding table with your input, and apply the TIMESERIES clause to that UNION SELECT.

像这样:

WITH
-- your input ...
score_upd (Upd_dt,Url,Score) AS (
          SELECT DATE '2019-07-26','A','x'
UNION ALL SELECT DATE '2019-07-26','B','alpha'
UNION ALL SELECT DATE '2019-08-01','A','y'
UNION ALL SELECT DATE '2019-08-01','B','beta'
UNION ALL SELECT DATE '2019-08-03','A','z'
UNION ALL SELECT DATE '2019-08-03','B','gamma'
)
-- real WITH clause would start here ...                                                                                                                                                 
,
-- newest row per Url, just with current date
pad_newest AS (
SELECT
  CURRENT_DATE
, url 
, score
FROM score_upd
LIMIT 1 OVER(PARTITION BY url ORDER BY upd_dt DESC)
)   
,   
with_newest AS (
SELECT
  *   
FROM score_upd
UNION ALL 
SELECT *
FROM pad_newest
)   
SELECT
  ts_dt::DATE           AS upd_dt
, url                   AS url 
, TS_FIRST_VALUE(score) AS score
FROM with_newest
TIMESERIES ts_dt AS '1 day' OVER (
  PARTITION BY url ORDER BY upd_dt::TIMESTAMP
)   
ORDER BY 1,2 
;   

这篇关于SQL - Vertica:如何使用大多数以前的日期数据生成每日行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆