Aggregate for each day over time series, without using non-equijoin logic


Question

Given the following dataset paired with a dates table:

MembershipId | ValidFromDate | ValidToDate
==========================================
0001         | 1997-01-01    | 2006-05-09
0002         | 1997-01-01    | 2017-05-12
0003         | 2005-06-02    | 2009-02-07

How many Memberships were open on any given day, or across a time series of days?

After this question was asked previously, an answer there provided the necessary functionality:

select d.[Date]
      ,count(m.MembershipID) as MembershipCount
from DIM.[Date] as d
    left join Memberships as m
        on(d.[Date] between m.ValidFromDateKey and m.ValidToDateKey)
where d.CalendarYear = 2016
group by d.[Date]
order by d.[Date];

However, commenters pointed out that there are other approaches for when the non-equijoin takes too long.

As such, what would equijoin-only logic look like to replicate the output of the query above?

From the answers provided so far I have come up with the query below, which outperforms the alternatives on the hardware I am working with, across 3.2 million Membership records:

declare @s date = '20160101';
declare @e date = getdate();

with s as
(
    select d.[Date] as d
        ,count(s.MembershipID) as s
    from dbo.Dates as d
        join dbo.Memberships as s
            on d.[Date] = s.ValidFromDateKey
    group by d.[Date]
)
,e as
(
    select d.[Date] as d
        ,count(e.MembershipID) as e
    from dbo.Dates as d
        join dbo.Memberships as e
            on d.[Date] = e.ValidToDateKey
    group by d.[Date]
),c as
(
    select isnull(s.d,e.d) as d
            ,sum(isnull(s.s,0) - isnull(e.e,0)) over (order by isnull(s.d,e.d)) as c
    from s
        full join e
            on s.d = e.d
)
select d.[Date]
    ,c.c
from dbo.Dates as d
    left join c
        on d.[Date] = c.d
where d.[Date] between @s and @e
order by d.[Date]
;

Following on from that, to split this aggregate into its constituent groups per day I have the following, which also performs well:

declare @s date = '20160101';
declare @e date = getdate();

with s as
(
    select d.[Date] as d
        ,s.MembershipGrouping as g
        ,count(s.MembershipID) as s
    from dbo.Dates as d
        join dbo.Memberships as s
            on d.[Date] = s.ValidFromDateKey
    group by d.[Date]
            ,s.MembershipGrouping
)
,e as
(
    select d.[Date] as d
        ,e.MembershipGrouping as g
        ,count(e.MembershipID) as e
    from dbo.Dates as d
        join dbo.Memberships as e
            on d.[Date] = e.ValidToDateKey
    group by d.[Date]
            ,e.MembershipGrouping
),c as
(
    select isnull(s.d,e.d) as d
            ,isnull(s.g,e.g) as g
            ,sum(isnull(s.s,0) - isnull(e.e,0)) over (partition by isnull(s.g,e.g) order by isnull(s.d,e.d)) as c
    from s
        full join e
            on s.d = e.d
                and s.g = e.g
)
select d.[Date]
    ,c.g
    ,c.c
from dbo.Dates as d
    left join c
        on d.[Date] = c.d
where d.[Date] between @s and @e
order by d.[Date]
        ,c.g
;

Can anyone improve on the above?

Recommended answer

Assuming your date dimension contains every date covered by every membership period, you can use something like the following.

The join is an equijoin, so it can use a hash join or a merge join, not just nested loops (which execute the inner subtree once for each outer row).
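As a rough illustration of why the equality predicate matters, here is a toy Python sketch rather than anything SQL Server actually runs. The function names and sample data are hypothetical, and real join operators are far more sophisticated (a nested loops join can, for instance, seek an index on the inner side). The point is only that a nested loop visits the inner input once per outer row, while a hash join builds a lookup table once and then probes it, which is possible only when the predicate is an equality.

```python
from collections import defaultdict


def nested_loop_join(outer, inner, predicate):
    """Works for any predicate, but tests every (outer, inner) pair."""
    comparisons = 0
    matches = []
    for o in outer:
        for i in inner:
            comparisons += 1
            if predicate(o, i):
                matches.append((o, i))
    return matches, comparisons


def hash_join(outer, inner, outer_key, inner_key):
    """Needs an equality predicate: build a hash table once, then probe it."""
    table = defaultdict(list)
    for i in inner:
        table[inner_key(i)].append(i)
    matches = []
    for o in outer:
        for i in table.get(outer_key(o), []):
            matches.append((o, i))
    return matches


# Hypothetical toy data: day numbers and (member, start_day) pairs.
days = list(range(1, 8))
starts = [("a", 2), ("b", 2), ("c", 5)]

loop_matches, comparisons = nested_loop_join(
    days, starts, lambda d, m: d == m[1]
)
hash_matches = hash_join(days, starts, lambda d: d, lambda m: m[1])
```

Both functions return the same matches here, but the nested loop performs one comparison per pair of rows, whereas the hash join touches each input row once.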

Assuming an index on (ValidToDate) include(ValidFromDate), or the reverse, this can use a single seek against Memberships and a single scan of the date dimension. For me, the query below has an elapsed time of less than a second to return the results for a year against a table with 3.2 million members and a general active membership of 1.4 million.

DECLARE @StartDate DATE = '2016-01-01',
        @EndDate   DATE = '2016-12-31';

WITH MD
     AS (SELECT Date,
                SUM(Adj) AS MemberDelta
         FROM   Memberships
                CROSS APPLY (VALUES ( ValidFromDate, +1),
                                    --Membership count decremented day after the ValidToDate
                                    (DATEADD(DAY, 1, ValidToDate), -1) ) V(Date, Adj)
         WHERE
          --Members already expired before the time range of interest can be ignored
          ValidToDate >= @StartDate
          AND
          --Members whose membership starts after the time range of interest can be ignored
          ValidFromDate <= @EndDate
         GROUP  BY Date),
     MC
     AS (SELECT DD.DateKey,
                SUM(MemberDelta) OVER (ORDER BY DD.DateKey ROWS UNBOUNDED PRECEDING) AS CountOfNonIgnoredMembers
         FROM   DIM_DATE DD
                LEFT JOIN MD
                  ON MD.Date = DD.DateKey)
SELECT DateKey,
       CountOfNonIgnoredMembers AS MembershipCount
FROM   MC
WHERE  DateKey BETWEEN @StartDate AND @EndDate 
ORDER BY DateKey
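The same idea can be sketched outside SQL to check the logic: each membership contributes +1 on its ValidFromDate and -1 on the day after its ValidToDate, and a running sum over the calendar gives the daily count. The Python below is a minimal sketch using the three sample rows from the question. The function names `daily_counts` and `brute_force` are illustrative only, and unlike the query it seeds the running total from deltas dated before the start of the range instead of scanning the whole date dimension.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows from the question: (MembershipId, ValidFromDate, ValidToDate)
memberships = [
    ("0001", date(1997, 1, 1), date(2006, 5, 9)),
    ("0002", date(1997, 1, 1), date(2017, 5, 12)),
    ("0003", date(2005, 6, 2), date(2009, 2, 7)),
]


def daily_counts(rows, start, end):
    """Open memberships per day in [start, end], via deltas and a running sum."""
    deltas = defaultdict(int)
    for _, valid_from, valid_to in rows:
        # Memberships that cannot overlap the range of interest are ignored.
        if valid_to < start or valid_from > end:
            continue
        deltas[valid_from] += 1
        # Decrement the day after ValidToDate so the last day still counts.
        deltas[valid_to + timedelta(days=1)] -= 1

    # Deltas dated before the range form the opening balance.
    running = sum(adj for day, adj in deltas.items() if day < start)
    counts = {}
    day = start
    while day <= end:
        running += deltas.get(day, 0)
        counts[day] = running
        day += timedelta(days=1)
    return counts


def brute_force(rows, start, end):
    """Reference result: the inclusive BETWEEN test applied to every day."""
    counts = {}
    day = start
    while day <= end:
        counts[day] = sum(1 for _, f, t in rows if f <= day <= t)
        day += timedelta(days=1)
    return counts


counts = daily_counts(memberships, date(2006, 5, 8), date(2006, 5, 11))
```

For the sample data, membership 0001 is still counted on 2006-05-09 and drops out on 2006-05-10, which matches the inclusive BETWEEN semantics of the original non-equijoin query.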

Demo (uses an extended period, as the calendar year 2016 isn't very interesting with the example data)
