Group records by consecutive dates when dates are not exactly consecutive


Problem description


I have some data that contains dates. I'm trying to group the data by consecutive dates; however, the dates are not exactly consecutive. Here is an example:

DateColumn              | Value
------------------------+-------
2017-01-18 01:12:34.107 | 215426 <- batch no. 1
2017-01-18 01:12:34.113 | 215636
2017-01-18 01:12:34.623 | 123516
2017-01-18 01:12:34.633 | 289926
2017-01-18 04:58:42.660 | 259063 <- batch no. 2
2017-01-18 04:58:42.663 | 261830
2017-01-18 04:58:42.893 | 219835
2017-01-18 04:58:42.907 | 250165
2017-01-18 05:18:14.660 | 134253 <- batch no. 3
2017-01-18 05:18:14.663 | 134257
2017-01-18 05:18:14.667 | 134372
2017-01-18 05:18:15.040 | 181679
2017-01-18 05:18:15.043 | 226368
2017-01-18 05:18:15.043 | 227070

The data is generated in batches and each row inside a batch takes a few milliseconds to generate. I'm trying to group the results as follows:

Date1                   | Date2                   | Count
------------------------+-------------------------+------
2017-01-18 01:12:34.107 | 2017-01-18 01:12:34.633 | 4
2017-01-18 04:58:42.660 | 2017-01-18 04:58:42.907 | 4
2017-01-18 05:18:14.660 | 2017-01-18 05:18:15.043 | 6

It is safe to assume that if two consecutive rows are more than 1 minute apart, they belong to different batches.

I tried solutions involving the ROW_NUMBER function, but they only work with strictly consecutive dates (where the date difference between two rows is fixed). How can I achieve the desired result when the difference is fuzzy?


Please note that a batch could be much longer than a minute. For example, a batch might start at 2017-01-01 00:00:00 and end at 2017-01-01 00:05:00, contain ~3000 rows, and have rows a few dozen or a few hundred milliseconds apart. What is certain is that batches are at least 1 minute apart.

Solution

Try this:

select min(t.dateColumn) date1, max(t.dateColumn) date2, count(*) [count]
from (
    -- the running total of the flags gives every batch its own number
    select t.*, sum(val) over (
            order by t.dateColumn
            ) grp
    from (
        -- val = 1 when a row is more than 60000 ms after the previous row
        select t.*, case
                when datediff(ms, lag(t.dateColumn, 1, t.dateColumn) over (
                            order by t.dateColumn
                            ), t.dateColumn) > 60000
                    then 1
                else 0
                end val
        from your_table t
        ) t
    ) t
group by grp;

For the sample data, this produces the three rows requested in the question. The query uses the analytic function LAG() to flag the start of each new batch, based on how far a row's datecolumn is from the previous row's. A running SUM() over that flag then numbers the batches, and grouping by that number yields the required aggregates.
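To make the two steps easier to follow, here is a minimal sketch that prints the intermediate val and grp columns for six of the sample rows. The table variable @sample is hypothetical and stands in for your own table. The literals use the ISO 8601 form with a T separator so that they parse the same way under any language setting.

```sql
-- Hypothetical table variable holding the first and last row of each sample batch.
DECLARE @sample TABLE (DateColumn datetime, Value int);

INSERT INTO @sample (DateColumn, Value) VALUES
    ('2017-01-18T01:12:34.107', 215426),
    ('2017-01-18T01:12:34.633', 289926),
    ('2017-01-18T04:58:42.660', 259063),
    ('2017-01-18T04:58:42.907', 250165),
    ('2017-01-18T05:18:14.660', 134253),
    ('2017-01-18T05:18:15.043', 227070);

SELECT
    f.DateColumn,
    f.val,                                           -- 1 on the first row of a new batch
    SUM(f.val) OVER (ORDER BY f.DateColumn) AS grp   -- running total = batch number
FROM (
    SELECT
        s.DateColumn,
        CASE
            WHEN DATEDIFF(ms,
                          LAG(s.DateColumn, 1, s.DateColumn) OVER (ORDER BY s.DateColumn),
                          s.DateColumn) > 60000
            THEN 1 ELSE 0
        END AS val
    FROM @sample AS s
) AS f
ORDER BY f.DateColumn;
-- val is 0, 0, 1, 0, 1, 0 and grp is 0, 0, 1, 1, 2, 2 for the six rows above.
```

The very first row gets val = 0 because LAG() falls back to the row's own date, so the first batch is numbered 0.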

Some rows may be misclassified because of rounding in DATETIME. According to MSDN, datetime values are rounded to increments of .000, .003, or .007 seconds.
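The rounding can be seen directly by casting a few literals. This is only an illustrative sketch: the timestamps are made up, and datetime2(3) is included as one possible way to keep exact milliseconds, which assumes you are free to change the column type.

```sql
SELECT
    CAST('2017-01-18T01:12:34.634' AS datetime)     AS rounded_down, -- stored as 01:12:34.633
    CAST('2017-01-18T01:12:34.635' AS datetime)     AS rounded_up,   -- stored as 01:12:34.637
    CAST('2017-01-18T01:12:34.999' AS datetime)     AS next_second,  -- stored as 01:12:35.000
    CAST('2017-01-18T01:12:34.634' AS datetime2(3)) AS exact_ms;     -- stays   01:12:34.634
```

Because a stored value can be off by a couple of milliseconds, two rows whose true gap is just above or just below 60000 ms may end up on the wrong side of the threshold.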


Here is the same query rewritten using CTEs:

WITH cte1(DateColumn, ValueColumn) AS (
    -- Insert your query that returns a datetime column and any other column
    SELECT
        SomeDate,
        SomeValue
    FROM SomeTable
    WHERE SomeColumn IS NOT NULL
), cte2 AS (
    -- This query adds a column called "val" that contains
    -- 1 when current row date - previous row date > 1 minute
    -- 0 otherwise
    SELECT
        cte1.*,
        CASE WHEN DATEDIFF(MS, LAG(DateColumn, 1, DateColumn) OVER (ORDER BY DateColumn), DateColumn) > 60000 THEN 1 ELSE 0 END AS val
    FROM cte1
), cte3 AS (
    -- This query adds a column called "grp" that numbers 
    -- the groups using running sum over the "val" column
    SELECT
        cte2.*,
        SUM(val) OVER (ORDER BY DateColumn) AS grp
    FROM cte2
)
SELECT
    MIN(DateColumn) Date1,
    MAX(DateColumn) Date2,
    COUNT(ValueColumn) [Count]
FROM cte3
GROUP BY grp
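One caveat worth knowing: DATEDIFF returns an int, so with the ms datepart it raises an overflow error once two adjacent rows are more than roughly 24 days and 20 hours apart. The sketch below is a variant under stated assumptions, not part of the original answer. It uses DATEDIFF_BIG (available from SQL Server 2016) and a variable for the threshold. The names @events and @gap_ms and the sample rows are hypothetical.

```sql
-- Hypothetical sample data; replace @events with your own table.
DECLARE @events TABLE (DateColumn datetime, ValueColumn int);

INSERT INTO @events (DateColumn, ValueColumn) VALUES
    ('2017-01-18T01:12:34.107', 215426),
    ('2017-01-18T01:12:34.113', 215636),
    ('2017-03-01T04:58:42.660', 259063),  -- more than 24 days after the previous row
    ('2017-03-01T04:58:42.663', 261830);

DECLARE @gap_ms bigint = 60000;  -- assumed batch separator: 1 minute

WITH flagged AS (
    -- val = 1 when the gap to the previous row exceeds the threshold
    SELECT
        DateColumn,
        ValueColumn,
        CASE
            WHEN DATEDIFF_BIG(ms,
                              LAG(DateColumn, 1, DateColumn) OVER (ORDER BY DateColumn),
                              DateColumn) > @gap_ms
            THEN 1 ELSE 0
        END AS val
    FROM @events
), grouped AS (
    -- running total of val numbers the batches
    SELECT
        DateColumn,
        ValueColumn,
        SUM(val) OVER (ORDER BY DateColumn) AS grp
    FROM flagged
)
SELECT
    MIN(DateColumn)    AS Date1,
    MAX(DateColumn)    AS Date2,
    COUNT(ValueColumn) AS [Count]
FROM grouped
GROUP BY grp
ORDER BY Date1;
-- Returns two batches of two rows each for the sample data above.
```

If every gap in your data is known to be shorter than about 24 days, the plain DATEDIFF version above behaves the same way.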
