使用T-SQL汇总仅相邻记录 [英] Aggregate adjacent only records with T-SQL

查看:114
本文介绍了使用T-SQL汇总仅相邻记录的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个(带有示例数据的)表格,其中包含以下数据

I have (simplified for the example) a table with the following data

Row Start       Finish       ID  Amount
--- ---------   ----------   --  ------
  1 2008-10-01  2008-10-02   01      10
  2 2008-10-02  2008-10-03   02      20
  3 2008-10-03  2008-10-04   01      38
  4 2008-10-04  2008-10-05   01      23
  5 2008-10-05  2008-10-06   03      14
  6 2008-10-06  2008-10-07   02       3
  7 2008-10-07  2008-10-08   02       8
  8 2008-10-08  2008-11-08   03      19

日期代表一个时期在一段时间内,ID是该系统在该时期内所处的状态,金额是与该状态有关的值。

The dates represent a period in time, the ID is the state a system was in during that period and the amount is a value related to that state.

我要做的是汇总ID编号为相同相邻行的数量,但保持相同的整体顺序,以便可以合并连续的运行。因此,我想以以下数据结尾:

What I want to do is to aggregate the Amounts for adjacent rows with the same ID number, but keep the same overall sequence so that contiguous runs can be combined. Thus I want to end up with data like:

Row Start       Finish       ID  Amount
--- ---------   ----------   --  ------
  1 2008-10-01  2008-10-02   01      10
  2 2008-10-02  2008-10-03   02      20
  3 2008-10-03  2008-10-05   01      61
  4 2008-10-05  2008-10-06   03      14
  5 2008-10-06  2008-10-08   02      11
  6 2008-10-08  2008-11-08   03      19

我追求可以放入SP中的T-SQL解决方案,但是我看不到如何通过简单的查询来做到这一点。我怀疑这可能需要某种迭代,但是我不想走这条路。

I am after a T-SQL solution that can be put into a SP, however I can't see how to do that with simple queries. I suspect that it may require iteration of some sort but I don't want to go down that path.

我要进行此聚合的原因是下一步在此过程中,将执行SUM()和Count()并按序列中出现的唯一ID进行分组,这样我的最终数据将类似于:

The reason I want to do this aggregation is that the next step in the process is to do a SUM() and Count() grouped by the unique ID's that occur within the sequence, so that my final data will look something like:

ID  Counts Total
--  ------ -----
01       2    71
02       2    31
03       2    33

但是如果我做一个简单的

However if I do a simple

SELECT COUNT(ID), SUM(Amount) FROM data GROUP BY ID

在原始表上,我得到类似

On the original table I get something like

ID  Counts Total
--  ------ -----
01       3    71
02       3    31
03       2    33

这不是我想要的。

推荐答案

如果您读过《用以下语言开发面向时间的数据库应用程序》一书SQL, RT Snodgrass (可从他的网站上的出版物中获得pdf格式),在p165-166上的6.25中,您会发现非平凡的SQL,可以在当前示例中使用该SQL来对具有相同ID值和连续时间间隔的各行进行分组。

If you read the book "Developing Time-Oriented Database Applications in SQL" by R T Snodgrass (the pdf of which is available from his web site under publications), and get as far as Figure 6.25 on p165-166, you will find the non-trivial SQL which can be used in the current example to group the various rows with the same ID value and continuous time intervals.

下面的查询开发已接近正确,但是最后发现一个问题,该问题的来源在第一个SELECT语句中。我还没有找到给出错误答案的原因。 [如果有人可以在其DBMS上测试SQL并告诉我第一个查询在那里是否正常工作,那将是很大的帮助!]

The query development below is close to correct, but there is a problem spotted right at the end, that has its source in the first SELECT statement. I've not yet tracked down why the incorrect answer is being given. [If someone can test the SQL on their DBMS and tell me whether the first query works correctly there, it would be a great help!]

看起来像:

-- Derived from Figure 6.25 from Snodgrass "Developing Time-Oriented
-- Database Applications in SQL"
CREATE TABLE Data
(
    Start   DATE,
    Finish  DATE,
    ID      CHAR(2),
    Amount  INT
);

INSERT INTO Data VALUES('2008-10-01', '2008-10-02', '01', 10);
INSERT INTO Data VALUES('2008-10-02', '2008-10-03', '02', 20);
INSERT INTO Data VALUES('2008-10-03', '2008-10-04', '01', 38);
INSERT INTO Data VALUES('2008-10-04', '2008-10-05', '01', 23);
INSERT INTO Data VALUES('2008-10-05', '2008-10-06', '03', 14);
INSERT INTO Data VALUES('2008-10-06', '2008-10-07', '02',  3);
INSERT INTO Data VALUES('2008-10-07', '2008-10-08', '02',  8);
INSERT INTO Data VALUES('2008-10-08', '2008-11-08', '03', 19);

SELECT DISTINCT F.ID, F.Start, L.Finish
    FROM Data AS F, Data AS L
    WHERE F.Start < L.Finish
      AND F.ID = L.ID
      -- There are no gaps between F.Finish and L.Start
      AND NOT EXISTS (SELECT *
                        FROM Data AS M
                        WHERE M.ID = F.ID
                        AND F.Finish < M.Start
                        AND M.Start < L.Start
                        AND NOT EXISTS (SELECT *
                                            FROM Data AS T1
                                            WHERE T1.ID = F.ID
                                              AND T1.Start <  M.Start
                                              AND M.Start  <= T1.Finish))
      -- Cannot be extended further
      AND NOT EXISTS (SELECT *
                          FROM Data AS T2
                          WHERE T2.ID = F.ID
                            AND ((T2.Start <  F.Start  AND F.Start  <= T2.Finish)
                              OR (T2.Start <= L.Finish AND L.Finish <  T2.Finish)));

该查询的输出为:

01  2008-10-01      2008-10-02
01  2008-10-03      2008-10-05
02  2008-10-02      2008-10-03
02  2008-10-06      2008-10-08
03  2008-10-05      2008-10-06
03  2008-10-05      2008-11-08
03  2008-10-08      2008-11-08

已编辑:倒数第二行的问题-它不应该在那里。而且我还不清楚(现在)它来自哪里。

Edited: There's a problem with the penultimate row - it should not be there. And I'm not clear (yet) where it is coming from.

现在我们需要在另一个SELECT语句的FROM子句中将该复杂表达式视为查询表达式。 ,它将与上面显示的最大范围重叠的条目上的给定ID的金额值相加。

Now we need to treat that complex expression as a query expression in the FROM clause of another SELECT statement, which will sum the amount values for a given ID over the entries that overlap with the maximal ranges shown above.

SELECT M.ID, M.Start, M.Finish, SUM(D.Amount)
    FROM Data AS D,
         (SELECT DISTINCT F.ID, F.Start, L.Finish
              FROM Data AS F, Data AS L
              WHERE F.Start < L.Finish
                AND F.ID = L.ID
                -- There are no gaps between F.Finish and L.Start
                AND NOT EXISTS (SELECT *
                                    FROM Data AS M
                                    WHERE M.ID = F.ID
                                    AND F.Finish < M.Start
                                    AND M.Start < L.Start
                                    AND NOT EXISTS (SELECT *
                                                        FROM Data AS T1
                                                        WHERE T1.ID = F.ID
                                                          AND T1.Start <  M.Start
                                                          AND M.Start  <= T1.Finish))
                  -- Cannot be extended further
                AND NOT EXISTS (SELECT *
                                    FROM Data AS T2
                                    WHERE T2.ID = F.ID
                                      AND ((T2.Start <  F.Start  AND F.Start  <= T2.Finish)
                                        OR (T2.Start <= L.Finish AND L.Finish <  T2.Finish)))) AS M
    WHERE D.ID = M.ID
      AND M.Start  <= D.Start
      AND M.Finish >= D.Finish
    GROUP BY M.ID, M.Start, M.Finish
    ORDER BY M.ID, M.Start;

这给出了:

ID  Start        Finish       Amount
01  2008-10-01   2008-10-02   10
01  2008-10-03   2008-10-05   61
02  2008-10-02   2008-10-03   20
02  2008-10-06   2008-10-08   11
03  2008-10-05   2008-10-06   14
03  2008-10-05   2008-11-08   33              -- Here be trouble!
03  2008-10-08   2008-11-08   19

已编辑:这几乎是执行原始问题所要求的COUNT和SUM汇总的正确数据集,因此最终答案是:

Edited: This is almost the correct data set on which to do the COUNT and SUM aggregation requested by the original question, so the final answer is:

SELECT I.ID, COUNT(*) AS Number, SUM(I.Amount) AS Amount
    FROM (SELECT M.ID, M.Start, M.Finish, SUM(D.Amount) AS Amount
            FROM Data AS D,
                 (SELECT DISTINCT F.ID, F.Start, L.Finish
                      FROM  Data AS F, Data AS L
                      WHERE F.Start < L.Finish
                        AND F.ID = L.ID
                        -- There are no gaps between F.Finish and L.Start
                        AND NOT EXISTS
                            (SELECT *
                                FROM  Data AS M
                                WHERE M.ID = F.ID
                                  AND F.Finish < M.Start
                                  AND M.Start < L.Start
                                  AND NOT EXISTS
                                      (SELECT *
                                          FROM Data AS T1
                                          WHERE T1.ID = F.ID
                                            AND T1.Start <  M.Start
                                            AND M.Start  <= T1.Finish))
                          -- Cannot be extended further
                        AND NOT EXISTS
                            (SELECT *
                                FROM  Data AS T2
                                WHERE T2.ID = F.ID
                                  AND ((T2.Start <  F.Start  AND F.Start  <= T2.Finish) OR
                                       (T2.Start <= L.Finish AND L.Finish <  T2.Finish)))
                 ) AS M
            WHERE D.ID = M.ID
              AND M.Start  <= D.Start
              AND M.Finish >= D.Finish
            GROUP BY M.ID, M.Start, M.Finish
          ) AS I
        GROUP BY I.ID
        ORDER BY I.ID;

id     number  amount
01      2      71
02      2      31
03      3      66

评论
哦! Drat ... 3的条目具有应有的数量的两倍。之前的已编辑部分会指出问题出在哪里。似乎第一个查询是错误的(可能是针对另一个问题),或者我正在使用的优化程序行为异常。但是,应该有一个与此相关的答案,它将给出正确的值。

Review: Oh! Drat...the entry for 3 has twice the 'amount' that it should have. Previous 'edited' parts indicate where things started to go wrong. It looks as though either the first query is subtly wrong (maybe it is intended for a different question), or the optimizer I'm working with is misbehaving. Nevertheless, there should be an answer closely related to this that will give the correct values.

记录如下:在Solaris 10上的IBM Informix Dynamic Server 11.50上进行了测试。应该可以在其他任何符合标准的SQL DBMS上正常工作。

For the record: tested on IBM Informix Dynamic Server 11.50 on Solaris 10. However, should work fine on any other moderately standard-conformant SQL DBMS.

这篇关于使用T-SQL汇总仅相邻记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆