离岛与缺口问题 [英] Islands and Gaps Issue

查看:99
本文介绍了离岛与缺口问题的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

背景故事:我有一个数据库,其中包含卡车司机的数据点,其中也包含.在卡车上时,驾驶员可以具有"driverstatus"状态.我要做的是按驾驶员,卡车将这些状态分组.

到目前为止,我已经尝试使用LAG/LEAD来提供帮助.这样做的原因是这样,我可以告诉驾驶员状态何时发生更改,然后可以将该行标记为具有该状态的最后日期时间.

这本身是不够的,因为我需要按状态和日期对状态进行分组.为此,我有了诸如DENSE_RANK之类的东西,但是我无法设法获得有关ORDER BY子句的正确信息.

这是我的测试数据,这是我中许多人排名排名的尝试.

/****** Script for SelectTopNRows command from SSMS  ******/
DECLARE @SomeTable TABLE
(
    loginId VARCHAR(255),
    tractorId VARCHAR(255),
    messageTime DATETIME,
    driverStatus VARCHAR(2)
);

INSERT INTO @SomeTable (loginId, tractorId, messageTime, driverStatus)
VALUES('driver35','23533','2018-08-10 8:33 AM','2'),
('driver35','23533','2018-08-10 8:37 AM','2'),
('driver35','23533','2018-08-10 8:56 AM','2'),
('driver35','23533','2018-08-10 8:57 AM','1'),
('driver35','23533','2018-08-10 8:57 AM','1'),
('driver35','23533','2018-08-10 8:57 AM','1'),
('driver35','23533','2018-08-10 9:07 AM','1'),
('driver35','23533','2018-08-10 9:04 AM','1'),
('driver35','23533','2018-08-12 8:07 AM','3'),
('driver35','23533','2018-08-12 8:37 AM','3'),
('driver35','23533','2018-08-12 9:07 AM','3'),
('driver35','23533','2018-06-12 8:07 AM','2'),
('driver35','23533','2018-06-12 8:37 AM','2'),
('driver35','23533','2018-06-12 9:07 AM','2')
;
SELECT *, DENSE_RANK() OVER(PARTITION BY 
  loginId, tractorId, driverStatus 
ORDER BY messageTime ) FROM @SomeTable
;

理想情况下,我的最终结果应该是这样的:

loginId tractorId   startTime           endTime            driverStatus
driver35    23533   2018-08-10 8:33 AM  2018-08-10 8:56 AM      2
driver35    23533   2018-08-10 8:57 AM  2018-08-10 9:07 AM      1
driver35    23533   2018-08-12 8:07 AM  2018-08-12 9:07 AM      3

对此表示任何帮助.

解决方案

WITH drivers_data AS
(
    SELECT *,
           row_num =     ROW_NUMBER()
                         OVER (PARTITION BY loginId,
                                            tractorId,
                                            CAST(messageTime AS date),
                                            driverStatus
                               ORDER BY messageTime),

           row_num_all = ROW_NUMBER()
                         OVER (PARTITION BY loginId,
                                            tractorId
                               ORDER BY messageTime),

           first_date =  FIRST_VALUE (messageTime)
                         OVER (PARTITION BY loginId,
                                            tractorId,
                                            CAST(messageTime AS date),
                                            driverStatus
                               ORDER BY messageTime),

           last_date =   LAST_VALUE (messageTime)
                         OVER (PARTITION BY loginId,
                                            tractorId,
                                            CAST(messageTime AS date),
                                            driverStatus
                               ORDER BY messageTime
                               ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
    FROM @t
)
SELECT loginId, tractorId, first_date, last_date, driverStatus
FROM drivers_data
WHERE row_num = 1
ORDER BY row_num_all;

输出:

+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:56:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:57:00 | 2018-10-08 09:07:00 | 1            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-12-06 08:07:00 | 2018-12-06 09:07:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-12-08 08:07:00 | 2018-12-08 09:07:00 | 3            |
+----------+-----------+---------------------+---------------------+--------------+

我将尝试解释这里发生的事情:

  1. row_num 这用于对行进行编号,这些行受驱动程序的日期和状态限制.我们需要转换,因为我们需要没有时间的日期部分.
  2. row_num_all 这是关键属性,因为它最终允许我们按出现的顺序对行进行排序.此窗口不受状态限制,因为我们需要对整个驾驶员数据进行编号.
  3. 首次约会 FIRST_VALUE是我们方便的功能.它只是检索第一个日期时间出现.
  4. 最后日期正确地假设,对于最后一个日期,我们需要LAST_VALUE窗口函数.但是使用它很棘手,需要更多的解释.如您所见,我明确地使用了特殊的框架ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING.但为什么?让我解释.让我们使用默认框架获取日期10/8/2018和状态2 的一部分输出.我们得到以下结果:

+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:37:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:56:00 | 2            | 
+----------+-----------+---------------------+---------------------+--------------+

如您所见,最后日期是不正确!发生这种情况是因为LAST_VALUE使用默认框架RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW-这意味着最后一行始终是窗口中的当前行.这是在幕后发生的事情.创建三个窗口.每行都有其自己的窗口.然后,它从窗口中检索最后一行:

第一行的窗口

+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2            |
+----------+-----------+---------------------+---------------------+--------------+

第二行的窗口

+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:37:00 | 2            |
+----------+-----------+---------------------+---------------------+--------------+

第三行窗口

+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:37:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:56:00 | 2            | 
+----------+-----------+---------------------+---------------------+--------------+

因此,解决此问题的方法是更改​​框架:我们不需要从开头移至当前行,而是从当前行移至结尾.因此,UNBOUNDED FOLLOWING就是这个意思-当前窗口的最后一行.

  1. 下一个是WHERE row_num = 1.这很简单:由于所有行都具有关于首个日期和最后日期的相同信息,因此我们只需要第一行.

  2. 最后一部分是ORDER BY row_num_all.这是您正确订购的地方.

P.S.

  1. 您所需要的输出有问题. 对于日期8/10/18 8:57 AM和状态1,最后日期必须为10/8/2018 9:07 AM-而不是您提到的10/8/2018 9:04 AM.

  2. 此外,日期12/6/2018和状态2缺少输出.

更新:

以下是FIRST_VALUELAST_VALUE的工作方式的说明.

所有三个数字均包含以下部分:

  1. 查询数据.这是查询的结果.
  2. 原始查询原始数据.
  3. Windows 这些是计算的中间步骤.
  4. 框架提及所使用的框架.
  5. 绿色单元格窗口规范.

这是引擎盖下发生的事情:

  1. 首先,SQL Server为所有提到的字段创建分区.在图上是partition列.
  2. 每个分区可以有一个框架:默认或自定义.默认框架为RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.这意味着该行将在分区的开始位置和当前行之间获得窗口.如果您不提及框架,则会使用默认框架.
  3. 每个框架为每一行创建一个窗口.在图中,这些窗口在row 1row 2列中,并用颜色标记.行号对应于row_num_all字段.
  4. 行仅在其窗口范围内操作.

1. FIRST_VALUE

IMG_FIRST_VALUE"

要获取第一个日期,我们可以使用方便的FIRST_VALUE窗口函数. 如您所见,我们在这里使用默认框架.这意味着对于每一行,窗口将在窗口的开始与当前行之间.对于第一次约会,这正是我们需要的.每行将从第一行获取值.第一个日期在"first_date"字段中.

2. LAST_VALUE-错误的帧

IMG_LAST_VALUE"

现在,我们需要计算上次日期.最后日期在分区的最后一行中,因此我们可以使用LAST_VALUE窗口函数. 如前所述,如果我们不提及框架,则使用默认框架.如图所示,该框架始终在当前行结束-这是不正确,因为我们需要最后一个窗口行的日期. last_date字段向我们显示了不正确的结果-它反映了当前行的日期.

3. LAST_VALUE-正确的帧

IMG_LAST_VALUE_correct_frame"

要解决获取最后日期的问题,我们需要更改LAST_VALUE将在其上操作的框架:ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING.如您所见,每行的窗口现在位于当前行和分区末尾之间.在这种情况下,LAST_VALUE将正确地从窗口的最后一行获取日期.现在last_date字段中的结果正确.

Backstory: I have a database that has data points of drivers in trucks which also contain the. While in a truck, the driver can have a 'driverstatus'. What I'd like to do is group these statuses by driver, truck.

As of now, I've tried using LAG/LEAD to help. The reason for this is so I can tell when a driverstatus change occurs, and then I can mark that row as having the last datetime of that status.

That in itself is insufficient, because I need to group the statuses by their status and date. For this, I've got something such as DENSE_RANK, but I can't manage to get that right concerning the ORDER BY clause.

Here is my test data, and here is one attempt of many of me floundering with ranking.

/****** Script for SelectTopNRows command from SSMS  ******/
DECLARE @SomeTable TABLE
(
    loginId VARCHAR(255),
    tractorId VARCHAR(255),
    messageTime DATETIME,
    driverStatus VARCHAR(2)
);

INSERT INTO @SomeTable (loginId, tractorId, messageTime, driverStatus)
VALUES('driver35','23533','2018-08-10 8:33 AM','2'),
('driver35','23533','2018-08-10 8:37 AM','2'),
('driver35','23533','2018-08-10 8:56 AM','2'),
('driver35','23533','2018-08-10 8:57 AM','1'),
('driver35','23533','2018-08-10 8:57 AM','1'),
('driver35','23533','2018-08-10 8:57 AM','1'),
('driver35','23533','2018-08-10 9:07 AM','1'),
('driver35','23533','2018-08-10 9:04 AM','1'),
('driver35','23533','2018-08-12 8:07 AM','3'),
('driver35','23533','2018-08-12 8:37 AM','3'),
('driver35','23533','2018-08-12 9:07 AM','3'),
('driver35','23533','2018-06-12 8:07 AM','2'),
('driver35','23533','2018-06-12 8:37 AM','2'),
('driver35','23533','2018-06-12 9:07 AM','2')
;
SELECT *, DENSE_RANK() OVER(PARTITION BY 
  loginId, tractorId, driverStatus 
ORDER BY messageTime ) FROM @SomeTable
;

My end result would ideally look something like this:

loginId tractorId   startTime           endTime            driverStatus
driver35    23533   2018-08-10 8:33 AM  2018-08-10 8:56 AM      2
driver35    23533   2018-08-10 8:57 AM  2018-08-10 9:07 AM      1
driver35    23533   2018-08-12 8:07 AM  2018-08-12 9:07 AM      3

Any help on this is greatly appreciated.

解决方案

WITH drivers_data AS
(
    SELECT *,
           row_num =     ROW_NUMBER()
                         OVER (PARTITION BY loginId,
                                            tractorId,
                                            CAST(messageTime AS date),
                                            driverStatus
                               ORDER BY messageTime),

           row_num_all = ROW_NUMBER()
                         OVER (PARTITION BY loginId,
                                            tractorId
                               ORDER BY messageTime),

           first_date =  FIRST_VALUE (messageTime)
                         OVER (PARTITION BY loginId,
                                            tractorId,
                                            CAST(messageTime AS date),
                                            driverStatus
                               ORDER BY messageTime),

           last_date =   LAST_VALUE (messageTime)
                         OVER (PARTITION BY loginId,
                                            tractorId,
                                            CAST(messageTime AS date),
                                            driverStatus
                               ORDER BY messageTime
                               ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
    FROM @t
)
SELECT loginId, tractorId, first_date, last_date, driverStatus
FROM drivers_data
WHERE row_num = 1
ORDER BY row_num_all;

OUTPUT:

+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:56:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:57:00 | 2018-10-08 09:07:00 | 1            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-12-06 08:07:00 | 2018-12-06 09:07:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-12-08 08:07:00 | 2018-12-08 09:07:00 | 3            |
+----------+-----------+---------------------+---------------------+--------------+

I will try to explain what's going on here:

  1. row_num This is for numbering rows which are restricted by date and status of the driver. We need casting since we need date part without time.
  2. row_num_all This is the key attribute since it allows us in the end to sort rows by occurrence. This window is not restricted by status since we need numbering across whole driver's data.
  3. first_date The FIRST_VALUE is handy function for our purpose. It just retrieves the first datetime occurrence.
  4. last_date It's correct to assume that for the last date we need LAST_VALUE window function. But using it is tricky and requires more explanation. As you can see I explicitly use special framing ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. But why? Let me explain. Let's take a part of output for date 10/8/2018 and status 2 with default framing. We get the following results:

+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:37:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:56:00 | 2            | 
+----------+-----------+---------------------+---------------------+--------------+

As you can see, the last date is incorrect! This happens because LAST_VALUE uses default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW- it means that last row is always current row in window. Here's what happens under the hood. Three windows get created. Each row gets its own window. Then it retrieves the last row from window:

Window for 1-st row

+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2            |
+----------+-----------+---------------------+---------------------+--------------+

Window for 2-nd row

+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:37:00 | 2            |
+----------+-----------+---------------------+---------------------+--------------+

Window for 3-rd row

+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:37:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:56:00 | 2            | 
+----------+-----------+---------------------+---------------------+--------------+

So, the solution for this is to change framing: we need to move not from beginning to current row, but from current row to the end. So, UNBOUNDED FOLLOWING just means this - last row in current window.

  1. Next is WHERE row_num = 1. This is all simple: since all rows has same information about first date and last date, we just need first row.

  2. The final part is ORDER BY row_num_all. This is where you get your correct ordering.

P.S.

  1. Your desired output in question is incorrect. For date 8/10/18 8:57 AM and status 1 the last date must be 10/8/2018 9:07 AM - not 10/8/2018 9:04 AM, as you mentioned.

  2. Also there's missing output for date 12/6/2018 and status 2.

UPDATE:

Here are illustrations of how FIRST_VALUE and LAST_VALUE work.

All three figures have following parts:

  1. Query data This is result of query.
  2. Original query Original source data.
  3. Windows These are intermediate steps for calculations.
  4. Frame Mentions which frame is used.
  5. Green cell Window specification.

Here's what's happening under the hood:

  1. First, SQL Server creates partitions for all mentioned fields. On figure it is partition column.
  2. Each partition can have a frame: either default or custom. The default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This means that the row gets window between the start of partition and current row. If you don't mention a frame, the default frame comes into play.
  3. Each frame creates window for each row. On figures these windows are in columns row 1 to row 2 and marked with color. The row number corresponds to row_num_all field.
  4. A row operates only in bounds of its window.

1. FIRST_VALUE

To get first date, we can use handy FIRST_VALUE window function. As you can see, we use default frame here. This means that for each row the window will be between the start of window and current row. For getting first date this is just what we need. Each row will fetch the value from first row. The first date is in "first_date" field.

2. LAST_VALUE - incorrect frame

Now we need to calculate last date. The last date is in the last row of partition, so we can use LAST_VALUE window function. As I mentioned earlier, if we don't mention frame, the default frame is used. As you can see on figure, the frame always ends in the current row - this is incorrect, because we need the date from last window row. The last_date field shows us incorrect results - it reflects the date from current row.

3. LAST_VALUE - correct frame

To fix the situation with fetching last date, we need to change the frame upon which LAST_VALUE will operate on: ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. As you can see, the window for each row now is between current row and the end of partition. In this case LAST_VALUE will correctly fetch the date from last row of window. Now the result in last_date field is correct.

这篇关于离岛与缺口问题的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆