具有可变滞后长度的困难 [英] Difficulty with variable lag length

查看:30
本文介绍了具有可变滞后长度的困难的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

对这个问题的措辞表示歉意,但我不知道如何描述这个问题.

我有俱乐部会员数据,如下

<预><代码>|身份证 |开始 |结束 |取消 ||--- |------------------- |------------------- |------------------- ||1 |2020-01-01 00:00:00 |2020-12-31 00:00:00 |2021-01-10 00:00:00 ||1 |2021-02-01 00:00:00 |2022-01-31 00:00:00 |||2 |2021-01-01 00:00:00 |2020-12-31 00:00:00 |||3 |2020-01-01 00:00:00 |2020-06-30 00:00:00 |2020-07-01 00:00:00 ||3 |2020-07-10 00:00:00 |2021-01-09 00:00:00 |2021-01-31 00:00:00 ||3 |2021-02-02 00:00:00 |2021-08-01 00:00:00 |||4 |2018-04-28 00:00:00 |2019-04-28 00:00:00 |2020-01-03 00:00:00 ||4 |2019-03-07 00:00:00 |2021-08-01 00:00:00 |||4 |2020-02-22 00:00:00 |2021-02-22 00:00:00 ||

兴趣集中在取消会员资格和取消新会员资格之间的时间.我有以下代码,这是在 Gordon Linoff 对我的上一个问题的回答中提供的

选择t.*,datediff(start, prev_cancelled) 为 num_days_since_cancel从(选择 dt.*,滞后(取消)超过(按 id 按开始顺序分区)作为 prev_cancelled从 dt) t其中 prev_cancelled 不为空;

产生以下结果:

<预><代码>|身份证 |开始 |结束 |取消 |prev_canceled |num_days_since_cancel ||--- |------------------- |------------------- |------------------- |------------------- |--------------------- ||1 |2021-02-01 00:00:00 |2022-01-31 00:00:00 ||2021-01-10 00:00:00 |22 ||3 |2020-07-10 00:00:00 |2021-01-09 00:00:00 |2021-01-31 00:00:00 |2020-07-01 00:00:00 |9 ||3 |2021-02-02 00:00:00 |2021-08-01 00:00:00 ||2021-01-31 00:00:00 |2 ||4 |2019-03-07 00:00:00 |2021-08-01 00:00:00 ||2020-01-03 00:00:00 |-302 |

这是除 id 4 之外的所有所需的输出,它应该返回 50 天,而不是 -302.我明白为什么这样做,因为滞后 1,所以在 id 4 的情况下,我们需要滞后 2,因为在取消第一个会员资格之前,第二个会员资格尚未结束,但是我找不到方法去做这个.此外,虽然这是一个边缘情况,但在第一个被取消之前,没有理由不应该有 2 个或更多后续未过期的会员资格.

也许更好的方法是以某种方式过滤掉在第一个成员资格之后开始并在第一个成员资格被取消之后到期的任何未到期的成员资格.或者也许有更好的方法?玩具数据:

如果不存在则创建表`dt`(`id` int(6) 无符号非空,`开始`时间戳,`结束`时间戳,已取消"时间戳 NULL) 默认字符集=utf8;插入`dt`(`id`、`start`、`end`、`cancelled`)值('1', '2020-01-01', '2020-12-31', '2021-01-10'),('1', '2021-02-01', '2022-01-31', NULL),('2', '2021-01-01', '2020-12-31', NULL),('3', '2020-01-01', '2020-06-30', '2020-07-01'),('3', '2020-07-10', '2021-01-09', '2021-01-31'),('3', '2021-02-02', '2021-08-01', NULL),('4', '2018-04-28', '2019-04-28', '2020-01-03'),('4', '2019-03-07', '2021-08-01', NULL),('4', '2020-02-22', '2021-02-22', NULL )

解决方案

不是与前一行中的取消日期进行比较,而是需要与所有前一行中的最新取消日期进行比较.标准 SQL 有一个 IGNORE NULLS 选项来实现这一点,但 MySQL 不支持它.幸运的是,在您的情况下,它可以使用累积最大值重写:

选择t.*,datediff(start, prev_cancelled) 为 num_days_since_cancel从(选择 dt.*,max(cancelled) over -- 每个 id 的最新日期(按 id 分区按开始顺序无界前面和前面 1 个之间的行)作为 prev_cancelled从 dt) t-- 删除负持续时间其中 datediff(start, prev_cancelled) >= 0;

参见 fiddle

Apologies for the wording of this question, but I wasn't sure how to describe this problem.

I have data on club memberships, as follows


| id  | start               | end                 | cancelled           |
| --- | ------------------- | ------------------- | ------------------- |
| 1   | 2020-01-01 00:00:00 | 2020-12-31 00:00:00 | 2021-01-10 00:00:00 |
| 1   | 2021-02-01 00:00:00 | 2022-01-31 00:00:00 |                     |
| 2   | 2021-01-01 00:00:00 | 2020-12-31 00:00:00 |                     |
| 3   | 2020-01-01 00:00:00 | 2020-06-30 00:00:00 | 2020-07-01 00:00:00 |
| 3   | 2020-07-10 00:00:00 | 2021-01-09 00:00:00 | 2021-01-31 00:00:00 |
| 3   | 2021-02-02 00:00:00 | 2021-08-01 00:00:00 |                     |
| 4   | 2018-04-28 00:00:00 | 2019-04-28 00:00:00 | 2020-01-03 00:00:00 |
| 4   | 2019-03-07 00:00:00 | 2021-08-01 00:00:00 |                     |
| 4   | 2020-02-22 00:00:00 | 2021-02-22 00:00:00 |                     |

Interest is centered on the time between a membership being cancelled and a new membership being taken out. I have the following code, which was provided in Gordon Linoff's answer to my previous question

select t.*,
       datediff(start, prev_cancelled) as num_days_since_cancel
from (select dt.*,
             lag(cancelled) over (partition by id order by start) as prev_cancelled
      from dt
     ) t
where prev_cancelled is not null;

which produces the following:


| id  | start               | end                 | cancelled           | prev_cancelled      | num_days_since_cancel |
| --- | ------------------- | ------------------- | ------------------- | ------------------- | --------------------- |
| 1   | 2021-02-01 00:00:00 | 2022-01-31 00:00:00 |                     | 2021-01-10 00:00:00 | 22                    |
| 3   | 2020-07-10 00:00:00 | 2021-01-09 00:00:00 | 2021-01-31 00:00:00 | 2020-07-01 00:00:00 | 9                     |
| 3   | 2021-02-02 00:00:00 | 2021-08-01 00:00:00 |                     | 2021-01-31 00:00:00 | 2                     |
| 4   | 2019-03-07 00:00:00 | 2021-08-01 00:00:00 |                     | 2020-01-03 00:00:00 | -302                  |

This is the desired output for all except id 4, which should return 50 days, not -302. I understand why it does this, because the lag of 1, so in the case of id 4 we would need a lag of 2, because the 2nd membership had not ended before the first one was cancelled, however I can't find a way to do this. Moreover, although this is an edge case, there is no reason why there should not be 2 or more subsequent unexpired memberships prior to the first one being cancelled.

Perhaps a better approach would be to somehow filter out any unexpired memberships that start subsequent to the first membership and expire subsequent to the first membership being cancelled. Or maybe there is a better way ? Toy Data:

CREATE TABLE IF NOT EXISTS `dt` (
  `id` int(6) unsigned NOT NULL,
  `start` TIMESTAMP,
  `end` TIMESTAMP,
  `cancelled` TIMESTAMP NULL
) DEFAULT CHARSET=utf8;
INSERT INTO `dt` (`id`, `start`, `end`, `cancelled`) VALUES
  ('1', '2020-01-01', '2020-12-31', '2021-01-10'),
  ('1', '2021-02-01', '2022-01-31', NULL ),
  ('2', '2021-01-01', '2020-12-31', NULL ),
  ('3', '2020-01-01', '2020-06-30', '2020-07-01'),
  ('3', '2020-07-10', '2021-01-09', '2021-01-31'),
  ('3', '2021-02-02', '2021-08-01', NULL ),
  ('4', '2018-04-28', '2019-04-28', '2020-01-03' ),
  ('4', '2019-03-07', '2021-08-01', NULL ),
  ('4', '2020-02-22', '2021-02-22', NULL )

解决方案

Instead of comparing to the cancel date in the previous row you need to compare to the latest cancel date in all previous rows. Standard SQL has a IGNORE NULLS option to achieve this, but MySQL doesn't support it. Luckily in your case it can be rewritten using a Cumulative Max:

select t.*,
       datediff(start, prev_cancelled) as num_days_since_cancel
from (select dt.*,
             max(cancelled) over -- latest date per id
             (partition by id 
              order by start
              rows between unbounded preceding and 1 preceding) as prev_cancelled
      from dt
     ) t
-- remove negative duration
where datediff(start, prev_cancelled) >= 0;

See fiddle

这篇关于具有可变滞后长度的困难的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆