SQL query to answer: If <event 1> occurs at time point A, does <event 2> occur in time period B-C?


Question

I'm querying a large data set to figure out whether a set of campaign events (event 1, event 2, ...) at different time points results in user activity (active or inactive) during the 3 days following each event, excluding the day of the campaign event itself.

I'm joining two tables to do this. Merged, they look like this:

| date       | user | events | day_activity  | 
| 2020-01-01 | 1    | event1 | active        | 
| 2020-01-01 | 2    | event1 | inactive      | 
| 2020-01-02 | 1    | null   | inactive      |  
| 2020-01-02 | 2    | null   | active        | 
| 2020-01-03 | 1    | null   | inactive      | 
| 2020-01-03 | 2    | null   | active        | 
| 2020-01-04 | 1    | null   | active        | 
| 2020-01-04 | 2    | null   | active        | 

What I am trying to achieve is, for each user/date/event combination (= row) where an event occurred, to add another column called 3_day_activity. It should contain the activity not on the day of the event (= current row) but on the following 3 days only, with a score of 1 per active day. Below is an example of how the table should look for the events on the first day (I mark with * the activity days counted in the added column for user 1, and with # the activity days counted for user 2):

| date       | user | events | day_activity   | 3_day_activity  |
| 2020-01-01 | 1    | event1 | active         | 1               |
| 2020-01-01 | 2    | event1 | inactive       | 3               |
| 2020-01-02 | 1    | null   | inactive * (0) | null (no event) |
| 2020-01-02 | 2    | null   | active # (1)   | null (no event) |
| 2020-01-03 | 1    | null   | inactive * (0) | null (no event) |
| 2020-01-03 | 2    | null   | active # (1)   | null (no event) |
| 2020-01-04 | 1    | null   | active * (1)   | null (no event) |
| 2020-01-04 | 2    | null   | active # (1)   | null (no event) |

I tried solving this with a window function. It runs, but I think I have misunderstood something important about how to design it, because the result contains a ton of repetitions:

SELECT
  cm.date, 
  cm.user,
  event,
  day_activity,
  COUNTIF(active_today = 'active') OVER 3d_later AS 3_day_activity
FROM `customer_message` cm
INNER JOIN `customer_day` ud
  ON cm.user = ud.user
  AND cm.date = ud.date
WHERE 
  cm.date > '2019-12-25'
WINDOW 3d_later AS (PARTITION BY user ORDER BY UNIX_DATE(cm.date) RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING)




EDIT:

I was asked to supply an example of what this repetition looks like. Here is what I see if I add an "ORDER BY 3_day_activity" clause at the end of the query:

Row    date         user  day_activity  3_day_activity  
1      2020-01-01   2     active        243
2      2020-01-01   2     active        243
3      2020-01-01   2     active        243
4      2020-01-01   2     active        243
5      2020-01-01   2     active        243
6      2020-01-01   2     active        243
7      2020-01-02   2     active        243
8      2020-01-02   2     active        243




EDIT2:

This remains unsolved. I tried asking another question, as one commenter suggested, but I am blocked from doing so even though the problem is not identical (I suppose because of its similarity to this one). I have tested grouping by user and date, but the query then throws an error because the COUNTIF expression is not aggregated.

Here is the attempt mentioned above: "SQL: error when counting, asks for aggregation, grouping and windowing"

Recommended answer

The example below is for BigQuery Standard SQL:

#standardSQL
SELECT *, IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
FROM `project.dataset.table`
WINDOW three_day_activity_window AS (
  PARTITION BY user 
  ORDER BY UNIX_DATE(date) 
  RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)

You can test and play with the above using the sample data from your question, as in the example below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT DATE '2020-01-01' date , 1 user, 'event1' events, 'active' day_activity UNION ALL
  SELECT '2020-01-01', 2, 'event1', 'inactive' UNION ALL
  SELECT '2020-01-02', 1, NULL, 'inactive' UNION ALL
  SELECT '2020-01-02', 2, NULL, 'active' UNION ALL
  SELECT '2020-01-03', 1, NULL, 'inactive' UNION ALL
  SELECT '2020-01-03', 2, NULL, 'active' UNION ALL
  SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
  SELECT '2020-01-04', 2, NULL, 'active' 
)
SELECT *, IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
FROM `project.dataset.table`
WINDOW three_day_activity_window AS (
  PARTITION BY user 
  ORDER BY UNIX_DATE(date) 
  RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)
ORDER BY date, user   

with output

Row date        user    events  day_activity    three_day_activity   
1   2020-01-01  1       event1  active          1    
2   2020-01-01  2       event1  inactive        3    
3   2020-01-02  1       null    inactive        0    
4   2020-01-02  2       null    active          0    
5   2020-01-03  1       null    inactive        0    
6   2020-01-03  2       null    active          0    
7   2020-01-04  1       null    active          0    
8   2020-01-04  2       null    active          0       
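
As a cross-check outside BigQuery, the window frame can be emulated in plain Python. This is only a sketch under assumptions: the function name `three_day_activity` and the tuple layout are made up for illustration, and the loop mirrors the intent of `RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING` over `UNIX_DATE(date)` (rows of the same user whose date is 1 to 3 days later) rather than how BigQuery actually evaluates it.

```python
from datetime import date

# Sample rows from the question: (date, user, events, day_activity)
rows = [
    (date(2020, 1, 1), 1, "event1", "active"),
    (date(2020, 1, 1), 2, "event1", "inactive"),
    (date(2020, 1, 2), 1, None, "inactive"),
    (date(2020, 1, 2), 2, None, "active"),
    (date(2020, 1, 3), 1, None, "inactive"),
    (date(2020, 1, 3), 2, None, "active"),
    (date(2020, 1, 4), 1, None, "active"),
    (date(2020, 1, 4), 2, None, "active"),
]


def three_day_activity(rows):
    """Hypothetical helper: for each row, count 'active' rows of the same
    user dated 1 to 3 days later, or return 0 when the row has no event."""
    result = []
    for day, user, events, _ in rows:
        if events is None:
            result.append(0)  # mirrors IF(events IS NULL, 0, ...)
            continue
        count = sum(
            1
            for other_day, other_user, _, activity in rows
            if other_user == user                # PARTITION BY user
            and 1 <= (other_day - day).days <= 3  # 1 FOLLOWING .. 3 FOLLOWING
            and activity == "active"             # COUNTIF condition
        )
        result.append(count)
    return result


print(three_day_activity(rows))  # [1, 3, 0, 0, 0, 0, 0, 0]
```

The printed list matches the three_day_activity column above, row by row. Because the frame is defined on the date value rather than on row positions, every row that falls on one of the following three days is counted, including duplicates.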



Update: how to avoid registering the same user as active multiple times in one day (and tallying those up to a huge sum)?

If you want to avoid counting a user's activity more than once per day, use the adjusted version below (note the extra entry in the sample data, which introduces multiple activity rows for one user on the same day):

#standardSQL
WITH `project.dataset.table` AS (
  SELECT DATE '2020-01-01' date , 1 user, 'event1' events, 'active' day_activity UNION ALL
  SELECT '2020-01-01', 2, 'event1', 'inactive' UNION ALL
  SELECT '2020-01-02', 1, NULL, 'inactive' UNION ALL
  SELECT '2020-01-02', 2, NULL, 'active' UNION ALL
  SELECT '2020-01-03', 1, NULL, 'inactive' UNION ALL
  SELECT '2020-01-03', 2, NULL, 'active' UNION ALL
  SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
  SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
  SELECT '2020-01-04', 2, NULL, 'active' 
)
SELECT *, 
  IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
FROM (
  SELECT date, user, MAX(events) events, MIN(day_activity) day_activity
  FROM `project.dataset.table` 
  GROUP BY date, user
)
WINDOW three_day_activity_window AS (
  PARTITION BY user 
  ORDER BY UNIX_DATE(date) 
  RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)
ORDER BY date, user
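
The effect of the inner GROUP BY can also be illustrated in plain Python. This is a rough sketch, not the answer's implementation: the helper names `count_following_active` and `one_row_per_user_day` are hypothetical, and the grouping only imitates `MAX(events)` and `MIN(day_activity)` for the two activity values used here.

```python
from datetime import date

# Sample rows including the duplicate entry for user 1 on 2020-01-04
rows = [
    (date(2020, 1, 1), 1, "event1", "active"),
    (date(2020, 1, 1), 2, "event1", "inactive"),
    (date(2020, 1, 2), 1, None, "inactive"),
    (date(2020, 1, 2), 2, None, "active"),
    (date(2020, 1, 3), 1, None, "inactive"),
    (date(2020, 1, 3), 2, None, "active"),
    (date(2020, 1, 4), 1, None, "active"),
    (date(2020, 1, 4), 1, None, "active"),
    (date(2020, 1, 4), 2, None, "active"),
]


def count_following_active(rows, user, day):
    """Count 'active' rows of `user` dated 1 to 3 days after `day`."""
    return sum(
        1
        for d, u, _, activity in rows
        if u == user and 1 <= (d - day).days <= 3 and activity == "active"
    )


def one_row_per_user_day(rows):
    """Collapse to one row per (date, user), imitating
    GROUP BY date, user with MAX(events) and MIN(day_activity)."""
    grouped = {}
    for d, u, events, activity in rows:
        prev_events, prev_activity = grouped.get((d, u), (None, None))
        # MAX ignores NULLs, so any non-null event wins over None.
        if prev_events is not None and events is not None:
            events = max(prev_events, events)
        elif events is None:
            events = prev_events
        # 'active' sorts before 'inactive', so MIN marks the day as active
        # if at least one of its rows is active.
        if prev_activity is not None:
            activity = min(prev_activity, activity)
        grouped[(d, u)] = (events, activity)
    ordered = sorted(grouped.items(), key=lambda item: item[0])
    return [(d, u, e, a) for (d, u), (e, a) in ordered]


raw_count = count_following_active(rows, 1, date(2020, 1, 1))
deduped_count = count_following_active(one_row_per_user_day(rows), 1, date(2020, 1, 1))
print(raw_count, deduped_count)  # 2 1
```

With the duplicate row left in, user 1's event on 2020-01-01 is credited with 2 active days. After collapsing to one row per user and day, the count drops back to 1. This is the same kind of inflation the question shows with values such as 243.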
