SQL query to answer: If <event 1> occurs at timepoint A, does <event 2> occur in time period B-C?
Problem description
I'm querying a large data set to figure out whether a number of campaign events (i.e. event 1, 2, ...) at different timepoints result in user activity (active, inactive) during the 3 days following each event (but not on the same day as the campaign event itself).
I'm merging two tables to do this, and merged they look like this:
| date | user | events | day_activity |
| 2020-01-01 | 1 | event1 | active |
| 2020-01-01 | 2 | event1 | inactive |
| 2020-01-02 | 1 | null | inactive |
| 2020-01-02 | 2 | null | active |
| 2020-01-03 | 1 | null | inactive |
| 2020-01-03 | 2 | null | active |
| 2020-01-04 | 1 | null | active |
| 2020-01-04 | 2 | null | active |
What I am trying to achieve is, for each user/date/event combination (= row) where an event occurred, to add another column called 3_day_activity, containing the activity not on the event day (= current row) but on the following 3 days only (giving a score of 1 per active day). Here is an example of how the table would look afterwards for the events on the first day (I add * to the activity days counted in the added column for user 1, and # to the activity days counted in the column for user 2):
| date | user | events | day_activity | 3_day_activity |
| 2020-01-01 | 1 | event1 | active | 1 |
| 2020-01-01 | 2 | event1 | inactive | 3 |
| 2020-01-02 | 1 | null | inactive * (0) | null (because no event) |
| 2020-01-02 | 2 | null | active # (1) | null (because no event) |
| 2020-01-03 | 1 | null | inactive * (0) | null (because no event) |
| 2020-01-03 | 2 | null | active # (1) | null (because no event) |
| 2020-01-04 | 1 | null | active * (1) | null (because no event) |
| 2020-01-04 | 2 | null | active # (1) | null (because no event) |
I tried solving this with a window function. It runs, but I think I misunderstood something important about how to design it, because the result contains a ton of repetitions:
SELECT
  cm.date,
  cm.user,
  event,
  day_activity,
  COUNTIF(active_today = 'active') OVER 3d_later AS 3_day_activity
FROM `customer_message` cm
INNER JOIN `customer_day` ud
  ON cm.user = ud.user
  AND cm.date = ud.date
WHERE
  cm.date > '2019-12-25'
WINDOW 3d_later AS (PARTITION BY user ORDER BY UNIX_DATE(cm.date) RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING)
Edit:
I was asked to supply an example of what this repetition looks like. Here's what I see if I add an "ORDER BY 3_day_activity" clause at the end of the query:
Row date user day_activity 3_day_activity
1 2020-01-01 2 active 243
2 2020-01-01 2 active 243
3 2020-01-01 2 active 243
4 2020-01-01 2 active 243
5 2020-01-01 2 active 243
6 2020-01-01 2 active 243
7 2020-01-02 2 active 243
8 2020-01-02 2 active 243
Edit 2:
This remains unsolved. I have tried asking another question, as one commenter suggested, but I am blocked from doing so even though the problem is not identical (I suppose because of its similarity to this one). I have tested grouping by user and date, but the query then throws an error because of the missing aggregation in the COUNTIF clause.
Here is the attempt mentioned above: "SQL: error when counting, asking for aggregation, grouping and windowing"
Recommended answer
The example below is for BigQuery Standard SQL:
#standardSQL
SELECT *, IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
FROM `project.dataset.table`
WINDOW three_day_activity_window AS (
PARTITION BY user
ORDER BY UNIX_DATE(date)
RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)
You can test and experiment with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT DATE '2020-01-01' date , 1 user, 'event1' events, 'active' day_activity UNION ALL
SELECT '2020-01-01', 2, 'event1', 'inactive' UNION ALL
SELECT '2020-01-02', 1, NULL, 'inactive' UNION ALL
SELECT '2020-01-02', 2, NULL, 'active' UNION ALL
SELECT '2020-01-03', 1, NULL, 'inactive' UNION ALL
SELECT '2020-01-03', 2, NULL, 'active' UNION ALL
SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
SELECT '2020-01-04', 2, NULL, 'active'
)
SELECT *, IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
FROM `project.dataset.table`
WINDOW three_day_activity_window AS (
PARTITION BY user
ORDER BY UNIX_DATE(date)
RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)
ORDER BY date, user
with output:
Row date user events day_activity three_day_activity
1 2020-01-01 1 event1 active 1
2 2020-01-01 2 event1 inactive 3
3 2020-01-02 1 null inactive 0
4 2020-01-02 2 null active 0
5 2020-01-03 1 null inactive 0
6 2020-01-03 2 null active 0
7 2020-01-04 1 null active 0
8 2020-01-04 2 null active 0
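To make the window frame easier to reason about, here is a minimal Python sketch that mirrors what `RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING` does per user on the sample data. It is only an illustration of the semantics, not a replacement for the query; the names `ROWS` and `three_day_activity` are made up for this example.

```python
from datetime import date, timedelta

# Sample rows from the question: (date, user, events, day_activity)
ROWS = [
    (date(2020, 1, 1), 1, "event1", "active"),
    (date(2020, 1, 1), 2, "event1", "inactive"),
    (date(2020, 1, 2), 1, None, "inactive"),
    (date(2020, 1, 2), 2, None, "active"),
    (date(2020, 1, 3), 1, None, "inactive"),
    (date(2020, 1, 3), 2, None, "active"),
    (date(2020, 1, 4), 1, None, "active"),
    (date(2020, 1, 4), 2, None, "active"),
]


def three_day_activity(rows):
    """Emulate IF(events IS NULL, 0, COUNTIF(...) OVER (...)) row by row."""
    result = []
    for day, user, events, _ in rows:
        if events is None:
            # Rows without an event get 0, as in the IF(...) of the query.
            result.append(0)
            continue
        # Frame: same user, dates from day+1 up to and including day+3.
        first, last = day + timedelta(days=1), day + timedelta(days=3)
        result.append(sum(
            1
            for d, u, _, activity in rows
            if u == user and first <= d <= last and activity == "active"
        ))
    return result


print(three_day_activity(ROWS))
```

The event day itself is excluded because the frame starts at 1 FOLLOWING, which is why user 1 scores 1 even though the event day is active.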
Update: how to avoid registering the same user as active multiple times in one day (and tallying those up to a huge sum)?
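If you want to avoid counting several activity rows for the same user on the same day, use the adjusted version below (note the extra entry in the sample data, which introduces multiple activity rows for one user on the same day). The subquery collapses each date/user pair to a single row; MIN(day_activity) yields 'active' whenever any of that day's rows is active, because 'active' sorts before 'inactive'.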
#standardSQL
WITH `project.dataset.table` AS (
SELECT DATE '2020-01-01' DATE , 1 user, 'event1' events, 'active' day_activity UNION ALL
SELECT '2020-01-01', 2, 'event1', 'inactive' UNION ALL
SELECT '2020-01-02', 1, NULL, 'inactive' UNION ALL
SELECT '2020-01-02', 2, NULL, 'active' UNION ALL
SELECT '2020-01-03', 1, NULL, 'inactive' UNION ALL
SELECT '2020-01-03', 2, NULL, 'active' UNION ALL
SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
SELECT '2020-01-04', 2, NULL, 'active'
)
SELECT *,
IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
FROM (
SELECT date, user, MAX(events) events, MIN(day_activity) day_activity
FROM `project.dataset.table`
GROUP BY date, user
)
WINDOW three_day_activity_window AS (
PARTITION BY user
ORDER BY UNIX_DATE(date)
RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)
ORDER BY date, user
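The effect of the `GROUP BY date, user` subquery can be checked with a small Python sketch. It assumes the same sample data, including the duplicated row for user 1 on 2020-01-04, and compares the count with and without collapsing the duplicates. The helper names `dedupe` and `count_following` are hypothetical and exist only for this illustration.

```python
from datetime import date, timedelta

# Sample rows with one duplicated (date, user) pair: user 1 on 2020-01-04.
ROWS = [
    (date(2020, 1, 1), 1, "event1", "active"),
    (date(2020, 1, 1), 2, "event1", "inactive"),
    (date(2020, 1, 2), 1, None, "inactive"),
    (date(2020, 1, 2), 2, None, "active"),
    (date(2020, 1, 3), 1, None, "inactive"),
    (date(2020, 1, 3), 2, None, "active"),
    (date(2020, 1, 4), 1, None, "active"),
    (date(2020, 1, 4), 1, None, "active"),
    (date(2020, 1, 4), 2, None, "active"),
]


def dedupe(rows):
    """Collapse rows to one per (date, user), like the GROUP BY subquery."""
    grouped = {}
    for day, user, events, activity in rows:
        prev_events, prev_activity = grouped.get((day, user), (None, None))
        # MAX(events): SQL aggregates skip NULLs, so keep any non-NULL value.
        candidates = [e for e in (prev_events, events) if e is not None]
        merged_events = max(candidates) if candidates else None
        # MIN(day_activity): 'active' sorts before 'inactive'.
        merged_activity = (
            activity if prev_activity is None else min(prev_activity, activity)
        )
        grouped[(day, user)] = (merged_events, merged_activity)
    return [(d, u, e, a) for (d, u), (e, a) in sorted(grouped.items())]


def count_following(rows, day, user):
    """Count active rows for `user` in the 3 days after `day`."""
    first, last = day + timedelta(days=1), day + timedelta(days=3)
    return sum(
        1
        for d, u, _, activity in rows
        if u == user and first <= d <= last and activity == "active"
    )


raw_count = count_following(ROWS, date(2020, 1, 1), 1)
clean_count = count_following(dedupe(ROWS), date(2020, 1, 1), 1)
print(raw_count, clean_count)
```

Without the grouping step the duplicated row is counted twice. A join that produces many rows per user and day would presumably inflate the totals in the same way, which may explain the large values seen in the question.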