Cohort analysis in SQL


Question

I'm looking to do some cohort analysis on a user base. We have two tables, "users" and "sessions", and both have a "created_at" field. I'd like to formulate a query that yields a 7 by 7 table of numbers (with some blanks) showing a count of users who were created on a particular day and who also have a session created y = (0..6) days ago, indicating that the user returned on that day.

created_at  d2  d3  d4
today       *   *   *
today-1     49  *   *
today-2     45  30  *
today-3     47  48  18
...

In this case, 47 users who were created on today-3 returned on today-2.

Can I perform this in a single MySQL query? I can perform the queries individually like so, but it'd be really nice to have it all in one query.

SELECT `users`.*
FROM `users`
INNER JOIN `sessions` ON `sessions`.`user_id` = `users`.`id`
WHERE `users`.`os` = 'ios'
  AND (`sessions`.`updated_at` BETWEEN '2013-01-16 08:00:00' AND '2013-01-17 08:00:00')

Recommended answer

This seems like a complex problem. Whether or not it also seems difficult to you, it is never a bad idea to start from a smaller problem and build up.

You could start, for instance, with a query returning all the users (just the users) that registered within the last week, i.e. on or after the day six days ago, as per your requirement:

SELECT *
FROM users
WHERE created_at >= CURDATE() - INTERVAL 6 DAY

The next step could be grouping the results by dates and counting rows in every group:

SELECT
  created_at,
  COUNT(*) AS user_count
FROM users
WHERE created_at >= CURDATE() - INTERVAL 6 DAY
GROUP BY created_at

If created_at is a datetime or timestamp, use DATE(created_at) as the grouping criterion:

SELECT
  DATE(created_at) AS created_at,
  COUNT(*) AS user_count
FROM users
WHERE created_at >= CURDATE() - INTERVAL 6 DAY
GROUP BY DATE(created_at)

However, you don't seem to want absolute dates in the output, but only relative ones, like today, today - 1 day etc. In that case, you could use the DATEDIFF() function, which returns the number of days between two dates, to produce (numeric) offsets from today and group by those values:

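SELECT
  DATEDIFF(CURDATE(), created_at) AS created_at,
  COUNT(*) AS user_count
FROM users
WHERE created_at >= CURDATE() - INTERVAL 6 DAY
-- group by the same expression that is selected, so the query also
-- passes when ONLY_FULL_GROUP_BY is enabled
GROUP BY DATEDIFF(CURDATE(), created_at)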
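
To make the offset arithmetic concrete, here is a minimal Python sketch of the same idea outside the database. It assumes that DATEDIFF compares only the date parts of its arguments; the helper name `day_offset` and the sample timestamps are hypothetical and not taken from the question.

```python
from collections import Counter
from datetime import date, datetime


def day_offset(today, created_at):
    """Whole days between two calendar dates, like DATEDIFF(CURDATE(), created_at)."""
    # .date() drops the time of day, so only the calendar dates are compared.
    return (today - created_at.date()).days


# Hypothetical reference date and sign-up timestamps, for illustration only.
today = date(2013, 1, 17)
created = [
    datetime(2013, 1, 17, 9, 30),   # offset 0 ("today")
    datetime(2013, 1, 16, 8, 0),    # offset 1 ("today-1")
    datetime(2013, 1, 16, 23, 59),  # offset 1 as well, despite the late hour
    datetime(2013, 1, 1, 12, 0),    # offset 16, outside the window
]

# Keep only the last seven days (offsets 0..6) and count users per offset.
offsets = Counter(
    day_offset(today, ts) for ts in created if day_offset(today, ts) <= 6
)
print(dict(offsets))
```

Each key of `offsets` corresponds to one row label in the final table: 0 for today, 1 for today-1, and so on.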

Your created_at column would contain "dates" like 0, 1 and so on till 6. Converting them into today, today-1 etc. is trivial and you will see that in the final query. So far, however, we've reached the point at which we need to take one step back (or, perhaps, it's rather a half step to the right), because we don't really need to count the users but rather their returns. So, the actual working dataset from users that is needed at the moment will be this:

SELECT
  id,
  DATEDIFF(CURDATE(), created_at) AS day_offset
FROM users
WHERE created_at >= CURDATE() - INTERVAL 6 DAY

We need user IDs to join this rowset to (the one that will be derived from) sessions and we need day_offset as the grouping criterion.

Moving on, a similar transformation needs to be performed on the sessions table, and I won't go into details on that. Suffice it to say that the resulting query will be nearly identical to the last one, with just two exceptions:

  • id is replaced with user_id;
  • DISTINCT is applied to the entire subset.

The reason for DISTINCT is to return no more than one row per user & day: it is my understanding that however many sessions a user might have on a particular day, you want to count them as one return. So, here's what gets derived from sessions:

SELECT DISTINCT
  user_id,
  DATEDIFF(CURDATE(), created_at) AS day_offset
FROM sessions
WHERE created_at >= CURDATE() - INTERVAL 6 DAY
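
The effect of DISTINCT can be sketched in Python with a set of (user_id, day_offset) pairs. This is only an illustration under the assumption stated above, that several sessions by one user on one day count as a single return; the function name `distinct_returns` and the sample rows are hypothetical.

```python
from datetime import date, datetime


def distinct_returns(sessions, today, days_back=6):
    """Collect one (user_id, day_offset) pair per user per day in the window."""
    pairs = set()
    for user_id, created_at in sessions:
        offset = (today - created_at.date()).days
        if offset <= days_back:
            # A set ignores repeats, which is what SELECT DISTINCT does here.
            pairs.add((user_id, offset))
    return pairs


# Hypothetical session rows as (user_id, created_at), for illustration only.
today = date(2013, 1, 17)
sessions = [
    (1, datetime(2013, 1, 16, 9, 0)),
    (1, datetime(2013, 1, 16, 18, 0)),  # same user, same day: collapses
    (1, datetime(2013, 1, 17, 7, 0)),
    (2, datetime(2013, 1, 16, 12, 0)),
]
returns = distinct_returns(sessions, today)
print(sorted(returns))
```

Four session rows produce three pairs, because user 1's two sessions on the same day count as one return.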

Now it only remains to join the two derived tables, apply grouping and use conditional aggregation to get the required results:

SELECT
  CONCAT('today', IFNULL(CONCAT('-', NULLIF(u.DayOffset, 0)), '')) AS created_at,
  SUM(s.DayOffset = 0) AS d0,
  SUM(s.DayOffset = 1) AS d1,
  SUM(s.DayOffset = 2) AS d2,
  SUM(s.DayOffset = 3) AS d3,
  SUM(s.DayOffset = 4) AS d4,
  SUM(s.DayOffset = 5) AS d5,
  SUM(s.DayOffset = 6) AS d6
FROM (
  SELECT
    id,
    DATEDIFF(CURDATE(), created_at) AS DayOffset
  FROM users
  WHERE created_at >= CURDATE() - INTERVAL 6 DAY
) u
LEFT JOIN (
  SELECT DISTINCT
    user_id,
    DATEDIFF(CURDATE(), created_at) AS DayOffset
  FROM sessions
  WHERE created_at >= CURDATE() - INTERVAL 6 DAY
) s
ON u.id = s.user_id
GROUP BY u.DayOffset
;
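
Since the query above is untested, one way to try the join and conditional aggregation on a small dataset is an in-memory SQLite database driven from Python. This is a sketch under several assumptions rather than the MySQL query itself: SQLite has no DATEDIFF() or CURDATE(), so the offset is computed with julianday() against a fixed, hypothetical reference date; only the columns d0 to d2 are shown; and the table contents are invented for the example.

```python
import sqlite3

TODAY = "2013-01-17"  # hypothetical fixed reference date, stands in for CURDATE()

# SQLite stand-in for DATEDIFF(CURDATE(), created_at): subtract Julian day
# numbers of the two calendar dates.
OFFSET = "CAST(julianday(:today) - julianday(date(created_at)) AS INTEGER)"

QUERY = f"""
SELECT u.day_offset,
       SUM(s.day_offset = 0) AS d0,
       SUM(s.day_offset = 1) AS d1,
       SUM(s.day_offset = 2) AS d2
FROM (
  SELECT id, {OFFSET} AS day_offset
  FROM users
  WHERE date(created_at) >= date(:today, '-6 days')
) u
LEFT JOIN (
  SELECT DISTINCT user_id, {OFFSET} AS day_offset
  FROM sessions
  WHERE date(created_at) >= date(:today, '-6 days')
) s ON u.id = s.user_id
GROUP BY u.day_offset
ORDER BY u.day_offset
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.execute("CREATE TABLE sessions (user_id INTEGER, created_at TEXT)")

# Invented sample rows, not taken from the question.
conn.executemany("INSERT INTO users VALUES (?, ?)", [
    (1, "2013-01-15 10:00:00"),
    (2, "2013-01-15 11:00:00"),
    (3, "2013-01-16 09:00:00"),
    (4, "2013-01-17 08:00:00"),  # signed up today, no sessions at all
])
conn.executemany("INSERT INTO sessions VALUES (?, ?)", [
    (1, "2013-01-15 10:05:00"),
    (1, "2013-01-16 12:00:00"),
    (1, "2013-01-16 20:00:00"),  # second session on the same day
    (1, "2013-01-17 09:00:00"),
    (2, "2013-01-15 11:30:00"),
    (3, "2013-01-16 09:10:00"),
    (3, "2013-01-17 10:00:00"),
])

rows = conn.execute(QUERY, {"today": TODAY}).fetchall()
for row in rows:
    print(row)
```

In this sample, the cohort with no sessions at all comes back with NULL (None in Python) in every d column, because SUM over only NULL inputs yields NULL. Wrapping each SUM in COALESCE(..., 0) would turn those cells into zeros if that reads better in a report.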

I must admit that I haven't tested or debugged this, but if needed I'll be happy to work through it with data samples once you provide them. :)
