Select distinct users group by time range
Question
I have a table with the following info
|date | user_id | week_beg | month_beg|
SQL to create table with test values:
CREATE TABLE uniques
(
date DATE,
user_id INT,
week_beg DATE,
month_beg DATE
);
INSERT INTO uniques VALUES ('2013-01-01', 1, '2012-12-30', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-03', 3, '2012-12-30', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-06', 4, '2013-01-06', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-07', 4, '2013-01-06', '2013-01-01');
Input table:
| date | user_id | week_beg | month_beg |
| 2013-01-01 | 1 | 2012-12-30 | 2013-01-01 |
| 2013-01-03 | 3 | 2012-12-30 | 2013-01-01 |
| 2013-01-06 | 4 | 2013-01-06 | 2013-01-01 |
| 2013-01-07 | 4 | 2013-01-06 | 2013-01-01 |
Output table:
| date | time_series | cnt |
| 2013-01-01 | D | 1 |
| 2013-01-01 | W | 1 |
| 2013-01-01 | M | 1 |
| 2013-01-03 | D | 1 |
| 2013-01-03 | W | 2 |
| 2013-01-03 | M | 2 |
| 2013-01-06 | D | 1 |
| 2013-01-06 | W | 1 |
| 2013-01-06 | M | 3 |
| 2013-01-07 | D | 1 |
| 2013-01-07 | W | 1 |
| 2013-01-07 | M | 3 |
I want to calculate the number of distinct user_id's for a date:
1. For that date
2. For that week up to that date (week to date)
3. For that month up to that date (month to date)
1 is easy to calculate. For 2 and 3 I am trying to use such queries:
SELECT
date,
'W' AS "time_series",
(COUNT DISTINCT user_id) COUNT (user_id) OVER (PARTITION BY week_beg) AS "cnt"
FROM user_subtitles
SELECT
date,
'M' AS "time_series",
(COUNT DISTINCT user_id) COUNT (user_id) OVER (PARTITION BY month_beg) AS "cnt"
FROM user_subtitles
Postgres does not allow window functions for DISTINCT calculation, so this approach does not work.
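For reference, a minimal reproduction of that limitation, assuming the uniques test table created above. PostgreSQL is expected to reject the statement, typically with the message "DISTINCT is not implemented for window functions":

```sql
-- Assumes the uniques test table from the question.
-- This statement is expected to fail: PostgreSQL does not implement
-- DISTINCT inside a window function call.
SELECT date
     , count(DISTINCT user_id) OVER (PARTITION BY week_beg) AS cnt
FROM   uniques;
```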
I have also tried out a GROUP BY approach, but it does not work as it gives me numbers for whole week/months.
What's the best way to approach this problem?
Recommended answer
Count all rows
SELECT date, '1_D' AS time_series, count(DISTINCT user_id) AS cnt
FROM uniques
GROUP BY 1
UNION ALL
SELECT DISTINCT ON (1)
date, '2_W', count(*) OVER (PARTITION BY week_beg ORDER BY date)
FROM uniques
UNION ALL
SELECT DISTINCT ON (1)
date, '3_M', count(*) OVER (PARTITION BY month_beg ORDER BY date)
FROM uniques
ORDER BY 1, time_series
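Both running counts in the query above rely on the frame that a window with ORDER BY uses when none is spelled out. As a rough sketch, assuming the uniques test table from the question (the column aliases default_frame, range_frame and rows_frame are made up for illustration), the following puts the implicit frame next to two explicit ones:

```sql
-- Assumes the uniques test table from the question.
SELECT date, user_id
       -- no frame given: the default applies
     , count(*) OVER (PARTITION BY month_beg ORDER BY date) AS default_frame
       -- the default, written out explicitly
     , count(*) OVER (PARTITION BY month_beg ORDER BY date
                      RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_frame
       -- row-based frame: stops at the current row, ignoring peers
     , count(*) OVER (PARTITION BY month_beg ORDER BY date
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_frame
FROM   uniques
ORDER  BY date, user_id;
```

With one row per date, as in the test data, all three columns agree. They can only differ when several rows share a date: RANGE then counts every row of that date (the peers), while ROWS stops at the current row.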
- Your columns week_beg and month_beg are 100 % redundant and can easily be replaced by date_trunc('week', date + 1)::date - 1 and date_trunc('month', date)::date respectively. The cast to date is needed before subtracting an integer.
- Your week seems to start on Sunday (off by one), therefore the + 1 .. - 1.
- The default frame of a window function with ORDER BY in the OVER clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. That's exactly what you need.
- Use UNION ALL, not UNION.
- Your unfortunate choice for time_series (D, W, M) does not sort well, so I renamed the values to make the final ORDER BY easier.
- This query can deal with multiple rows per day. Counts include all peers for a day.
- More about DISTINCT ON: Select first row in each GROUP BY group?
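To illustrate the point about the redundant columns, here is a small sketch that derives both period starts for the four test dates from the question. The inline VALUES list and the alias t(d) are only there for illustration:

```sql
-- Derive week start (Sunday) and month start from the date alone.
SELECT d AS day
     , date_trunc('week', d + 1)::date - 1 AS week_beg   -- shift by one day so weeks start on Sunday
     , date_trunc('month', d)::date        AS month_beg
FROM  (VALUES (DATE '2013-01-01')
            , (DATE '2013-01-03')
            , (DATE '2013-01-06')
            , (DATE '2013-01-07')) AS t(d);
-- Expected week_beg:  2012-12-30, 2012-12-30, 2013-01-06, 2013-01-06
-- Expected month_beg: 2013-01-01 for all four rows
```

These values match the week_beg and month_beg columns stored in the test table.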
To count every user only once per day, use a CTE with DISTINCT ON:
WITH x AS (SELECT DISTINCT ON (1,2) date, user_id FROM uniques)
SELECT date, '1_D' AS time_series, count(user_id) AS cnt
FROM x
GROUP BY 1
UNION ALL
SELECT DISTINCT ON (1)
date, '2_W'
,count(*) OVER (PARTITION BY (date_trunc('week', date + 1)::date - 1) ORDER BY date)
FROM x
UNION ALL
SELECT DISTINCT ON (1)
date, '3_M'
,count(*) OVER (PARTITION BY date_trunc('month', date) ORDER BY date)
FROM x
ORDER BY 1, 2
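As a small illustration of what the DISTINCT ON (1,2) step removes, the sketch below runs it on a made-up three-row sample (the CTE name sample and its rows are hypothetical, not part of the test data):

```sql
-- Hypothetical rows: user 4 appears twice on the same day.
WITH sample(date, user_id) AS (
   VALUES (DATE '2013-01-07', 4)
        , (DATE '2013-01-07', 4)   -- duplicate of the row above
        , (DATE '2013-01-07', 5)
   )
SELECT DISTINCT ON (1, 2) date, user_id
FROM   sample
ORDER  BY 1, 2;
-- Returns two rows: (2013-01-07, 4) and (2013-01-07, 5)
```

Because only date and user_id are selected, a plain SELECT DISTINCT date, user_id would return the same rows here.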
DISTINCT users over dynamic period of time
You can always resort to correlated subqueries. They tend to be slow with big tables!
Building on the previous queries:
WITH du AS (SELECT date, user_id FROM uniques GROUP BY 1,2)
,d AS (
SELECT date
,(date_trunc('week', date + 1)::date - 1) AS week_beg
,date_trunc('month', date)::date AS month_beg
FROM uniques
GROUP BY 1
)
SELECT date, '1_D' AS time_series, count(user_id) AS cnt
FROM du
GROUP BY 1
UNION ALL
SELECT date, '2_W', (SELECT count(DISTINCT user_id) FROM du
WHERE du.date BETWEEN d.week_beg AND d.date)
FROM d
GROUP BY date, week_beg
UNION ALL
SELECT date, '3_M', (SELECT count(DISTINCT user_id) FROM du
WHERE du.date BETWEEN d.month_beg AND d.date)
FROM d
GROUP BY date, month_beg
ORDER BY 1,2;
@Clodoaldo came up with a major improvement: use the window function dense_rank(). Here is another idea for an optimized version. It should be even faster to exclude daily duplicates right away. The performance gain grows with the number of rows per day.
This builds on a simplified and sanitized data model, without the redundant columns and with day as the column name instead of date. date is a reserved word in standard SQL and a basic type name in PostgreSQL and shouldn't be used as an identifier.
CREATE TABLE uniques(
day date -- instead of "date"
,user_id int
);
Improved query:
WITH du AS (
SELECT DISTINCT ON (1, 2)
day, user_id
,date_trunc('week', day + 1)::date - 1 AS week_beg
,date_trunc('month', day)::date AS month_beg
FROM uniques
)
SELECT day, count(user_id) AS d, max(w) AS w, max(m) AS m
FROM (
SELECT user_id, day
,dense_rank() OVER(PARTITION BY week_beg ORDER BY user_id) AS w
,dense_rank() OVER(PARTITION BY month_beg ORDER BY user_id) AS m
FROM du
) s
GROUP BY day
ORDER BY day;
SQL Fiddle demonstrating the performance of 4 faster variants. It depends on your data distribution which is fastest for you.
All of them are about 10x as fast as the correlated subqueries version (which isn't bad for correlated subqueries).