Count cumulative total in PostgreSQL
Question
I am using count and group by to get the number of subscribers registered each day:
SELECT created_at, COUNT(email)
FROM subscriptions
GROUP BY created_at;
Result:
created_at count
-----------------
04-04-2011 100
05-04-2011 50
06-04-2011 50
07-04-2011 300
Instead, I want the cumulative total of subscribers as of each day. How do I get this?
created_at count
-----------------
04-04-2011 100
05-04-2011 150
06-04-2011 200
07-04-2011 500
Recommended answer
With larger datasets, window functions are the most efficient way to perform these kinds of queries: the table is scanned only once, instead of once for each date as a self-join would do. It also looks a lot simpler. :) PostgreSQL 8.4 and up support window functions.
It looks like this:
SELECT created_at, sum(count(email)) OVER (ORDER BY created_at)
FROM subscriptions
GROUP BY created_at;
Here OVER creates the window; ORDER BY created_at means that it has to sum up the per-day counts in created_at order.
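To make the window explanation above more concrete, here is a sketch of two equivalent formulations. The first spells out the frame that PostgreSQL applies by default when a window has an ORDER BY but no explicit frame clause. The second is one possible form of the self-join the answer compares against. The aliases (cumulative_count, d, s) are illustrative names, not part of the original answer.

```sql
-- Same query with the default window frame written out explicitly:
-- each row sums every per-day count from the first date up to its own date.
SELECT created_at,
       sum(count(email)) OVER (
           ORDER BY created_at
           RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS cumulative_count
FROM subscriptions
GROUP BY created_at;

-- Self-join alternative, for comparison only: every distinct date is
-- joined to all rows on or before it, so the table is read repeatedly.
SELECT d.created_at,
       count(s.email) AS cumulative_count
FROM (SELECT DISTINCT created_at FROM subscriptions) AS d
JOIN subscriptions AS s
  ON s.created_at <= d.created_at
GROUP BY d.created_at
ORDER BY d.created_at;
```

Both queries should return the same totals; the window version simply avoids the repeated reads.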
Edit: If you want to remove duplicate emails within a single day, you can use sum(count(distinct email)). Unfortunately, this won't remove duplicates that span different dates.
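Written out as a full query, the per-day de-duplication might look like the sketch below. It reuses the subscriptions table from the question; the cumulative_count alias is an illustrative name.

```sql
-- count(DISTINCT email) drops repeated emails within each day;
-- an email that appears on two different days is still counted twice.
SELECT created_at,
       sum(count(DISTINCT email)) OVER (ORDER BY created_at) AS cumulative_count
FROM subscriptions
GROUP BY created_at;
```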
If you want to remove all duplicates, I think the easiest way is to use a subquery with DISTINCT ON. This attributes each email to its earliest date (because the subquery sorts by created_at in ascending order, DISTINCT ON keeps the earliest row for each email):
SELECT created_at, sum(count(email)) OVER (ORDER BY created_at)
FROM (
SELECT DISTINCT ON (email) created_at, email
FROM subscriptions ORDER BY email, created_at
) AS subq
GROUP BY created_at;
If you create an index on (email, created_at), this query shouldn't be too slow either.
(If you want to test, this is how I created the sample dataset:)
create table subscriptions as
select date '2000-04-04' + (i/10000)::int as created_at,
'foofoobar@foobar.com' || (i%700000)::text as email
from generate_series(1,1000000) i;
create index on subscriptions (email, created_at);