Calculating follower growth over time for each influencer
Problem description
I have a table with influencers and their follower counter for each day:
influencer_id | date | followers
1 | 2020-05-29 | 7361
1 | 2020-05-28 | 7234
...
2 | 2020-05-29 | 82
2 | 2020-05-28 | 85
...
3 | 2020-05-29 | 3434
3 | 2020-05-28 | 2988
3 | 2020-05-27 | 2765
...
Let's say I want to calculate how many followers each individual influencer has gained in the last 7 days and get the following table:
influencer_id | growth
1 | <num followers last day - num followers first day>
2 | "
3 | "
As a first attempt I did this:
SELECT influencer_id,
(MAX(followers) - MIN(followers)) AS growth
FROM influencer_follower_daily
WHERE date < '2020-05-30'
AND date >= '2020-05-23'
GROUP BY influencer_id;
This works and shows the growth over the week for each influencer. But it assumes the follower count always increases and people never unfollow!
So is there a way to achieve what I want using an SQL query over the original table? Or will I have to generate a completely new table using a FOR loop that calculates a +/- follower change column between each date?
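To make the flaw in the MAX/MIN attempt concrete, here is a minimal sketch using the two sample rows for influencer 2 from the table above (85 followers, then 82). The CTE stands in for the real table, and the array_agg() expressions are only one possible way to pick the first and last value; they are included purely for comparison.

```sql
-- Sample rows for influencer 2 from the question; the CTE replaces the real table.
WITH influencer_follower_daily (influencer_id, date, followers) AS (
   VALUES (2, date '2020-05-28', 85)
        , (2, date '2020-05-29', 82)
   )
SELECT influencer_id
     , max(followers) - min(followers)              AS max_minus_min    -- 85 - 82 =  3
     , (array_agg(followers ORDER BY date DESC))[1]
     - (array_agg(followers ORDER BY date))[1]      AS last_minus_first -- 82 - 85 = -3
FROM   influencer_follower_daily
GROUP  BY influencer_id;
```

The first expression reports a gain of 3 even though the account lost 3 followers, because MAX and MIN ignore the order of the dates.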
Answer
The simple aggregate functions first() and last() are not implemented in standard Postgres. But see below.
1. array_agg()
Gordon demonstrated a query with array_agg(), but that's more expensive than necessary, especially with many rows per group. Even more so when called twice, and with ORDER BY per aggregate. This equivalent alternative should be substantially faster:
SELECT influencer_id, arr[array_upper(arr, 1)] - arr[1]
FROM  (
   SELECT influencer_id, array_agg(followers) AS arr
   FROM  (
      SELECT influencer_id, followers
      FROM   influencer_follower_daily
      WHERE  date >= '2020-05-23'
      AND    date <  '2020-05-30'
      ORDER  BY influencer_id, date
      ) sub1
   GROUP  BY influencer_id
   ) sub2;
This is faster because it sorts once and aggregates once. The sort order of the inner subquery sub1 is carried over to the next level.
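The subscript arithmetic in the outer SELECT can be checked in isolation. This small sketch hard-codes the three follower counts of influencer 3 from the question, already sorted by date, instead of aggregating them from the table:

```sql
-- Postgres arrays are 1-based by default, so arr[1] is the first element
-- and arr[array_upper(arr, 1)] is the last one.
SELECT arr[array_upper(arr, 1)] - arr[1] AS growth   -- 3434 - 2765 = 669
FROM  (SELECT ARRAY[2765, 2988, 3434] AS arr) t;
```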
Indexes matter:
- If you query the whole table or most of it, an index on (influencer_id, date, followers) can help (a lot) with index-only scans.
- If you query only a small fragment of the table, an index on (date) or (date, influencer_id, followers) can help (a lot).
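The two multicolumn indexes mentioned above could be created roughly like this. The question does not show the table definition, so the column types below are an assumption, and the index names are made up for illustration:

```sql
-- Assumed table definition (column types are a guess; the question does not show them).
CREATE TABLE IF NOT EXISTS influencer_follower_daily (
   influencer_id integer NOT NULL
 , date          date    NOT NULL
 , followers     integer NOT NULL
);

-- For queries over all or most of the table; covers the query for index-only scans.
-- Index names are hypothetical.
CREATE INDEX IF NOT EXISTS ifd_influencer_date_followers_idx
   ON influencer_follower_daily (influencer_id, date, followers);

-- For queries that select only a small date range.
CREATE INDEX IF NOT EXISTS ifd_date_influencer_followers_idx
   ON influencer_follower_daily (date, influencer_id, followers);
```

Which of the two pays off depends on how selective the date filter is, so it is worth comparing plans with EXPLAIN (ANALYZE) on real data before keeping both.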
2. DISTINCT & window functions
Gordon also demonstrated DISTINCT with window functions. Again, this can be substantially faster:
SELECT DISTINCT ON (influencer_id)
       influencer_id
     , last_value(followers) OVER (PARTITION BY influencer_id ORDER BY date
                                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
     - followers AS growth
FROM   influencer_follower_daily
WHERE  date >= '2020-05-23'
AND    date <  '2020-05-30'
ORDER  BY influencer_id, date;
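This uses a single window function with the same sort order (!) as the main query. To achieve this, we need the non-default window frame definition ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, and DISTINCT ON instead of DISTINCT. DISTINCT ON keeps the first row per influencer_id in the given sort order, so followers in that row is the count of the first day, while last_value() supplies the count of the last day.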
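Why the frame clause is needed can be seen in a small side-by-side sketch. It uses the three sample rows for influencer 3 from the question in a CTE (named t here only for the example). With ORDER BY and no explicit frame, the default frame ends at the current row, so last_value() just returns the current row's value:

```sql
WITH t (influencer_id, date, followers) AS (
   VALUES (3, date '2020-05-27', 2765)
        , (3, date '2020-05-28', 2988)
        , (3, date '2020-05-29', 3434)
   )
SELECT date, followers
       -- default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
     , last_value(followers) OVER (PARTITION BY influencer_id ORDER BY date) AS default_frame
       -- explicit frame spanning the whole partition
     , last_value(followers) OVER (PARTITION BY influencer_id ORDER BY date
                                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS full_frame
FROM   t
ORDER  BY date;
```

Here default_frame repeats the followers column (2765, 2988, 3434), while full_frame is 3434 on every row.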
3. Custom aggregate functions first() and last()
You can add those yourself, it's pretty simple. See instructions in the Postgres Wiki. Or install the additional module first_last_agg with a faster implementation in C.
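A minimal sketch of such SQL-level aggregates, along the lines of what the Postgres Wiki describes, might look like the following. The helper function names first_agg and last_agg and the public schema are assumptions here; check the wiki for the maintained version.

```sql
-- State transition function: keep the state, i.e. the first non-NULL value seen.
CREATE OR REPLACE FUNCTION public.first_agg (anyelement, anyelement)
  RETURNS anyelement
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $1';

CREATE AGGREGATE public.first (anyelement) (
   SFUNC    = public.first_agg
 , STYPE    = anyelement
 , PARALLEL = safe
);

-- State transition function: always replace the state with the newest non-NULL value.
CREATE OR REPLACE FUNCTION public.last_agg (anyelement, anyelement)
  RETURNS anyelement
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $2';

CREATE AGGREGATE public.last (anyelement) (
   SFUNC    = public.last_agg
 , STYPE    = anyelement
 , PARALLEL = safe
);

-- Quick check with an explicit sort order inside the aggregate call:
SELECT first(x ORDER BY x) AS first_x   -- 1
     , last(x ORDER BY x)  AS last_x    -- 3
FROM  (VALUES (1), (2), (3)) t(x);
```

Because the transition functions are STRICT and no initial state is given, NULL inputs are skipped and the first non-NULL input becomes the initial state.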
Then your query becomes simpler:
SELECT influencer_id, last(followers) - first(followers) AS growth
FROM  (
   SELECT influencer_id, followers
   FROM   influencer_follower_daily
   WHERE  date >= '2020-03-02'
   AND    date <  '2020-05-09'
   ORDER  BY influencer_id, date
   ) z
GROUP  BY influencer_id
ORDER  BY influencer_id;
Custom aggregate growth()
You can combine first() and last() in a single aggregate function. That's faster, but calling two C functions will still outperform one custom SQL function.
It basically encapsulates the logic of my first query in a custom aggregate:
CREATE OR REPLACE FUNCTION f_growth(anyarray)
RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $1[array_upper($1, 1)] - $1[1]';
CREATE OR REPLACE AGGREGATE growth(anyelement) (
SFUNC = array_append
, STYPE = anyarray
, FINALFUNC = f_growth
, PARALLEL = SAFE
);
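This works for any numeric type (or any type with an operator type - type returning the same type). Note that in Postgres 14 or later array_append() is declared with anycompatiblearray and anycompatible arguments, so the definitions above may need those pseudo-types in place of anyarray and anyelement. The query becomes simpler still: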
SELECT influencer_id, growth(followers)
FROM  (
   SELECT influencer_id, followers
   FROM   influencer_follower_daily
   WHERE  date >= '2020-05-23'
   AND    date <  '2020-05-30'
   ORDER  BY influencer_id, date
   ) z
GROUP  BY influencer_id
ORDER  BY influencer_id;
Or, a little slower but shorter still:
SELECT influencer_id, growth(followers ORDER BY date)
FROM influencer_follower_daily
WHERE date >= '2020-05-23'
AND date < '2020-05-30'
GROUP BY 1
ORDER BY 1;
4. Performance optimization for many rows per group
With many rows per group / partition, other query techniques can be (a lot) faster. If that applies, I suggest you start a new question disclosing exact table definition(s) and cardinalities.
Closely related:
- Get values from first and last row per group
- PostgreSQL: joining arrays within group by clause
- Use something like TOP with GROUP BY
- Best performance in sampling repeated value from a grouped column