Calculating follower growth over time for each influencer


Problem description

I have a table with influencers and their follower counter for each day:

influencer_id |     date     |    followers
     1        | 2020-05-29   |      7361
     1        | 2020-05-28   |      7234
                    ...
     2        | 2020-05-29   |       82
     2        | 2020-05-28   |       85
                    ...
     3        | 2020-05-29   |      3434
     3        | 2020-05-28   |      2988
     3        | 2020-05-27   |      2765
                    ...
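
To try the queries on this page, the sample rows above could be loaded as follows. This is a minimal sketch: the column types and the primary key are assumptions, since the question does not show the table definition.

```sql
-- Hypothetical table definition; column types and key are assumed,
-- not taken from the question.
CREATE TABLE influencer_follower_daily (
   influencer_id int  NOT NULL
 , date          date NOT NULL
 , followers     int  NOT NULL
 , PRIMARY KEY (influencer_id, date)
);

-- Only the rows shown in the question; the elided rows ("...") are left out.
INSERT INTO influencer_follower_daily (influencer_id, date, followers) VALUES
   (1, '2020-05-29', 7361)
 , (1, '2020-05-28', 7234)
 , (2, '2020-05-29',   82)
 , (2, '2020-05-28',   85)
 , (3, '2020-05-29', 3434)
 , (3, '2020-05-28', 2988)
 , (3, '2020-05-27', 2765);
```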

Let's say I want to calculate how many followers each individual influencer has gained in the last 7 days and get the following table:

influencer_id |                       growth
     1        |  <num followers last day - num followers first day>
     2        |                         "
     3        |                         "

As a first attempt I did this:

SELECT influencer_id,
      (MAX(followers) - MIN(followers)) AS growth
FROM influencer_follower_daily
WHERE date < '2020-05-30'
AND date >= '2020-05-23'
GROUP BY influencer_id;

This works and shows the growth over the week for each influencer. But it assumes the follower count always increases and people never unfollow!
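
The problem can be seen with the two sample rows for influencer 2, whose count drops from 85 to 82. The sketch below uses an inline sample instead of the real table, and the column aliases are made up for illustration.

```sql
-- Two sample rows for influencer 2: followers drop from 85 to 82.
WITH sample(influencer_id, date, followers) AS (
   VALUES (2, date '2020-05-28', 85)
        , (2, date '2020-05-29', 82)
   )
SELECT influencer_id
     , max(followers) - min(followers)              AS max_minus_min     -- 3: the sign is lost
     , (array_agg(followers ORDER BY date DESC))[1]
     - (array_agg(followers ORDER BY date))[1]      AS last_minus_first  -- -3: the actual change
FROM   sample
GROUP  BY influencer_id;
```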

So is there a way to achieve what I want using an SQL query over the original table? Or will I have to generate a completely new table using a FOR loop that calculates a +/- follower change column between each date?

Solution

The simple aggregate functions first() and last() are not implemented in standard Postgres. But see below.

1. array_agg()

Gordon demonstrated a query with array_agg(), but that's more expensive than necessary, especially with many rows per group. Even more so when called twice, and with ORDER BY per aggregate. This equivalent alternative should be substantially faster:

SELECT influencer_id, arr[array_upper(arr, 1)] - arr[1]
FROM  (
   SELECT influencer_id, array_agg(followers) AS arr
   FROM  (
      SELECT influencer_id, followers
      FROM   influencer_follower_daily
      WHERE  date >= '2020-05-23'
      AND    date <  '2020-05-30'
      ORDER  BY influencer_id, date
      ) sub1
   GROUP  BY influencer_id
   ) sub2;

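This is faster because it sorts once and aggregates once. The sort order of the inner subquery sub1 is carried over to the next level. The Postgres manual notes that feeding an aggregate from a sorted subquery usually works, but can fail if the outer query level adds processing such as a join.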

Indexes matter:

  • If you query the whole table or most of it, an index on (influencer_id, date, followers) can help (a lot) with index-only scans.

  • If you query only a small fragment of the table, an index on (date) or (date, influencer_id, followers) can help (a lot).
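
The two index variants could be created like this. The index names are hypothetical; pick one variant depending on which kind of query dominates. Index-only scans also depend on the table being vacuumed often enough to keep the visibility map current.

```sql
-- Index names are hypothetical.

-- For queries over all or most of the table: allows index-only scans
-- that already return rows sorted by (influencer_id, date).
CREATE INDEX influencer_follower_daily_id_date_idx
   ON influencer_follower_daily (influencer_id, date, followers);

-- For queries over a small date range: leading column matches the WHERE clause.
CREATE INDEX influencer_follower_daily_date_idx
   ON influencer_follower_daily (date, influencer_id, followers);
```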

2. DISTINCT & window functions

Gordon also demonstrated DISTINCT with window functions. Again, this can be made substantially faster:

SELECT DISTINCT ON (influencer_id)
       influencer_id
     , last_value(followers) OVER (PARTITION BY influencer_id ORDER BY date
                                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
     - followers AS growth
FROM   influencer_follower_daily
WHERE  date >= '2020-05-23'
AND    date <  '2020-05-30'
ORDER  BY influencer_id, date;

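This uses a single window function with the same sort order (!) as the main query. To achieve this, we need the non-default window definition with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

It also uses DISTINCT ON instead of DISTINCT, which keeps the first row per influencer_id according to ORDER BY.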
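
Why the non-default frame matters can be sketched with the three sample rows for influencer 3. With ORDER BY and no explicit frame, the window frame ends at the current row, so last_value() just returns the current row's value. The column aliases are made up for illustration.

```sql
WITH sample(date, followers) AS (
   VALUES (date '2020-05-27', 2765)
        , (date '2020-05-28', 2988)
        , (date '2020-05-29', 3434)
   )
SELECT date, followers
       -- default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
     , last_value(followers) OVER (ORDER BY date) AS default_frame
       -- explicit frame spanning the whole partition
     , last_value(followers) OVER (ORDER BY date
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS full_frame
FROM   sample
ORDER  BY date;
```

Here default_frame equals followers on every row, while full_frame is 3434 on every row.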

3. Custom aggregate functions

first() and last()

You can add those yourself; it's pretty simple. See the instructions in the Postgres Wiki.
Or install the additional module first_last_agg, which provides a faster implementation in C.
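
For reference, SQL definitions along the lines of the Wiki version might look like the sketch below. This is written from memory and not copied from the Wiki, so check the Wiki page for the current text. CREATE OR REPLACE AGGREGATE requires Postgres 12 or later.

```sql
-- Transition function for first(): always keep the existing state.
-- Because it is STRICT and the aggregate has no initial condition,
-- the first non-null input becomes the state.
CREATE OR REPLACE FUNCTION public.first_agg(anyelement, anyelement)
  RETURNS anyelement
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $1';

CREATE OR REPLACE AGGREGATE public.first(anyelement) (
   SFUNC    = public.first_agg
 , STYPE    = anyelement
 , PARALLEL = SAFE
);

-- Transition function for last(): always replace the state with the new input.
CREATE OR REPLACE FUNCTION public.last_agg(anyelement, anyelement)
  RETURNS anyelement
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $2';

CREATE OR REPLACE AGGREGATE public.last(anyelement) (
   SFUNC    = public.last_agg
 , STYPE    = anyelement
 , PARALLEL = SAFE
);
```

Both aggregates depend on the order of their input rows, so they need sorted input or an ORDER BY inside the aggregate call.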

Then your query becomes simpler:

SELECT influencer_id, last(followers) - first(followers) AS growth
FROM  (
   SELECT influencer_id, followers
   FROM   influencer_follower_daily 
   WHERE  date >= '2020-03-02'
   AND    date <  '2020-05-09'
   ORDER  BY influencer_id, date
   ) z
GROUP  BY influencer_id
ORDER  BY influencer_id;

Custom aggregate growth()

You can combine first() and last() in a single aggregate function. That's faster, but calling two C functions will still outperform one custom SQL function.

This basically encapsulates the logic of my first query in a custom aggregate:

CREATE OR REPLACE FUNCTION f_growth(anyarray)
  RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $1[array_upper($1, 1)] - $1[1]';

CREATE OR REPLACE AGGREGATE growth(anyelement) (
   SFUNC     = array_append
 , STYPE     = anyarray
 , FINALFUNC = f_growth
 , PARALLEL  = SAFE
);

This works for any numeric type (or any type with an operator type - type returning the same type). The query becomes simpler still:

SELECT influencer_id, growth(followers)
FROM  (
   SELECT influencer_id, followers
   FROM   influencer_follower_daily 
   WHERE  date >= '2020-05-23'
   AND    date <  '2020-05-30'
   ORDER  BY influencer_id, date
   ) z
GROUP  BY influencer_id
ORDER  BY influencer_id;

Or, a little slower but shortest of all:

SELECT influencer_id, growth(followers ORDER BY date)
FROM   influencer_follower_daily 
WHERE  date >= '2020-05-23'
AND    date <  '2020-05-30'
GROUP  BY 1
ORDER  BY 1;

4. Performance optimization for many rows per group

With many rows per group / partition, other query techniques can be (a lot) faster, typically ones that fetch only the first and last row per group from an index instead of reading every row.
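
One such technique, sketched under assumptions the original does not confirm: a hypothetical table influencer with one row per influencer_id, and an index on influencer_follower_daily (influencer_id, date). Each LATERAL subquery can then fetch a single row from the index instead of scanning the whole group.

```sql
-- Assumes a hypothetical table influencer(influencer_id), one row per influencer,
-- and an index on influencer_follower_daily (influencer_id, date).
SELECT i.influencer_id, l.followers - f.followers AS growth
FROM   influencer i
CROSS  JOIN LATERAL (
   -- first row in the date range for this influencer
   SELECT d.followers
   FROM   influencer_follower_daily d
   WHERE  d.influencer_id = i.influencer_id
   AND    d.date >= '2020-05-23'
   AND    d.date <  '2020-05-30'
   ORDER  BY d.date
   LIMIT  1
   ) f
CROSS  JOIN LATERAL (
   -- last row in the date range for this influencer
   SELECT d.followers
   FROM   influencer_follower_daily d
   WHERE  d.influencer_id = i.influencer_id
   AND    d.date >= '2020-05-23'
   AND    d.date <  '2020-05-30'
   ORDER  BY d.date DESC
   LIMIT  1
   ) l;
```

Influencers without any rows in the date range are dropped from the result by the CROSS JOIN. Whether this beats the aggregate queries depends on the actual number of rows per influencer.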

If that applies, I suggest you start a new question disclosing exact table definition(s) and cardinalities ...

