计算热门话题或标签的最佳方法是什么? [英] What is the best way to compute trending topics or tags?

查看:13
本文介绍了计算热门话题或标签的最佳方法是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

许多网站提供一些统计数据,例如过去 24 小时内最热门的话题".例如,Topix.com 在其新闻趋势"部分中显示了这一点.在那里,您可以看到提及次数增长最快的主题.

我想计算这样一个嗡嗡声";对于一个话题,太.我怎么能这样做?该算法应该对总是不太热的主题进行加权.通常(几乎)没人提及的话题应该是最热门的话题.

Google 提供热门趋势",topix.com 显示热门话题",fav.or.it 显示关键字趋势"- 所有这些服务都有一个共同点:它们只向您展示当前异常热门的即将到来的趋势.

诸如Britney Spears"、weather"之类的术语或巴黎希尔顿"不会出现在这些列表中,因为它们总是热门且频繁.这篇文章称之为小甜甜问题".p>

我的问题:如何编写算法或使用现有算法来解决这个问题?拥有一个包含过去 24 小时内搜索过的关键字的列表,该算法应该会显示 10 个(例如)最热门的关键字.

我知道,在上面的文章中,提到了某种算法.我试过用 PHP 编写代码 但我不这么认为它会工作的.它只是找到大多数,不是吗?

我希望你能帮助我(编码示例会很棒).

解决方案

这个问题需要一个z-score或者标准分,会考虑历史平均值,就像其他人提到的那样,还要考虑标准差这些历史数据,使其比仅使用平均值更可靠.

在您的情况下,z 分数是通过以下公式计算的,其中趋势将是诸如观看次数/天之类的比率.

z-score = ([当前趋势] - [平均历史趋势])/[历史趋势的标准差]

当使用 z-score 时,z-score 越高或越低,趋势越不正常,例如,如果 z-score 高度为正,则趋势异常上升,而如果它为高度负,则趋势异常上升.异常下降.因此,一旦您计算了所有候选趋势的 z 分数,最高的 10 个 z 分数将与最异常增加的 z 分数相关.

有关 z 分数的更多信息,请参阅 维基百科.

代码

从数学导入 sqrtdef zscore(obs, pop):# 人口规模.数字 = 浮动(len(pop))# 平均人口值.avg = sum(pop)/number# 总体标准差.std = sqrt(sum(((c - avg) ** 2) for c in pop)/number)# Zscore 计算.返回(obs - avg)/标准

样本输出

>>>zscore(12, [2, 4, 4, 4, 5, 5, 7, 9])3.5>>>zscore(20, [21, 22, 19, 18, 17, 22, 20, 20])0.0739221270955>>>zscore(20, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])1.00303599234>>>zscore(2, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])-0.922793112954>>>zscore(9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0])1.65291949506

备注

  • 如果您不想考虑太多历史记录,您可以将此方法与滑动窗口(即过去 30 天)一起使用,这将使短期趋势更加明显,并可以减少处理时间.

  • 您还可以使用 z 分数来表示从一天到第二天的观看次数变化等值,以找出每天增加/减少观看次数的异常值.这就像使用每日观看次数图表的斜率或导数一样.

  • 如果你跟踪当前人口规模、当前人口总数和当前人口总数 x^2,则无需重新计算这些值,只需更新它们因此您只需要为历史保留这些值,而不是每个数据值.下面的代码演示了这一点.

     from math import sqrt班级zscore:def __init__(self, pop = []):self.number = float(len(pop))self.total = sum(pop)self.sqrTotal = sum(x ** 2 for x in pop)定义更新(自我,价值):self.number += 1.0self.total += 价值self.sqrTotal += 值 ** 2默认平均值(自我):返回 self.total/self.number定义标准(自我):返回 sqrt((self.sqrTotal/self.number) - self.avg() ** 2)def分数(自我,obs):返回 (obs - self.avg())/self.std()

  • 使用此方法,您的工作流程如下.为每个主题、标签或页面创建一个浮点字段,用于显示数据库中的总天数、查看次数总和以及查看次数平方和.如果您有历史数据,请使用该数据初始化这些字段,否则初始化为零.在每天结束时,使用当天的查看次数与存储在三个数据库字段中的历史数据计算 z 分数.具有最高 X z 分数的主题、标签或页面是您的 X最热门趋势";的一天.最后用当天的值更新 3 个字段中的每一个,并在第二天重复该过程.

新增功能

上面讨论的正常 z 分数不考虑数据的顺序,因此观察1"或9"的 z 分数与序列 [1, 1,1、1、9、9、9、9].显然,对于趋势发现,最新数据应该比旧数据具有更大的权重,因此我们希望1"观察值比9"观察值具有更大的量值分数.为了实现这一点,我提出了一个浮动平均 z 分数.应该清楚的是,这种方法不能保证在统计上是合理的,但应该对趋势发现或类似方法有用.标准 z-score 和浮动平均 z-score 的主要区别在于使用浮动平均数来计算平均人口值和平均人口值的平方.详情见代码:

代码

类fazscore:def __init__(自我,衰变,流行= []):self.sqrAvg = self.avg = 0# 历史数据影响减弱的速率.self.decay = 衰变对于流行音乐中的 x:self.update(x)定义更新(自我,价值):# 将初始平均值设置为序列中的第一个值.如果 self.avg == 0 和 self.sqrAvg == 0:self.avg = 浮点数(值)self.sqrAvg = float((value ** 2))# 使用a计算其余值的平均值# 浮动平均值.别的:self.avg = self.avg * self.decay + value * (1 - self.decay)self.sqrAvg = self.sqrAvg * self.decay + (value ** 2) * (1 - self.decay)回归自我定义标准(自我):# 有点特别的标准差计算.返回 sqrt(self.sqrAvg - self.avg ** 2)def分数(自我,obs):if self.std() == 0: return (obs - self.avg) * float("infinity")否则:返回(obs - self.avg)/self.std()

示例 IO

>>>fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(1)-1.67770595327>>>fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(9)0.596052006642>>>fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(12)3.46442230724>>>fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(22)7.7773245459>>>fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20]).score(20)-0.24633160155>>>fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(20)1.1069362749>>>fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(2)-0.786764452966>>>fazscore(0.9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0]).score(9)1.82262469243>>>fazscore(0.8, [40] * 200).score(1)-inf

更新

正如 David Kemp 正确指出的那样,如果给定一系列常量值,然后请求与其他值不同的观察值的 zscore,则结果可能应该是非零的.实际上返回的值应该是无穷大.所以我改变了这一行,

如果 self.std() == 0: 返回 0

到:

if self.std() == 0: return (obs - self.avg) * float("infinity")

此更改反映在 fazscore 解决方案代码中.如果不想处理无限值,可接受的解决方案可能是将行改为:

如果 self.std() == 0: 返回 obs - self.avg

Many sites offer some statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its section "News Trends". There, you can see the topics which have the fastest growing number of mentions.

I want to compute such a "buzz" for a topic, too. How could I do this? The algorithm should weight the topics which are always hot less. The topics which normally (almost) no one mentions should be the hottest ones.

Google offers "Hot Trends", topix.com shows "Hot Topics", fav.or.it shows "Keyword Trends" - all these services have one thing in common: They only show you upcoming trends which are abnormally hot at the moment.

Terms like "Britney Spears", "weather" or "Paris Hilton" won't appear in these lists because they're always hot and frequent. This article calls this "The Britney Spears Problem".

My question: How can you code an algorithm or use an existing one to solve this problem? Having a list with the keywords searched in the last 24h, the algorithm should show you the 10 (for example) hottest ones.

I know, in the article above, there is some kind of algorithm mentioned. I've tried to code it in PHP but I don't think that it'll work. It just finds the majority, doesn't it?

I hope you can help me (coding examples would be great).

解决方案

This problem calls for a z-score or standard score, which will take into account the historical average, as other people have mentioned, but also the standard deviation of this historical data, making it more robust than just using the average.

In your case a z-score is calculated by the following formula, where the trend would be a rate such as views / day.

z-score = ([current trend] - [average historic trends]) / [standard deviation of historic trends]

When a z-score is used, the higher or lower the z-score the more abnormal the trend, so for example if the z-score is highly positive then the trend is abnormally rising, while if it is highly negative it is abnormally falling. So once you calculate the z-score for all the candidate trends the highest 10 z-scores will relate to the most abnormally increasing z-scores.

Please see Wikipedia for more information, about z-scores.

Code

from math import sqrt

def zscore(obs, pop):
    # Size of population.
    number = float(len(pop))
    # Average population value.
    avg = sum(pop) / number
    # Standard deviation of population.
    std = sqrt(sum(((c - avg) ** 2) for c in pop) / number)
    # Zscore Calculation.
    return (obs - avg) / std

Sample Output

>>> zscore(12, [2, 4, 4, 4, 5, 5, 7, 9])
3.5
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20])
0.0739221270955
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
1.00303599234
>>> zscore(2, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
-0.922793112954
>>> zscore(9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0])
1.65291949506

Notes

  • You can use this method with a sliding window (i.e. last 30 days) if you wish not to take to much history into account, which will make short term trends more pronounced and can cut down on the processing time.

  • You could also use a z-score for values such as change in views from one day to next day to locate the abnormal values for increasing/decreasing views per day. This is like using the slope or derivative of the views per day graph.

  • If you keep track of the current size of the population, the current total of the population, and the current total of x^2 of the population, you don't need to recalculate these values, only update them and hence you only need to keep these values for the history, not each data value. The following code demonstrates this.

      from math import sqrt
    
      class zscore:
          def __init__(self, pop = []):
              self.number = float(len(pop))
              self.total = sum(pop)
              self.sqrTotal = sum(x ** 2 for x in pop)
          def update(self, value):
              self.number += 1.0
              self.total += value
              self.sqrTotal += value ** 2
          def avg(self):
              return self.total / self.number
          def std(self):
              return sqrt((self.sqrTotal / self.number) - self.avg() ** 2)
          def score(self, obs):
              return (obs - self.avg()) / self.std()
    

  • Using this method your work flow would be as follows. For each topic, tag, or page create a floating point field, for the total number of days, sum of views, and sum of views squared in your database. If you have historic data, initialize these fields using that data, otherwise initialize to zero. At the end of each day, calculate the z-score using the day's number of views against the historic data stored in the three database fields. The topics, tags, or pages, with the highest X z-scores are your X "hotest trends" of the day. Finally update each of the 3 fields with the day's value and repeat the process next day.

New Addition

Normal z-scores as discussed above do not take into account the order of the data and hence the z-score for an observation of '1' or '9' would have the same magnitude against the sequence [1, 1, 1, 1, 9, 9, 9, 9]. Obviously for trend finding, the most current data should have more weight than older data and hence we want the '1' observation to have a larger magnitude score than the '9' observation. In order to achieve this I propose a floating average z-score. It should be clear that this method is NOT guaranteed to be statistically sound but should be useful for trend finding or similar. The main difference between the standard z-score and the floating average z-score is the use of a floating average to calculate the average population value and the average population value squared. See code for details:

Code

class fazscore:
    def __init__(self, decay, pop = []):
        self.sqrAvg = self.avg = 0
        # The rate at which the historic data's effect will diminish.
        self.decay = decay
        for x in pop: self.update(x)
    def update(self, value):
        # Set initial averages to the first value in the sequence.
        if self.avg == 0 and self.sqrAvg == 0:
            self.avg = float(value)
            self.sqrAvg = float((value ** 2))
        # Calculate the average of the rest of the values using a 
        # floating average.
        else:
            self.avg = self.avg * self.decay + value * (1 - self.decay)
            self.sqrAvg = self.sqrAvg * self.decay + (value ** 2) * (1 - self.decay)
        return self
    def std(self):
        # Somewhat ad-hoc standard deviation calculation.
        return sqrt(self.sqrAvg - self.avg ** 2)
    def score(self, obs):
        if self.std() == 0: return (obs - self.avg) * float("infinity")
        else: return (obs - self.avg) / self.std()

Sample IO

>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(1)
-1.67770595327
>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(9)
0.596052006642
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(12)
3.46442230724
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(22)
7.7773245459
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20]).score(20)
-0.24633160155
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(20)
1.1069362749
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(2)
-0.786764452966
>>> fazscore(0.9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0]).score(9)
1.82262469243
>>> fazscore(0.8, [40] * 200).score(1)
-inf

Update

As David Kemp correctly pointed out, if given a series of constant values and then a zscore for an observed value which differs from the other values is requested the result should probably be non-zero. In fact the value returned should be infinity. So I changed this line,

if self.std() == 0: return 0

to:

if self.std() == 0: return (obs - self.avg) * float("infinity")

This change is reflected in the fazscore solution code. If one does not want to deal with infinite values an acceptable solution could be to instead change the line to:

if self.std() == 0: return obs - self.avg

这篇关于计算热门话题或标签的最佳方法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆