What is the best way to compute trending topics or tags?

Question

Many sites offer statistics such as "The hottest topics in the last 24h". For example, Topix.com shows this in its "News Trends" section. There, you can see the topics with the fastest-growing number of mentions.

I want to compute such a "buzz" score for a topic, too. How could I do this? The algorithm should give less weight to topics that are always hot. Topics that (almost) no one normally mentions should rank as the hottest ones.

Google offers "Hot Trends", topix.com shows "Hot Topics", fav.or.it shows "Keyword Trends" - all these services have one thing in common: they only show you upcoming trends that are abnormally hot at the moment.

Terms like "Britney Spears", "weather" or "Paris Hilton" won't appear in these lists because they're always hot and frequent. This article calls this "The Britney Spears Problem".

My question: how can you code an algorithm, or use an existing one, to solve this problem? Given a list of the keywords searched in the last 24h, the algorithm should show you the 10 (for example) hottest ones.

I know that the article above mentions some kind of algorithm. I've tried to code it in PHP, but I don't think it will work. It just finds the majority, doesn't it?

I hope you can help me (coding examples would be great).

Recommended answer

You need an algorithm that measures the velocity of a topic - in other words, if you graph it, you want to show the topics that are going up at an incredible rate.

This is the first derivative of the trend line, and it is not difficult to incorporate as a weighted factor in your overall calculation.

Normalize

One technique you'll need is to normalize all your data. For each topic you are following, keep a very low-pass filter that defines that topic's baseline. Every data point that comes in for that topic should then be normalized: subtract its baseline and ALL of your topics will sit near 0, with spikes above and below the line. You may instead want to divide the signal by its baseline magnitude, which brings the signal to around 1.0. This not only brings all signals in line with each other (normalizes the baseline), but also normalizes the spikes. A Britney spike is going to be orders of magnitude larger than someone else's spike, but that doesn't mean you should pay attention to it - the spike may be very small relative to her baseline.
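As a rough illustration of the normalization step, here is a minimal Python sketch. It assumes the "very low-pass filter" is an exponential moving average; the function name `normalized_series` and the parameters `alpha` and `floor` are hypothetical choices made for this example, not something the answer specifies.

```python
def normalized_series(counts, alpha=0.05, floor=1.0):
    """Divide each count by the topic's slow-moving baseline.

    counts: mention counts per time bucket for ONE topic, oldest first.
    alpha:  assumed smoothing factor; smaller means a slower (lower-pass) filter.
    floor:  assumed lower bound on the baseline to avoid dividing by ~0.
    """
    baseline = None
    ratios = []
    for count in counts:
        if baseline is None:
            # Seed the baseline with the first observation.
            baseline = float(count)
        # Compare the new point against the baseline built from earlier points.
        ratios.append(count / max(baseline, floor))
        # Then let the baseline drift slowly toward the new point.
        baseline += alpha * (count - baseline)
    return ratios


britney = normalized_series([1000, 1000, 1000, 1100])
obscure = normalized_series([2, 2, 2, 20])
print(britney[-1], obscure[-1])
```

With these made-up counts, the always-hot topic gains 100 mentions but barely moves relative to its baseline, while the obscure topic gains only 18 and lands far above its own. To use the subtraction variant instead, replace the division with `count - baseline`.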

Derive

Once you've normalized everything, figure out the slope of each topic. Take two consecutive points and measure the difference. A positive difference is trending up; a negative difference is trending down. Then you can compare the normalized differences and find out which topics are shooting upward in popularity compared to other topics - with each topic scaled appropriately to its own 'normal', which may be orders of magnitude different from other topics.
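The slope comparison can be sketched in a few lines of Python. This assumes each topic already has a normalized series (for example, count divided by baseline); the names `slope` and `hottest` and the sample data are hypothetical.

```python
def slope(series):
    """Difference between the two most recent points of a normalized series."""
    if len(series) < 2:
        return 0.0  # not enough data to measure a trend
    return series[-1] - series[-2]


def hottest(normalized, n=10):
    """Return the n topics whose normalized signal is rising fastest."""
    scores = {topic: slope(series) for topic, series in normalized.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical normalized series, oldest point first.
normalized = {
    "britney spears": [1.0, 1.0, 1.1],
    "obscure band": [1.0, 1.0, 10.0],
    "weather": [1.0, 0.9, 0.8],
}
print(hottest(normalized, n=2))
```

A single two-point difference is noisy, so in practice you might average the slope over several buckets; that refinement is left out here to keep the sketch close to the description above.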

This is really a first pass at the problem. There are more advanced techniques that you'll need to use (mostly a combination of the above with other algorithms, weighted to suit your needs), but it should be enough to get you started.

About the article

The article is about topic trending, but it's not about how to calculate what's hot and what's not. It's about how to process the huge amount of information that such an algorithm must handle at places like Lycos and Google. The space and time required to give each topic a counter, and to find each topic's counter when a search on it goes through, is huge. The article is about the challenges one faces when attempting such a task. It does mention the Britney effect, but it doesn't talk about how to overcome it.

As Nixuz points out (http://stackoverflow.com/questions/826330/britney-spears-problem-how-to-solve/826509#826509), this is also referred to as a Z or standard score.
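For reference, a standard score measures how many standard deviations an observation lies from the mean of its history: z = (x - mean) / standard deviation. The sketch below applies that to a topic's past counts. The function name `z_score` and the `min_sigma` guard are assumptions added for this example, and the counts are invented.

```python
from statistics import mean, pstdev


def z_score(history, latest, min_sigma=1.0):
    """Standard score of `latest` relative to a topic's past counts.

    history:   past counts for one topic (must be non-empty).
    min_sigma: assumed floor on the standard deviation, so a perfectly
               flat history does not cause a division by zero.
    """
    mu = mean(history)
    sigma = max(pstdev(history), min_sigma)
    return (latest - mu) / sigma


# An obscure topic jumping from about 10 to 20 mentions.
print(z_score([10, 12, 8, 10], 20))
# An always-hot topic moving from about 1000 to 1010 mentions.
print(z_score([1000, 1020, 980, 1000], 1010))
```

Both topics gain 10 mentions, but the first scores roughly ten times higher because the jump is large compared with its usual variation. Unlike dividing by the baseline, the z-score scales by how much a topic normally fluctuates instead of by how large it normally is.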
