Understanding algorithms for measuring trends


Question

What's the rationale behind the formula used in the hive_trend_mapper.py program of this Hadoop tutorial (http://www.cloudera.com/blog/2009/07/tracking-trends-with-hadoop-and-hive-on-ec2/) on calculating Wikipedia trends?

There are actually two components: a monthly trend and a daily trend. I'm going to focus on the daily trend, but similar questions apply to the monthly one.

In the daily trend, pageviews is an array of the number of page views per day for this topic, one element per day, and total_pageviews is the sum of this array:

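from math import log, sqrt

# Wrapper (name and signature assumed here) so the excerpt runs on its own
def calc_daily_trend(pageviews, total_pageviews):
    # pageviews for most recent day
    y2 = pageviews[-1]
    # pageviews for previous day
    y1 = pageviews[-2]
    # Simple baseline trend algorithm
    slope = y2 - y1
    trend = slope * log(1.0 + int(total_pageviews))
    error = 1.0 / sqrt(int(total_pageviews))
    return trend, error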

I know what it's doing superficially: it just looks at the change over the past day (slope), and scales it by the log of 1 + total_pageviews (log(1) == 0, so this scaling factor is non-negative). It can be seen as treating the month's total pageviews as a weight, but one that is tempered as it grows. This way, the total pageviews stop making much of a difference for things that are "popular enough," but at the same time big changes on insignificant topics don't get weighted as much.
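To make the "tempered weight" concrete, here is a minimal sketch that evaluates only the scaling factor from the snippet for a few totals. The helper name log_weight and the sample totals are made up for illustration; they are not part of the tutorial.

```python
from math import log

def log_weight(total_pageviews):
    # Same scaling factor as in the snippet: log(1 + total_pageviews).
    # The helper name is hypothetical, introduced only for this illustration.
    return log(1.0 + int(total_pageviews))

# Illustrative totals spanning six orders of magnitude (assumed values).
for total in (0, 10, 1_000, 1_000_000):
    print(total, round(log_weight(total), 2))
```

The weight is zero for a topic with no views and grows slowly afterwards: going from 1,000 to 1,000,000 total views roughly doubles it rather than multiplying it by 1,000.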

But why do this? Why do we want to discount things that were initially unpopular? Shouldn't big deltas matter more for items that have a low constant popularity, and less for items that are already popular (for which the big deltas might fall well within a fraction of a standard deviation)? As a strawman, why not simply take y2 - y1 and be done with it?

And what would the error be useful for? The tutorial doesn't really use it meaningfully again. Then again, it doesn't tell us how trend is used either. This is what's plotted in the end product, correct?

Where can I read up on the (preferably introductory) theoretical background here? Is there a name for this madness? Is this a textbook formula somewhere?

Thanks in advance for any answers (or discussion!).

Answer

As the in-line comment says, this is a simple "baseline trend algorithm", which basically means that before you compare the trends of two different pages, you have to establish a baseline. In many cases the mean value is used; it's straightforward to see if you plot the pageviews against the time axis. This method is widely used in monitoring water quality, air pollutants, etc. to detect any significant changes with respect to the baseline.
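As a rough illustration of the mean-as-baseline idea, the sketch below measures how far the latest day sits from the mean of the earlier days, in units of their standard deviation. This is an assumption about what such a baseline comparison could look like, not what hive_trend_mapper.py does; the function name and the sample counts are made up.

```python
from statistics import mean, pstdev

def deviation_from_baseline(pageviews):
    # Baseline = mean of all days except the most recent one.
    history = pageviews[:-1]
    baseline = mean(history)
    spread = pstdev(history)  # population standard deviation of the history
    latest = pageviews[-1]
    # With a perfectly flat history there is no spread to scale by.
    if spread == 0:
        return 0.0
    return (latest - baseline) / spread

# Made-up daily counts: a steady history around 10, then a jump to 20.
print(deviation_from_baseline([8, 12, 8, 12, 20]))
```

A value near zero means the latest day is in line with the baseline; a large positive value flags a jump that stands out from the page's usual variation.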

In the OP's case, the slope of pageviews is weighted by the log of total_pageviews. This more or less uses total_pageviews as a baseline correction for the slope. As Simon put it, this strikes a balance between two pages with very different total_pageviews. For example, say A has a slope of 500 over 1,000,000 total pageviews and B has a slope of 1,000 over 1,000. Taking the log means 1,000,000 counts as only about twice as important as 1,000 (rather than 1,000 times). If you consider the slope alone, A is less popular than B. With the weight, A's measure of popularity is about the same as B's. I think this is quite intuitive: A's slope is only 500 pageviews, but that's because it is saturating, so you still have to give it enough credit.
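The A versus B example can be checked numerically with the same formula as in the question's snippet. The function name trend is a stand-in chosen for this sketch; the numbers are the ones from the example.

```python
from math import log

def trend(slope, total_pageviews):
    # Same weighting as the snippet: slope scaled by log(1 + total).
    return slope * log(1.0 + int(total_pageviews))

trend_a = trend(500, 1_000_000)  # page A: small slope, huge total
trend_b = trend(1000, 1_000)     # page B: larger slope, small total

print(trend_a, trend_b)
print(abs(trend_a - trend_b) / trend_b)  # relative gap between the two
```

The two trend values land within a small fraction of a percent of each other, because log(1 + 1,000,000) is almost exactly twice log(1 + 1,000), which cancels B's doubled slope.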

As for the error, I believe it comes from the (relative) standard error, which has a factor of 1/sqrt(n), where n is the number of data points. In the code, the error is 1/sqrt(total_pageviews), and since total_pageviews = n * mean, that equals (1/sqrt(n)) * (1/sqrt(mean)). It roughly translates to: the more data points, the more accurate the trend. I don't think it is an exact mathematical formula, just a crude trend analysis algorithm; in any case, the relative value is what matters in this context.
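The factorization of the error term can be verified with a few lines. The daily counts below are made up for the check; only the 1/sqrt(total) expression comes from the snippet.

```python
from math import sqrt, isclose

# Made-up daily pageview counts, used only to check the identity.
pageviews = [120, 95, 130, 110, 145]

n = len(pageviews)
total_pageviews = sum(pageviews)
mean_views = total_pageviews / n

# Error as computed in the snippet.
error = 1.0 / sqrt(total_pageviews)

# The same quantity split into a sample-size factor and a mean factor.
factored = (1.0 / sqrt(n)) * (1.0 / sqrt(mean_views))

print(error, factored, isclose(error, factored))
```

Both expressions agree up to floating-point rounding, so the error shrinks either when there are more days of data or when the page gets more views per day.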

In summary, I believe it's just an empirical formula. More advanced treatments can be found in some biostatistics textbooks (the problem is very similar to monitoring the outbreak of a flu or the like).
