Estimating rate of occurrence of an event with exponential smoothing and irregular events


Problem description

Imagine that I have a set of measurements of $x$, taken by many processes, $x_0 \dots x_N$ at times $t_0 \dots t_N$. Let's assume that at time $t$ I want to make an estimate of the current value of $x$, based on the assumption that there is no long-term trend I know about and that $x$ can be predicted by an algorithm such as exponential smoothing. As we have many processes, and $N$ can get very large, I can't store more than a few values (e.g. the previous state).

One approach here would be to adapt the normal exponential smoothing algorithm. If samples were taken regularly, I would maintain an estimator $y_n$ such that:

$$y_n = \alpha\, y_{n-1} + (1 - \alpha)\, x_n$$

This approach is not great where the sampling is irregular, as many samples close together would have a disproportionate influence. Therefore this formula could be adapted to:

$$y_n = \alpha_n\, y_{n-1} + (1 - \alpha_n)\, x_n$$

where

$$\alpha_n = e^{-k (t_n - t_{n-1})}$$

That is, the smoothing constant is adjusted dynamically depending on the interval between the previous two samples. I'm happy with this approach and it seems to work. It's the first answer given here, and a good summary of these sorts of techniques is given by Eckner in this 2012 paper (PDF).
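The gap-dependent smoothing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `IrregularEWMA`, the choice to adopt the first sample as the initial estimate, and the time units are not from the question.

```python
import math


class IrregularEWMA:
    """Exponential smoothing whose constant depends on the gap between samples.

    Hypothetical helper: the name and the first-sample initialisation are
    assumptions made for this sketch.
    """

    def __init__(self, k):
        self.k = k      # decay rate in 1 / time units; larger k forgets faster
        self.y = None   # current estimate y_n
        self.t = None   # time t_n of the most recent sample

    def update(self, t, x):
        """Fold in a sample x taken at time t and return the new estimate."""
        if self.y is None:
            # First sample: nothing to smooth against, so adopt it directly.
            self.y = x
        else:
            # alpha_n = exp(-k * (t_n - t_{n-1}))
            alpha = math.exp(-self.k * (t - self.t))
            self.y = alpha * self.y + (1.0 - alpha) * x
        self.t = t
        return self.y
```

Only `y` and `t` are kept per process, which matches the constraint that just the previous state can be stored.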

Now, my question is as follows. I want to adapt the above to estimate the rate of an occurrence. Occasionally an event will occur. Using a similar exponential technique, I want to get an estimate of the rate at which that event occurs.

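Two obvious strategies are:

  • Use the first or second technique, with the delay between the last two events as the data series $x_n$.
  • Use the first or second technique, with the reciprocal of the delay between the last two events (i.e. the rate) as the data series $x_n$.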

Neither of these turns out to be a good strategy as far as I can tell. Firstly, take an event that occurs every 500ms (on the one hand) and an event that occurs alternately with a 200ms delay and an 800ms delay (on the other). Clearly these both occur twice a second, so the rate estimate given should be the same. Ignoring the time from the last sample seems foolhardy, so I'll concentrate on the second strategy. Using the delay (rather than the reciprocal) does not turn out to be a good predictor because simulating the 200ms/800ms sample stream produces an estimate of about 1.5 (on the basis that the average of reciprocals is not the reciprocal of the average).

But, far more significantly, neither strategy copes with what is actually happening in practice, which is that suddenly all the events stop for a long while. The 'latest' value of $y$ is thus the value at the last event, and the estimate of the rate never gets recalculated. Hence the rate appears constant. Of course, if I were analysing the data retrospectively, this wouldn't be an issue, but I'm analysing it in real time.

I realise another way to do this would be to run some thread periodically (e.g. every 10 seconds) and count the number of occurrences in this 10 second interval. This is very resource heavy on my end, as the statistics are not needed often, and I am loath to run a thread that polls everything, due to mutex issues. Therefore I'd like to (somehow) use an algorithm which adjusts the state read by (e.g.) the time since the last sample was taken. This seems a reasonable approach: if the performance is measured at times chosen independently of the samples, the measurement time will on average be halfway through the period between samples, so a very crude unsmoothed estimate of the rate would be half the reciprocal of the time since the last sample. To complicate things further, my measurement time will not be independent of the samples.

I have a feeling this has a simple answer, but it's eluding me. I suspect the correct route is to assume that the events are Poisson distributed, and to derive an estimate for $\lambda$ based on the interval since the last sample and some form of moving average, but my statistics is too rusty to make this work.

There is a near dupe of this question here, but the answer doesn't seem to be very satisfactory (I hope I explained why). I'd add that a Kalman filter seems way too heavyweight, given that I have one variable to estimate and know nothing about it. There are a number of other near dupes, most of which either suggest keeping large bins of values (not realistic here from a memory point of view) or do not address the two issues above.

Recommended answer

First, if you assume that the occurrence rate of the events itself is constant (or that you're only interested in its long-term average), then you can simply estimate it as:

$$\lambda^* = \frac{N}{t - t_0}$$

where $t$ is the current time, $t_0$ is the start of observations, $N$ is the number of events observed since $t_0$, and $\lambda^*$ is the estimate of the true frequency $\lambda$.

At this point, it's useful to note that the estimation formula given above may be reformulated as a ratio of integrals:

$$\lambda^* = \frac{\int \delta_{\text{event}}(\tau)\, d\tau}{\int 1\, d\tau}$$

where the variable of integration $\tau$ ranges from $t_0$ to $t$, and $\delta_{\text{event}}(\tau) = \sum_{i=1}^{N} \delta(\tau - t_i)$ is a sum of $N$ Dirac delta functions, with a single delta-peak at the occurrence time $t_i$ of each event $i$.

Of course, this would be a completely useless way to calculate $\lambda^*$, but it turns out to be a conceptually useful formulation. Basically, the way to view this formula is that the function $\delta_{\text{event}}(\tau)$ measures the instantaneous rate at which the number of events increases at time $\tau$, while the second integrand, which is just the constant 1, measures the rate at which time increases over time (which, of course, is simply one second per second).

OK, but what if the frequency $\lambda$ itself may change over time, and you want to estimate its current value, or at least its average over a recent period?

Using the ratio-of-integrals formulation given above, we can obtain such an estimate simply by weighting both integrands by some weighting function $w(\tau)$ which is biased towards recent times:

$$\lambda^*_{\text{recent}} = \frac{\int \delta_{\text{event}}(\tau)\, w(\tau)\, d\tau}{\int w(\tau)\, d\tau}$$

Now, all that remains is to pick a reasonable $w(\tau)$ such that these integrals simplify to something easy to calculate. As it turns out, if we choose an exponentially decaying weighting function of the form $w(\tau) = e^{k(\tau - t)}$ for some decay rate $k$, the integrals simplify to:

$$\lambda^*_{\text{recent}} = \frac{k \sum_{i=1}^{N} e^{k(t_i - t)}}{1 - e^{k(t_0 - t)}}$$
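As a check on this simplification, both integrals can be evaluated directly. This is a sketch that assumes the integration range $t_0 \le \tau \le t$ and events indexed $i = 1 \dots N$, as defined earlier in the answer:

$$\int_{t_0}^{t} e^{k(\tau - t)}\, d\tau = \frac{1 - e^{k(t_0 - t)}}{k}, \qquad \int_{t_0}^{t} \delta_{\text{event}}(\tau)\, e^{k(\tau - t)}\, d\tau = \sum_{i=1}^{N} e^{k(t_i - t)}$$

Dividing the second result by the first gives the weighted estimate: the factor $k$ moves to the numerator and $1 - e^{k(t_0 - t)}$ stays in the denominator.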

In the limit as $t_0 \to -\infty$ (i.e., in practice, when the total observation time $t - t_0$ is much larger than the weight decay timescale $1/k$), this further simplifies to just:

$$\lambda^*_{\text{recent}} = k \sum_{i=1}^{N} e^{k(t_i - t)}$$
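For intuition, the simplified sum can be evaluated by brute force from a stored list of event times. The sketch below is purely illustrative: the function name `recent_rate_bruteforce` is made up here, and keeping every event time is exactly what the question says is not affordable.

```python
import math


def recent_rate_bruteforce(event_times, t, k):
    """Evaluate k * sum(exp(k * (t_i - t))) over all events at or before t.

    Hypothetical reference implementation: it needs the full event history,
    so it is only useful for checking a cheaper incremental estimator.
    """
    return k * sum(math.exp(k * (ti - t)) for ti in event_times if ti <= t)
```

Each event contributes `k` at the moment it occurs and then fades with timescale `1/k`, so `k` trades responsiveness against noise.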

Alas, naïvely applying this formula would still require us to remember all the event times $t_i$. However, we can use the same trick as for calculating ordinary exponentially weighted averages: given the weighted average event rate $\lambda^*_{\text{recent}}(t')$ at some earlier time $t'$, and assuming that no new events have occurred between $t'$ and $t$, we can calculate the current weighted average event rate $\lambda^*_{\text{recent}}(t)$ simply as:

$$\lambda^*_{\text{recent}}(t) = e^{k(t' - t)}\, \lambda^*_{\text{recent}}(t')$$

Further, if we now observe a new event occurring at exactly time $t$, the weighted average event rate just after the event becomes:

$$\lambda^*_{\text{recent}}(t) = k + e^{k(t' - t)}\, \lambda^*_{\text{recent}}(t')$$

Thus, we get a very simple rule: all we need to store is the time $t_{\text{last}}$ of the previous observed event, and the estimated recent event rate $\lambda^*_{\text{last}}$ just after said event. (We may initialize these e.g. to $t_{\text{last}} = t_0$ and $\lambda^*_{\text{last}} = 0$; in fact, with $\lambda^*_{\text{last}} = 0$, the value of $t_{\text{last}}$ makes no difference, although for non-zero $\lambda^*_{\text{last}}$ it does.)

Whenever a new event occurs (at time $t_{\text{new}}$), we update these values as:

$$\lambda^*_{\text{last}} \leftarrow k + e^{k(t_{\text{last}} - t_{\text{new}})}\, \lambda^*_{\text{last}}$$

$$t_{\text{last}} \leftarrow t_{\text{new}}$$

and whenever we wish to know the recent event rate average at the current time $t$, we simply calculate it as:

$$\lambda^*(t) = e^{k(t_{\text{last}} - t)}\, \lambda^*_{\text{last}}$$
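The update and query rules above translate almost directly into code. The following is a minimal sketch, assuming timestamps are plain floats in consistent units; the class and method names (`DecayingRateEstimator`, `record_event`, `rate`) are invented for illustration.

```python
import math


class DecayingRateEstimator:
    """Exponentially decaying event-rate estimate using two stored values.

    Hypothetical names; the state is just t_last and lambda*_last.
    """

    def __init__(self, k, t0=0.0):
        self.k = k              # decay rate; 1/k is the averaging timescale
        self.t_last = t0        # time of the last observed event (or t0)
        self.rate_last = 0.0    # lambda*_last: estimate just after that event

    def record_event(self, t_new):
        """Update the state when an event occurs at time t_new."""
        decay = math.exp(self.k * (self.t_last - t_new))
        self.rate_last = self.k + decay * self.rate_last
        self.t_last = t_new

    def rate(self, t):
        """Return the estimate at time t >= t_last without changing state."""
        return math.exp(self.k * (self.t_last - t)) * self.rate_last
```

Because `rate` only reads the stored pair and decays it to the query time, the estimate keeps falling during a long silence, which addresses the "events suddenly stop" case from the question without a polling thread.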

Ps. To correct for the initial bias towards the (arbitrary) initial value of $\lambda^*_{\text{last}}$, we can add back the $1 / (1 - e^{k(t_0 - t)})$ correction term that we simplified out earlier when we assumed that $t \gg t_0$. To do that, simply start from $\lambda^*_{\text{last}} = 0$ at $t = t_0$, update $\lambda^*_{\text{last}}$ and $t_{\text{last}}$ as above, but calculate the estimated recent event rate average at time $t$ as:

$$\lambda^*_{\text{corr}}(t) = \frac{e^{k(t_{\text{last}} - t)}\, \lambda^*_{\text{last}}}{1 - e^{k(t_0 - t)}}$$

(Here, $t_0$ denotes the time at which you start measuring events, not the occurrence of the first event.)

This will eliminate the initial bias towards zero, at the cost of increasing the early variance. Here's an example plot showing the effects of the correction, for $k = 0.1$ and a true mean event rate of 2:



The red line shows $\lambda^*(t)$ without the initial bias correction (starting from $\lambda^*(t_0) = 0$), while the green line shows the bias-corrected estimate $\lambda^*_{\text{corr}}(t)$.
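The bias correction from the postscript can be sketched as a small standalone function. The name `corrected_rate` and its argument order are assumptions for this example, and raising an error at `t <= t0` is a choice made here because the ratio is undefined before any observation time has elapsed.

```python
import math


def corrected_rate(rate_last, t_last, t, t0, k):
    """Bias-corrected recent event rate at time t.

    rate_last and t_last are the stored lambda*_last and t_last, started
    from rate_last = 0 at t0. Hypothetical helper for illustration only.
    """
    if t <= t0:
        raise ValueError("t must be later than the start of observation t0")
    raw = math.exp(k * (t_last - t)) * rate_last   # uncorrected lambda*(t)
    coverage = 1.0 - math.exp(k * (t0 - t))        # weight seen since t0
    return raw / coverage
```

Shortly after `t0` the divisor is close to zero, so a single early event can swing the corrected value widely; that is the increased early variance mentioned above.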

Pps. As the plot above shows, $\lambda^*$, as calculated above, will not be a continuous function of time: it jumps up by $k$ whenever an event occurs, and decays exponentially towards zero when events do not occur.

If you'd prefer a smoother estimate, you can calculate an exponentially decaying average of $\lambda^*$ itself:

$$\lambda^{**}(t) = \frac{\int \lambda^*(\tau)\, e^{k_2(\tau - t)}\, d\tau}{\int e^{k_2(\tau - t)}\, d\tau}$$

where $\lambda^*$ is the exponentially decaying average event rate as calculated above, $k_2$ is the decay rate for the second average, and the integrals are over $-\infty < \tau \le t$.

This integral can also be calculated by a step-wise update rule as above:

$$\lambda^{**}_{\text{last}} \leftarrow W(\Delta t)\, \lambda^*_{\text{last}} + e^{-k_2 \Delta t}\, \lambda^{**}_{\text{last}}$$

$$\lambda^*_{\text{last}} \leftarrow k_1 + e^{-k_1 \Delta t}\, \lambda^*_{\text{last}}$$

$$t_{\text{last}} \leftarrow t_{\text{new}}$$

where $k_1$ and $k_2$ are the decay rates for the first and second averages, $\Delta t = t_{\text{new}} - t_{\text{last}}$ is the elapsed time between the events, and:

$$W(\Delta t) = \frac{k_2 \left( e^{-k_2 \Delta t} - e^{-k_1 \Delta t} \right)}{k_1 - k_2}$$

if $k_1 \ne k_2$, or

$$W(\Delta t) = k\, \Delta t\, e^{-k \Delta t}$$

if $k_1 = k_2 = k$ (the latter expression arising from the former as the limit when $(k_1 - k_2) \to 0$).
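The weight $W(\Delta t)$ can be recovered by integrating the decaying first average against the second kernel over the gap since the last event. This is a sketch under two assumptions: no events occur during the gap, so that $\lambda^*(t_{\text{last}} + s) = \lambda^*_{\text{last}}\, e^{-k_1 s}$, and the normalising integral of $e^{k_2(\tau - t)}$ over $-\infty < \tau \le t$ equals $1/k_2$:

$$k_2 \int_0^{\Delta t} \lambda^*_{\text{last}}\, e^{-k_1 s}\, e^{-k_2 (\Delta t - s)}\, ds = \lambda^*_{\text{last}}\, \frac{k_2 \left( e^{-k_2 \Delta t} - e^{-k_1 \Delta t} \right)}{k_1 - k_2}$$

The older part of the integral, from $-\infty$ up to the last event, is just the previous value $\lambda^{**}_{\text{last}}$ decayed by $e^{-k_2 \Delta t}$, which gives the second term of the update rule.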

To calculate the second average for an arbitrary point in time $t$, use the same formula:

$$\lambda^{**}(t) = W(\Delta t)\, \lambda^*_{\text{last}} + e^{-k_2 \Delta t}\, \lambda^{**}_{\text{last}}$$

except with $\Delta t = t - t_{\text{last}}$.
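The two-stage update and query can be sketched as follows. The names (`SmoothedRateEstimator`, `smoothed_rate`, `_weight`) are invented for this example, and using `math.isclose` to choose the equal-rate branch is an implementation choice made here, not something the answer specifies.

```python
import math


def _weight(dt, k1, k2):
    """W(dt): share of lambda*_last that flows into lambda** over a gap dt."""
    if math.isclose(k1, k2):
        return k1 * dt * math.exp(-k1 * dt)   # limiting form for k1 == k2
    return k2 * (math.exp(-k2 * dt) - math.exp(-k1 * dt)) / (k1 - k2)


class SmoothedRateEstimator:
    """Doubly smoothed event-rate estimate (hypothetical names)."""

    def __init__(self, k1, k2, t0=0.0):
        self.k1, self.k2 = k1, k2
        self.t_last = t0
        self.rate1_last = 0.0   # lambda*_last
        self.rate2_last = 0.0   # lambda**_last

    def record_event(self, t_new):
        dt = t_new - self.t_last
        # Update lambda** first: it must use lambda*_last from before the jump.
        self.rate2_last = (_weight(dt, self.k1, self.k2) * self.rate1_last
                           + math.exp(-self.k2 * dt) * self.rate2_last)
        self.rate1_last = self.k1 + math.exp(-self.k1 * dt) * self.rate1_last
        self.t_last = t_new

    def smoothed_rate(self, t):
        """Return lambda**(t) for t >= t_last without changing state."""
        dt = t - self.t_last
        return (_weight(dt, self.k1, self.k2) * self.rate1_last
                + math.exp(-self.k2 * dt) * self.rate2_last)
```

Since `W(0) = 0`, the smoothed value does not jump when an event is recorded; only its slope changes.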

As above, this estimate can also be bias-corrected by applying a suitable time-dependent scaling factor:

$$\lambda^{**}_{\text{corr}}(t) = \frac{\lambda^{**}(t)}{1 - S(t - t_0)}$$

where:

$$S(\Delta t) = \frac{k_1 e^{-k_2 \Delta t} - k_2 e^{-k_1 \Delta t}}{k_1 - k_2}$$

if $k_1 \ne k_2$, or

$$S(\Delta t) = (1 + k\, \Delta t)\, e^{-k \Delta t}$$

if $k_1 = k_2 = k$.
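The scaling factor for the second average can be sketched in the same style. The function names below are hypothetical, and the `ValueError` for a non-positive divisor is a choice made for this example.

```python
import math


def second_order_bias(dt, k1, k2):
    """S(dt): fraction of the double-average weight lying before t0."""
    if math.isclose(k1, k2):
        return (1.0 + k1 * dt) * math.exp(-k1 * dt)   # form for k1 == k2
    return (k1 * math.exp(-k2 * dt) - k2 * math.exp(-k1 * dt)) / (k1 - k2)


def corrected_smoothed_rate(smoothed, t, t0, k1, k2):
    """Divide lambda**(t) by 1 - S(t - t0) to remove the start-up bias."""
    coverage = 1.0 - second_order_bias(t - t0, k1, k2)
    if coverage <= 0.0:
        raise ValueError("t must be later than the start of observation t0")
    return smoothed / coverage
```

`S` starts at 1 when `t = t0` and decays towards 0, so the correction fades out once the observation window is long compared with both `1/k1` and `1/k2`.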

The plot below shows the effects of this smoothing. The red and green lines show $\lambda^*(t)$ and $\lambda^*_{\text{corr}}(t)$ as above, while the yellow and blue lines show $\lambda^{**}(t)$ and $\lambda^{**}_{\text{corr}}(t)$, as calculated with $k_1 = 0.1$ (as above) and $k_2 = 0.2$:

[Figure: plot of λ* and λ** over time, with and without initial bias correction: https://i.stack.imgur.com/mRuti.gif]
