Please identify this algorithm: probabilistic top-k elements in a data stream

Question

I remember hearing about the following algorithm some years back, but can't find any reference to it online. It identifies the top k elements (or heavy hitters) in a data stream of n elements using only m counters. This is particularly useful for finding top search terms, network abusers, etc. while using minimal memory.

The algorithm: for each element,

  1. If the element does not already have a counter and the number of counters is less than m, create a counter for the element and initialize it to 1.
  2. Else, if the element does have a counter, increment it.
  3. Else (the element has no counter and all m counters are in use), decrement an existing counter c. If c reaches 0, replace its corresponding element with the current element. (c is an index into the list of existing counters; it advances in round-robin fashion for each element that reaches this step.)
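The steps above can be sketched in Python. This is my own reading of the description (the class and method names are mine, not from any published source); in particular, I assume a replaced element starts back at a count of 1:

```python
class RoundRobinTopK:
    """Sketch of the algorithm as described above: m counters,
    with a round-robin decrement when an untracked element arrives."""

    def __init__(self, m):
        self.m = m
        self.counts = {}   # element -> counter value
        self.cursor = 0    # round-robin position for step 3

    def add(self, element):
        if element in self.counts:             # step 2: existing counter
            self.counts[element] += 1
        elif len(self.counts) < self.m:        # step 1: room for a new counter
            self.counts[element] = 1
        else:                                  # step 3: decrement counter c
            keys = list(self.counts)           # insertion-ordered in Python 3.7+
            victim = keys[self.cursor % len(keys)]
            self.cursor += 1
            self.counts[victim] -= 1
            if self.counts[victim] == 0:
                # Replace the exhausted counter's element with the current one.
                # (Assumption: the new element starts at 1.)
                del self.counts[victim]
                self.counts[element] = 1

    def top(self, k):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]
```

Note that when the decremented counter does not reach 0, the current element is simply dropped, which is where the probabilistic under-counting comes from.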

I have found many other similar algorithms (many of which are listed, though not described, in this wikipedia article about streaming algorithms), but not this one. I particularly like it because it is as simple to implement as it is to describe.

But I'd like to learn more about its probabilistic characteristics: if I'm only interested in the top 100 items, what effect does using 1,000 counters instead of 100 counters have?

Answer

You may be looking for the "Frequent" algorithm. It uses k - 1 counters to find all elements that exceed 1/k of the total, and was published in 1982 by Misra and Gries. It's a generalization of Boyer and Moore's (or Fischer-Salzberg's) "Majority" algorithm, where k is 2. These and related algorithms are introduced in a helpful article, "The Britney Spears Problem."

I give a detailed explanation of the algorithm elsewhere on StackOverflow, which I won't repeat here. The important point is that, after one pass, the counter values don't precisely indicate the frequency of an item; they can under-count by a margin that depends on the length of the stream and inversely on the number of counters (n / k). All of these algorithms (including Metwally's "SpaceSaving") require a second pass if you want an exact count rather than an estimate of frequency.
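For comparison, here is a compact sketch of the Misra-Gries "Frequent" algorithm mentioned above. Unlike the round-robin scheme in the question, it decrements every counter when an untracked element arrives and no counter slot is free:

```python
def misra_gries(stream, k):
    """Misra-Gries "Frequent": uses k - 1 counters; any element that
    occurs more than n/k times in a stream of n elements is guaranteed
    to end up with a counter. Reported counts may under-report the true
    frequency by up to n/k, so a second pass is needed for exact counts."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement every counter; drop any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

For example, with k = 2 (a single counter, the Boyer-Moore "Majority" case) on the stream `"aaaabbc"`, the survivor is `a` with a stored count of 1, even though its true frequency is 4, which illustrates the under-counting described above.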
