Please identify this algorithm: probabilistic top-k elements in a data stream

Question

I remember hearing about the following algorithm some years back, but can't find any reference to it online. It identifies the top k elements (or heavy hitters) in a data stream of n elements using only m counters. This is particularly useful for finding top search terms, network abusers, etc. while using minimal memory.

The algorithm: for each element,

  1. If the element does not already have a counter and the number of counters is less than m, create a counter for the element and initialize it to 1.
  2. Else, if the element does have a counter, increment it.
  3. Else (the element has no counter and all m counters are in use), decrement an existing counter c. If c reaches 0, replace its corresponding element with the current element. (c is an index into the list of existing counters; it advances in round-robin fashion for each element that reaches this step.)
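The steps above can be sketched in Python. This is my own reading of the description (the class and method names are mine, not from any published source); in particular, I assume a replaced element starts back at a count of 1:

```python
class RoundRobinTopK:
    """Sketch of the algorithm as described above: m counters,
    with a round-robin decrement when an untracked element arrives."""

    def __init__(self, m):
        self.m = m
        self.counts = {}   # element -> counter value
        self.cursor = 0    # round-robin position for step 3

    def add(self, element):
        if element in self.counts:             # step 2: existing counter
            self.counts[element] += 1
        elif len(self.counts) < self.m:        # step 1: room for a new counter
            self.counts[element] = 1
        else:                                  # step 3: decrement counter c
            keys = list(self.counts)           # insertion-ordered in Python 3.7+
            victim = keys[self.cursor % len(keys)]
            self.cursor += 1
            self.counts[victim] -= 1
            if self.counts[victim] == 0:
                # Replace the exhausted counter's element with the current one.
                # (Assumption: the new element starts at 1.)
                del self.counts[victim]
                self.counts[element] = 1

    def top(self, k):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]
```

Note that when the decremented counter does not reach 0, the current element is simply dropped, which is where the probabilistic under-counting comes from.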

I have found many other similar algorithms (many of which are listed, though not described, in this wikipedia article about streaming algorithms), but not this one. I particularly like it because it is as simple to implement as it is to describe.

But I'd like to learn more about its probabilistic characteristics: if I'm only interested in the top 100 items, what effect does using 1,000 counters instead of 100 counters have?

Answer

You may be looking for the "Frequent" algorithm. It uses k - 1 counters to find all elements that exceed 1/k of the total, and was published in 1982 by Misra and Gries. It's a generalization of Boyer and Moore's (or Fischer-Salzberg's) "Majority" algorithm, where k is 2. These and related algorithms are introduced in a helpful article, "The Britney Spears Problem."

I give a detailed explanation of the algorithm elsewhere on StackOverflow, which I won't repeat here. The important point is that, after one pass, the counter values don't precisely indicate the frequency of an item; they can under-count by a margin that depends on the length of the stream and inversely on the number of counters (n / k). All of these algorithms (including Metwally's "SpaceSaving") require a second pass if you want an exact count rather than an estimate of frequency.
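For comparison, here is a compact sketch of the Misra-Gries "Frequent" algorithm mentioned above. Unlike the round-robin scheme in the question, it decrements every counter when an untracked element arrives and no counter slot is free:

```python
def misra_gries(stream, k):
    """Misra-Gries "Frequent": uses k - 1 counters; any element that
    occurs more than n/k times in a stream of n elements is guaranteed
    to end up with a counter. Reported counts may under-report the true
    frequency by up to n/k, so a second pass is needed for exact counts."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement every counter; drop any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

For example, with k = 2 (a single counter, the Boyer-Moore "Majority" case) on the stream `"aaaabbc"`, the survivor is `a` with a stored count of 1, even though its true frequency is 4, which illustrates the under-counting described above.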
