从整数流中查找运行中位数 [英] Find running median from a stream of integers

查看:27
本文介绍了从整数流中查找运行中位数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

<块引用>

可能的重复:
C 中的滚动中值算法

<块引用>

鉴于整数是从数据流中读取的.以有效的方式查找到目前为止读取的元素的中位数.

我读过的解决方案:我们可以在左侧使用最大堆来表示小于有效中位数的元素,在右侧使用最小堆来表示大于有效中位数的元素.

处理传入的元素后,堆中元素的数量最多相差1个元素.当两个堆包含相同数量的元素时,我们将堆的根数据的平均值作为有效中位数.当堆不平衡时,我们从包含更多元素的堆的根中选择有效中位数.

但是我们如何构建最大堆和最小堆,即我们如何知道这里的有效中位数?我认为我们会在最大堆中插入 1 个元素,然后在最小堆中插入下一个元素,对所有元素依此类推.如果我错了,请纠正我.

解决方案

从流数据中寻找运行中位数有多种不同的解决方案,我将在答案的最后简要讨论它们.

问题是关于特定解决方案(最大堆/最小堆解决方案)的细节,基于堆的解决方案的工作原理解释如下:

对于前两个元素,向左侧的 maxHeap 添加较小的一个,向右侧的 minHeap 添加较大的一个.然后一一处理流数据,

第 1 步:将下一项添加到其中一个堆中如果下一项小于 maxHeap root,则将其添加到 maxHeap,否则将其添加到 minHeap第 2 步:平衡堆(在此步骤之后,堆将被平衡或其中之一将包含另外 1 个项目)如果一个堆中的元素数量比另一个多大于 1,从包含更多元素的元素中删除根元素,然后添加到另一个

然后在任何给定时间,您都可以像这样计算中位数:

 如果堆包含等量的元素;中位数 =(maxHeap 的根 + minHeap 的根)/2别的中位数 = 具有更多元素的堆的根

现在我将按照答案开头所承诺的一般性地讨论问题.从数据流中寻找运行中位数是一个棘手的问题,对于一般情况,有效地找到具有内存限制的精确解决方案可能是不可能的.另一方面,如果数据具有一些我们可以利用的特征,我们就可以开发出高效的专业解决方案.例如,如果我们知道数据是整数类型,那么我们可以使用计数排序,它可以为您提供恒定内存恒定时间算法.基于堆的解决方案是一种更通用的解决方案,因为它也可以用于其他数据类型(双精度).最后,如果不需要精确的中位数并且近似值就足够了,您可以尝试估计数据的概率密度函数并使用它估计中位数.

Possible Duplicate:
Rolling median algorithm in C

Given that integers are read from a data stream. Find median of elements read so far in efficient way.

Solution I have read: We can use a max heap on left side to represent elements that are less than the effective median, and a min heap on right side to represent elements that are greater than the effective median.

After processing an incoming element, the number of elements in heaps differ at most by 1 element. When both heaps contain the same number of elements, we find the average of heap's root data as effective median. When the heaps are not balanced, we select the effective median from the root of heap containing more elements.

But how would we construct a max heap and min heap i.e. how would we know the effective median here? I think that we would insert 1 element in max-heap and then the next 1 element in min-heap, and so on for all the elements. Correct me If I am wrong here.

解决方案

There are a number of different solutions for finding running median from streamed data, I will briefly talk about them at the very end of the answer.

The question is about the details of the a specific solution (max heap/min heap solution), and how heap based solution works is explained below:

For the first two elements add smaller one to the maxHeap on the left, and bigger one to the minHeap on the right. Then process stream data one by one,

Step 1: Add next item to one of the heaps

   if next item is smaller than maxHeap root add it to maxHeap,
   else add it to minHeap

Step 2: Balance the heaps (after this step heaps will be either balanced or
   one of them will contain 1 more item)

   if number of elements in one of the heaps is greater than the other by
   more than 1, remove the root element from the one containing more elements and
   add to the other one

Then at any given time you can calculate median like this:

   If the heaps contain equal amount of elements;
     median = (root of maxHeap + root of minHeap)/2
   Else
     median = root of the heap with more elements

Now I will talk about the problem in general as promised in the beginning of the answer. Finding running median from a stream of data is a tough problem, and finding an exact solution with memory constraints efficiently is probably impossible for the general case. On the other hand, if the data has some characteristics we can exploit, we can develop efficient specialized solutions. For example, if we know that the data is an integral type, then we can use counting sort, which can give you a constant memory constant time algorithm. Heap based solution is a more general solution because it can be used for other data types (doubles) as well. And finally, if the exact median is not required and an approximation is enough, you can just try to estimate a probability density function for the data and estimate median using that.

这篇关于从整数流中查找运行中位数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆