Find running median from a stream of integers


Problem description

  


Possible Duplicate:
Rolling median algorithm in C

Given integers read from a data stream, find the median of the elements read so far in an efficient way.

The solution I have read: we can use a max-heap on the left side to represent the elements that are less than the effective median, and a min-heap on the right side to represent the elements that are greater than the effective median.

After processing an incoming element, the numbers of elements in the two heaps differ by at most 1. When both heaps contain the same number of elements, we take the average of the two heaps' roots as the effective median. When the heaps are not balanced, we take the effective median from the root of the heap containing more elements.

But how would we construct the max-heap and the min-heap, i.e. how would we know the effective median here? I think that we would insert one element into the max-heap, then the next element into the min-heap, and so on for all the elements. Correct me if I am wrong here.

Solution

There are a number of different solutions for finding the running median of streamed data. I will briefly talk about them at the very end of the answer.

The question is about the details of one specific solution (the max-heap/min-heap solution), and how the heap-based solution works is explained below:

For the first two elements, add the smaller one to the maxHeap on the left and the bigger one to the minHeap on the right. Then process the stream data one item at a time:

Step 1: Add next item to one of the heaps

   if next item is smaller than maxHeap root add it to maxHeap,
   else add it to minHeap

Step 2: Balance the heaps (after this step heaps will be either balanced or
   one of them will contain 1 more item)

   if number of elements in one of the heaps is greater than the other by
   more than 1, remove the root element from the one containing more elements and
   add to the other one

Then at any given time you can calculate median like this:

   If the heaps contain the same number of elements:
     median = (root of maxHeap + root of minHeap)/2
   Else
     median = root of the heap with more elements
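
The two steps and the median rule above can be sketched in Python roughly as follows. This is a minimal illustration, not the answerer's own code: the class name `RunningMedian` and its methods are hypothetical, the max-heap is simulated by storing negated values in `heapq` (which only provides a min-heap), and instead of special-casing the first two elements the sketch sends the first item to the max-heap and lets the balancing step sort things out.

```python
import heapq


class RunningMedian:
    """Hypothetical helper: running median via a max-heap and a min-heap."""

    def __init__(self):
        self._low = []   # max-heap of the smaller half, stored as negated values
        self._high = []  # min-heap of the larger half

    def add(self, x):
        # Step 1: add the item to one of the heaps.
        if not self._low or x < -self._low[0]:
            heapq.heappush(self._low, -x)
        else:
            heapq.heappush(self._high, x)

        # Step 2: rebalance so the sizes differ by at most 1.
        if len(self._low) > len(self._high) + 1:
            heapq.heappush(self._high, -heapq.heappop(self._low))
        elif len(self._high) > len(self._low) + 1:
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self):
        if not self._low and not self._high:
            raise ValueError("no elements have been added yet")
        if len(self._low) == len(self._high):
            # Equal sizes: average the two roots.
            return (-self._low[0] + self._high[0]) / 2
        # Otherwise the root of the larger heap is the median.
        if len(self._low) > len(self._high):
            return -self._low[0]
        return self._high[0]


if __name__ == "__main__":
    rm = RunningMedian()
    for value in [5, 15, 1, 3]:
        rm.add(value)
        print(value, rm.median())
```

Each `add` costs O(log n) for the heap operations and `median` is O(1). Note that the alternating insertion proposed in the question would not work on its own, because an item must go to the heap determined by comparing it with the current root.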

Now I will talk about the problem in general, as promised at the beginning of the answer. Finding the running median of a data stream is a tough problem, and finding an exact solution efficiently under memory constraints is probably impossible in the general case. On the other hand, if the data has characteristics we can exploit, we can develop efficient specialized solutions. For example, if we know that the data is an integral type with a bounded range of values, we can use counting sort (keeping a count per possible value), whose memory use and per-query time depend only on the size of that range and not on the length of the stream. The heap-based solution is more general because it can be used for other data types (such as doubles) as well. Finally, if the exact median is not required and an approximation is enough, you can try to estimate a probability density function for the data and estimate the median from that.
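
As a rough illustration of the counting idea for integers, here is a minimal Python sketch. It assumes, beyond what the answer states, that every value lies in a known range `0..max_value`; the class name `CountingMedian` and the parameter `max_value` are hypothetical. Memory is proportional to the size of the range and each median query scans the counts, so both are independent of how many items have been streamed.

```python
class CountingMedian:
    """Hypothetical helper: running median for integers in 0..max_value."""

    def __init__(self, max_value):
        self._counts = [0] * (max_value + 1)  # one counter per possible value
        self._n = 0                           # total number of items seen

    def add(self, x):
        if not 0 <= x < len(self._counts):
            raise ValueError("value outside the assumed range")
        self._counts[x] += 1
        self._n += 1

    def _kth(self, k):
        # Return the value at 0-based rank k in sorted order.
        seen = 0
        for value, count in enumerate(self._counts):
            seen += count
            if seen > k:
                return value
        raise ValueError("rank out of range")

    def median(self):
        if self._n == 0:
            raise ValueError("no elements have been added yet")
        mid = self._n // 2
        if self._n % 2 == 1:
            return self._kth(mid)
        return (self._kth(mid - 1) + self._kth(mid)) / 2


if __name__ == "__main__":
    cm = CountingMedian(max_value=15)
    for value in [5, 15, 1, 3]:
        cm.add(value)
        print(value, cm.median())
```

This only pays off when the range is small relative to the stream; for wide ranges or non-integer data the heap-based approach remains the more practical choice.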
