Fast algorithm for repeated calculation of percentile?

Problem description

In an algorithm I have to calculate the 75th percentile of a data set every time I add a value. Right now I am doing this:

1. Get value x.
2. Insert x at the back of an already sorted array.
3. Swap x down until the array is sorted again.
4. Read the element at position array[array.size * 3/4].

Step 3 is O(n) and the rest is O(1), but this is still quite slow, especially as the array gets larger. Is there any way to optimize this?
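
For reference, the current approach can be sketched as follows. This is only an illustration of the baseline: the class name `SortedPercentile75` is hypothetical, the element type is assumed to be `double`, and it locates the slot with `std::upper_bound` instead of swapping down from the back, which still leaves the O(n) shift inside `insert`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline sketch: keep a sorted vector and read index size * 3 / 4.
// SortedPercentile75 is a hypothetical name used only for this example.
class SortedPercentile75 {
public:
  // O(log n) to find the slot, O(n) to shift the tail while inserting.
  void add(double x) {
    auto pos = std::upper_bound(_sorted.begin(), _sorted.end(), x);
    _sorted.insert(pos, x);
  }

  // O(1): the element at array[array.size * 3 / 4].
  // Requires at least one added value.
  double get() const {
    assert(!_sorted.empty());
    return _sorted[_sorted.size() * 3 / 4];
  }

private:
  std::vector<double> _sorted;
};
```

Adding the values 5, 1, 4, 2, 3, 8, 7, 6 and then calling `get()` returns 7, the element at index 8 * 3/4 = 6 of the sorted data.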

UPDATE

Thanks Nikita! Since I am using C++, this is the easiest solution to implement. Here is the code:

    #include <algorithm>   // std::push_heap, std::pop_heap
    #include <cstddef>     // std::size_t
    #include <functional>  // std::less, std::greater
    #include <vector>

    template<class T>
    class IterativePercentile {
    public:
      /// Percentile has to be in the range [0, 1)
      explicit IterativePercentile(double percentile)
        : _percentile(percentile)
      { }

      // Adds a number in O(log(n))
      void add(const T& x) {
        if (_lower.empty() || x <= _lower.front()) {
          _lower.push_back(x);
          std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
        } else {
          _upper.push_back(x);
          std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
        }

        // _lower must hold floor(n * percentile) + 1 elements so that its
        // maximum is the element at sorted index floor(n * percentile).
        const std::size_t size_lower =
            static_cast<std::size_t>((_lower.size() + _upper.size()) * _percentile) + 1;
        if (_lower.size() > size_lower) {
          // lower to upper
          std::pop_heap(_lower.begin(), _lower.end(), std::less<T>());
          _upper.push_back(_lower.back());
          std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
          _lower.pop_back();
        } else if (_lower.size() < size_lower) {
          // upper to lower
          std::pop_heap(_upper.begin(), _upper.end(), std::greater<T>());
          _lower.push_back(_upper.back());
          std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
          _upper.pop_back();
        }
      }

      /// Access the percentile in O(1); requires at least one added value
      const T& get() const {
        return _lower.front();
      }

      void clear() {
        _lower.clear();
        _upper.clear();
      }

    private:
      double _percentile;
      std::vector<T> _lower;  // max-heap holding the smallest elements
      std::vector<T> _upper;  // min-heap holding the largest elements
    };

Solution

You can do it with two heaps. I am not sure whether there is a less 'contrived' solution, but this one gives O(log n) time per insertion, and heaps are included in the standard libraries of most programming languages.

The first heap (heap A) contains the smallest 75% of the elements; the other heap (heap B) contains the rest (the biggest 25%). Heap A has its biggest element on top (a max-heap), and heap B has its smallest element on top (a min-heap).

1. Adding an element

See if the new element x is <= max(A). If it is, add it to heap A; otherwise add it to heap B.
If x went into heap A and A became too big (it holds more than 75% of the elements), remove the biggest element from A (O(log n)) and add it to heap B (also O(log n)).
Do the same in the other direction if heap B became too big: move its smallest element to heap A.

2. Finding the "0.75 median"

Just take the largest element from A (or the smallest from B). This takes O(log n) or O(1) time, depending on the heap implementation.
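
The two steps can be sketched with `std::priority_queue` as below. This is a minimal illustration, not the only way to do it: the class name `TwoHeapPercentile75` is hypothetical, the element type is assumed to be `double`, and the size rule is an assumption that follows the asker's code (heap A holds floor(n * 3/4) + 1 elements, so the answer is read from max(A)).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical sketch of the two-heap scheme; names are illustrative.
class TwoHeapPercentile75 {
public:
  // Step 1: add an element in O(log n).
  void add(double x) {
    if (_a.empty() || x <= _a.top()) {
      _a.push(x);  // belongs with the smallest elements
    } else {
      _b.push(x);  // belongs with the largest elements
    }
    // Assumed size rule: A holds floor(n * 3 / 4) + 1 elements, so that
    // max(A) is the element at sorted index n * 3 / 4.
    const std::size_t n = _a.size() + _b.size();
    const std::size_t target = n * 3 / 4 + 1;
    if (_a.size() > target) {         // A too big: move max(A) to B
      _b.push(_a.top());
      _a.pop();
    } else if (_a.size() < target) {  // B too big: move min(B) to A
      _a.push(_b.top());
      _b.pop();
    }
  }

  // Step 2: read the 75th percentile in O(1).
  // Requires at least one added value.
  double get() const {
    assert(!_a.empty());
    return _a.top();
  }

private:
  // Heap A: max-heap (the default ordering of std::priority_queue).
  std::priority_queue<double> _a;
  // Heap B: min-heap.
  std::priority_queue<double, std::vector<double>, std::greater<double>> _b;
};
```

Because each insertion changes n by one, the target size of A changes by at most one as well, so a single move between the heaps is enough to restore the balance.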

Edit

As Dolphin noted, we need to specify precisely how big each heap should be for every n if we want an exact answer. For example, if size(A) = floor(n * 0.75) and size(B) is the rest, then for every n > 0, array[array.size * 3/4] = min(B).
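
One way to check a size rule like this is to compare against a reference value computed directly. The sketch below is an assumption-labelled helper, not part of the answer: `heapASize` and `referencePercentile75` are hypothetical names, and the reference uses `std::nth_element`, which places at a given position the element a full sort would put there. Under size(A) = floor(n * 0.75), heap A covers sorted indices 0 to size(A) - 1, so min(B) sits at sorted index size(A), which is the integer expression n * 3 / 4.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Size of heap A under the rule size(A) = floor(n * 0.75).
// Hypothetical helper name used only for this sketch.
std::size_t heapASize(std::size_t n) { return n * 3 / 4; }

// Reference value: the element that a full sort would put at index n * 3 / 4.
// Takes the vector by value because std::nth_element reorders it.
double referencePercentile75(std::vector<double> values) {
  assert(!values.empty());
  auto nth = values.begin() + static_cast<std::ptrdiff_t>(heapASize(values.size()));
  std::nth_element(values.begin(), nth, values.end());
  return *nth;
}
```

For example, with n = 5 heap A holds 3 elements and min(B) is the element at sorted index 3; for the values 5, 1, 4, 2, 3 that element is 4, which is what `referencePercentile75` returns.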
