Fast algorithm for repeated calculation of percentile?


Problem description


    In an algorithm I have to calculate the 75th percentile of a data set whenever I add a value. Right now I am doing this:

    1. Get value x
    2. Insert x at the back of an already sorted array
    3. Swap x down until the array is sorted again
    4. Read the element at position array[array.size * 3/4]

    Point 3 is O(n), and the rest is O(1), but this is still quite slow, especially if the array gets larger. Is there any way to optimize this?
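For reference, the four steps above can be sketched roughly as follows. This is only an illustration of the baseline being described, not the asker's actual code: the helper name `insert_and_read_p75` is hypothetical, and it uses `std::upper_bound` plus `std::vector::insert` rather than manual swapping. The binary search is cheap, but the insertion still shifts elements, which is where the O(n) cost comes from.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): inserts x into an
// already sorted vector and returns the element at index size * 3 / 4.
template <class T>
const T& insert_and_read_p75(std::vector<T>& sorted, const T& x) {
  // Binary search locates the insertion point in O(log n) comparisons...
  auto pos = std::upper_bound(sorted.begin(), sorted.end(), x);
  // ...but insert() shifts every later element, so the step stays O(n).
  sorted.insert(pos, x);
  // Integer division gives floor(size * 3 / 4), the index used in step 4.
  return sorted[sorted.size() * 3 / 4];
}
```

Calling this once per incoming value reproduces the behaviour described in the question, including the linear cost per insertion that the answer below avoids.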

    UPDATE

    Thanks Nikita! Since I am using C++, this is the easiest solution to implement. Here is the code:

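    #include <algorithm>   // std::push_heap, std::pop_heap
    #include <functional>  // std::less, std::greater
    #include <vector>

    template<class T>
    class IterativePercentile {
    public:
      /// Percentile has to be in the range [0, 1)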
      IterativePercentile(double percentile)
        : _percentile(percentile)
      { }
    
      // Adds a number in O(log(n))
      void add(const T& x) {
        if (_lower.empty() || x <= _lower.front()) {
          _lower.push_back(x);
          std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
        } else {
          _upper.push_back(x);
          std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
        }
    
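        // Target size of _lower is floor(n * percentile) + 1, so that its
        // maximum is the element at sorted index floor(n * percentile).
        unsigned size_lower = (unsigned)((_lower.size() + _upper.size()) * _percentile) + 1;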
        if (_lower.size() > size_lower) {
          // lower to upper
          std::pop_heap(_lower.begin(), _lower.end(), std::less<T>());
          _upper.push_back(_lower.back());
          std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
          _lower.pop_back();
        } else if (_lower.size() < size_lower) {
          // upper to lower
          std::pop_heap(_upper.begin(), _upper.end(), std::greater<T>());
          _lower.push_back(_upper.back());
          std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
          _upper.pop_back();
        }            
      }
    
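      /// Access the percentile in O(1).
      /// Precondition: add() has been called at least once; calling
      /// front() on an empty vector is undefined behavior.
      const T& get() const {
        return _lower.front();
      }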
    
      void clear() {
        _lower.clear();
        _upper.clear();
      }
    
    private:
      double _percentile;
      std::vector<T> _lower;
      std::vector<T> _upper;
    };
    

    Solution

    You can do it with two heaps. I am not sure whether there is a less 'contrived' solution, but this one gives O(log n) time per insertion, and heaps are included in the standard libraries of most programming languages.

    The first heap (heap A) contains the smallest 75% of the elements; the other heap (heap B) contains the rest (the biggest 25%). Heap A is a max-heap, with its biggest element on top; heap B is a min-heap, with its smallest element on top.

    1. Adding an element

    Check whether the new element x is <= max(A). If it is, add it to heap A; otherwise, add it to heap B.
    If x went into heap A and A is now too big (it holds more than 75% of the elements), remove the biggest element from A (O(log n)) and add it to heap B (also O(log n)).
    Do the same in the other direction if heap B became too big: move its smallest element into A.

    2. Finding the "0.75 median"

    Just take the largest element from A (or the smallest from B). This takes O(log n) or O(1) time, depending on the heap implementation; with an array-based binary heap the top element is stored first, so reading it is O(1).

    Edit
    As Dolphin noted, we need to specify precisely how big each heap should be for every n if we want an exact answer. For example, if size(A) = floor(n * 0.75) and B holds the rest, then for every n > 0, array[array.size * 3/4] = min(B).
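To make the sizing rule concrete, here is a minimal sketch that keeps size(A) = floor(n * 0.75) and reads the answer from the top of B. It is an illustration under stated assumptions, not the asker's class: the name `P75Tracker` is hypothetical, the percentile is fixed at 0.75, values are `double`, and `std::priority_queue` stands in for the manual `std::push_heap` / `std::pop_heap` calls.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical class (not from the original post): tracks the element that
// would sit at index floor(n * 3 / 4) of the sorted data.
class P75Tracker {
 public:
  // O(log n): one push, plus at most one pop and push to rebalance.
  void add(double x) {
    // Route x to A if it is not above A's current maximum, else to B.
    if (!a_.empty() && x <= a_.top()) {
      a_.push(x);
    } else {
      b_.push(x);
    }
    const std::size_t n = a_.size() + b_.size();
    const std::size_t want_a = n * 3 / 4;  // floor(n * 0.75)
    if (a_.size() > want_a) {
      // A is one too big: move its maximum into B.
      b_.push(a_.top());
      a_.pop();
    } else if (a_.size() < want_a) {
      // A is one too small: move B's minimum into A.
      a_.push(b_.top());
      b_.pop();
    }
  }

  // O(1). Precondition: add() has been called at least once.
  double get() const { return b_.top(); }

 private:
  std::priority_queue<double> a_;  // max-heap: smallest floor(0.75 n) values
  std::priority_queue<double, std::vector<double>,
                      std::greater<double>> b_;  // min-heap: the rest
};
```

Because each insertion changes the target size of A by at most one, a single move between the heaps is enough to restore the invariant. B is never empty once a value has been added, since n - floor(n * 0.75) is at least 1 for n > 0.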
