Fast algorithm for repeated calculation of percentile?

Views: 231 Published: 2017/4/3 11:26:40 algorithm optimization data-structures percentile


Problem description

 
In an algorithm I have to calculate the 75th percentile of a data set whenever I add a value. Right now I am doing this:

1. Get value x
2. Insert x at the back of an already sorted array
3. Swap x down until the array is sorted again
4. Read the element at position array[array.size * 3/4]

Point 3 is O(n) and the rest is O(1), but this is still quite slow, especially as the array gets larger. Is there any way to optimize this?
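
For reference, the baseline described above could look roughly like the sketch below. The class name `SortedPercentile` and the use of `double` values are assumptions made for illustration; the question does not show its own code for this step. It uses `std::upper_bound` plus `insert` rather than a manual swap loop, which has the same O(n) cost per insertion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline sketch: keep a sorted vector and read index size * 3 / 4.
// `SortedPercentile` is a hypothetical name used only for this example.
class SortedPercentile {
public:
  // O(n) overall: finding the slot is O(log n), but insert() shifts the tail.
  void add(double x) {
    auto pos = std::upper_bound(sorted_.begin(), sorted_.end(), x);
    sorted_.insert(pos, x);
  }

  // O(1): plain index into the sorted data. Requires at least one element.
  double get() const {
    assert(!sorted_.empty());
    return sorted_[sorted_.size() * 3 / 4];
  }

private:
  std::vector<double> sorted_;  // always kept in ascending order
};
```

The shifting done by `insert` is the part that grows with the array, which is what the question is asking to avoid.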

UPDATE

Thanks, Nikita! Since I am using C++, this is the easiest solution to implement. Here is the code:
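#include <algorithm>   // std::push_heap, std::pop_heap
#include <cstddef>     // std::size_t
#include <functional>  // std::less, std::greater
#include <vector>

template<class T>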
class IterativePercentile {
public:
  /// Percentile has to be in range [0, 1)
  IterativePercentile(double percentile)
    : _percentile(percentile)
  { }

  // Adds a number in O(log(n))
  void add(const T& x) {
    if (_lower.empty() || x <= _lower.front()) {
      _lower.push_back(x);
      std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
    } else {
      _upper.push_back(x);
      std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
    }

    // Target size of the lower heap: floor(n * percentile) + 1
    std::size_t size_lower = static_cast<std::size_t>((_lower.size() + _upper.size()) * _percentile) + 1;
    if (_lower.size() > size_lower) {
      // lower to upper
      std::pop_heap(_lower.begin(), _lower.end(), std::less<T>());
      _upper.push_back(_lower.back());
      std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
      _lower.pop_back();
    } else if (_lower.size() < size_lower) {
      // upper to lower
      std::pop_heap(_upper.begin(), _upper.end(), std::greater<T>());
      _lower.push_back(_upper.back());
      std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
      _upper.pop_back();
    }
  }

  /// Access the percentile in O(1); must not be called before the first add()
  const T& get() const {
    return _lower.front();
  }

  void clear() {
    _lower.clear();
    _upper.clear();
  }

private:
  double _percentile;
  std::vector<T> _lower;
  std::vector<T> _upper;
};

Solution
You can do it with two heaps. I'm not sure if there's a less 'contrived' solution, but this one gives O(log n) time per insertion, and heaps are included in the standard libraries of most programming languages.

The first heap (heap A) holds the smallest 75% of the elements; the other heap (heap B) holds the rest (the biggest 25%). A keeps its biggest element on top (a max-heap), and B keeps its smallest element on top (a min-heap).

Adding an element
See if the new element x is <= max(A). If it is, add it to heap A; otherwise, add it to heap B.

Now, if we added x to heap A and it became too big (it holds more than 75% of the elements), we need to remove the biggest element from A (O(log n)) and add it to heap B (also O(log n)).

Similarly, if heap B became too big, remove its smallest element and add it to heap A.

Finding the "0.75 median"
Just take the largest element from A (or the smallest from B). This takes O(log n) or O(1) time, depending on the heap implementation.
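
The steps above can be sketched with `std::priority_queue`, as below. This is only an illustration under a few assumptions: the class name `TwoHeapPercentile75` is made up, values are `double`, the percentile is fixed at 75%, and heap A is sized to floor(3n/4) + 1 elements so that max(A) is the element at sorted index n * 3 / 4 (the same convention as the code in the question's update).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical name; a sketch of the two-heap idea for the 75th percentile.
class TwoHeapPercentile75 {
public:
  // O(log n): push into one heap, then move at most one element across.
  void add(double x) {
    if (a_.empty() || x <= a_.top()) {
      a_.push(x);
    } else {
      b_.push(x);
    }

    // Assumed sizing rule: A holds floor(3n/4) + 1 elements, so max(A)
    // is the element at index n * 3 / 4 of the sorted data.
    const std::size_t n = a_.size() + b_.size();
    const std::size_t target = n * 3 / 4 + 1;
    if (a_.size() > target) {         // A too big: move max(A) to B
      b_.push(a_.top());
      a_.pop();
    } else if (a_.size() < target) {  // B too big: move min(B) to A
      a_.push(b_.top());
      b_.pop();
    }
  }

  // O(1) with std::priority_queue. Requires at least one element.
  double get() const {
    assert(!a_.empty());
    return a_.top();
  }

private:
  // Max-heap holding the smaller part of the data.
  std::priority_queue<double> a_;
  // Min-heap holding the larger part of the data.
  std::priority_queue<double, std::vector<double>, std::greater<double>> b_;
};
```

Because each `add` changes the total by one element, the sizes can be off by at most one afterwards, so a single move between the heaps is enough to restore the balance.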

edit

As Dolphin noted, we need to specify precisely how big each heap should be for every n (if we want a precise answer). For example, if size(A) = floor(n * 0.75) and size(B) is the rest, then for every n > 0, array[array.size * 3/4] = min(B).
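
To make the sizing rule concrete, here is a small sketch of the arithmetic. The helper names `heap_sizes` and `percentile_index` are hypothetical. Under size(A) = floor(n * 0.75), heap B is never empty for n > 0, and the sorted index n * 3 / 4 is exactly size(A), which is the position of min(B).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: heap sizes under the rule size(A) = floor(n * 0.75).
// Returns {size(A), size(B)}.
std::pair<std::size_t, std::size_t> heap_sizes(std::size_t n) {
  const std::size_t size_a = n * 3 / 4;  // integer division floors the result
  return {size_a, n - size_a};
}

// Hypothetical helper: sorted index of the 75th percentile for n > 0.
// It equals size(A), so the element there is the first (smallest) one in B.
std::size_t percentile_index(std::size_t n) {
  assert(n > 0);
  return heap_sizes(n).first;
}
```

For example, n = 5 gives size(A) = 3 and size(B) = 2, so the percentile is the element at sorted index 3, which is the smaller of the two elements in B.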
