Fast algorithm for repeated calculation of percentile?

Problem description

In an algorithm I have to calculate the 75th percentile of a data set whenever I add a value. Right now I am doing this:

  1. Get value x
  2. Append x to the back of an already sorted array
  3. Swap x down until the array is sorted again
  4. Read the element at position array[array.size * 3/4]

Point 3 is O(n), and the rest is O(1), but this is still quite slow, especially if the array gets larger. Is there any way to optimize this?
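
For comparison, the baseline described above might look like the sketch below. It is not the asker's code: the class name `SortedPercentile` is hypothetical, the element type is assumed to be `double`, and it uses `std::upper_bound` plus `insert` instead of appending and swapping down. The cost per insertion is still O(n), because `insert` has to shift the elements behind the insertion point.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline sketch: keep a sorted vector and read index size * 3/4.
// SortedPercentile is a hypothetical name used only for illustration.
class SortedPercentile {
public:
  void add(double x) {
    // Binary search finds the slot in O(log n), but insert() still shifts
    // up to n elements, so each add remains O(n) overall.
    auto pos = std::upper_bound(_data.begin(), _data.end(), x);
    _data.insert(pos, x);
  }

  // Reads the element at sorted position size * 3/4 in O(1).
  double get() const {
    assert(!_data.empty());  // requires at least one add()
    return _data[_data.size() * 3 / 4];
  }

private:
  std::vector<double> _data;  // always kept in ascending order
};
```

Binary search only speeds up finding the position; the shifting is what dominates for large arrays.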

Update

Thanks Nikita! Since I am using C++, this is the easiest solution to implement. Here is the code:

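#include <algorithm>   // std::push_heap, std::pop_heap
#include <functional>  // std::less, std::greater
#include <vector>

template<class T>
class IterativePercentile {
public:
  /// Percentile has to be in range [0, 1)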
  IterativePercentile(double percentile)
    : _percentile(percentile)
  { }

  // Adds a number in O(log(n))
  void add(const T& x) {
    if (_lower.empty() || x <= _lower.front()) {
      _lower.push_back(x);
      std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
    } else {
      _upper.push_back(x);
      std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
    }

    unsigned size_lower = (unsigned)((_lower.size() + _upper.size()) * _percentile) + 1;
    if (_lower.size() > size_lower) {
      // lower to upper
      std::pop_heap(_lower.begin(), _lower.end(), std::less<T>());
      _upper.push_back(_lower.back());
      std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
      _lower.pop_back();
    } else if (_lower.size() < size_lower) {
      // upper to lower
      std::pop_heap(_upper.begin(), _upper.end(), std::greater<T>());
      _lower.push_back(_upper.back());
      std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
      _upper.pop_back();
    }
  }

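  /// Access the percentile in O(1).
  /// Precondition: add() has been called at least once;
  /// front() on an empty vector is undefined behavior.
  const T& get() const {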
    return _lower.front();
  }

  void clear() {
    _lower.clear();
    _upper.clear();
  }

private:
  double _percentile;
  std::vector<T> _lower;
  std::vector<T> _upper;
};

Recommended answer

You can do it with two heaps. I'm not sure whether there is a less contrived solution, but this one gives O(log n) time per insertion, and heaps are included in the standard libraries of most programming languages.

The first heap (heap A) contains the smallest 75% of the elements; the other heap (heap B) contains the rest (the largest 25%). Heap A keeps its largest element on top, and heap B keeps its smallest element on top.

  1. Adding an element.

See if the new element x is <= max(A). If it is, add it to heap A; otherwise, add it to heap B.
Now, if we added x to heap A and it became too big (it holds more than 75% of the elements), we need to remove the largest element from A (O(log n)) and add it to heap B (also O(log n)).
The case where heap B becomes too big is handled the same way.

  2. Finding the "0.75 median".

Just take the largest element from A (or the smallest from B). This requires O(log n) or O(1) time, depending on the heap implementation.
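
The two steps above could be sketched with `std::priority_queue` as follows. This is only an illustration under stated assumptions: the class name `TwoHeap75` is hypothetical, the element type is assumed to be `double`, the percentile is fixed at 75%, and heap A is sized to hold floor(n * 3/4) + 1 elements so that the answer is max(A).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Sketch of the two-heap scheme; TwoHeap75 is a hypothetical name.
// Invariant after every add: A holds the floor(n * 3/4) + 1 smallest values,
// so max(A) is the element at sorted position n * 3/4.
class TwoHeap75 {
public:
  void add(double x) {
    // Step 1: route x to A if it is <= max(A), otherwise to B.
    if (_a.empty() || x <= _a.top()) {
      _a.push(x);
    } else {
      _b.push(x);
    }

    // Rebalance: at most one element has to move per insertion.
    const std::size_t n = _a.size() + _b.size();
    const std::size_t target = n * 3 / 4 + 1;
    if (_a.size() > target) {         // A too big: move max(A) to B
      _b.push(_a.top());
      _a.pop();
    } else if (_a.size() < target) {  // B too big: move min(B) to A
      _a.push(_b.top());
      _b.pop();
    }
  }

  // Step 2: the 75th percentile is the top of A, read in O(1).
  double get() const {
    assert(!_a.empty());  // requires at least one add()
    return _a.top();
  }

private:
  std::priority_queue<double> _a;  // max-heap: the smaller part
  std::priority_queue<double, std::vector<double>,
                      std::greater<double>> _b;  // min-heap: the larger part
};
```

Using integer arithmetic (`n * 3 / 4`) for the target size avoids floating-point rounding, at the cost of hard-coding the 75% ratio.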

Edit

As Dolphin noted, we need to specify precisely how big each heap should be for every n (if we want a precise answer). For example, if size(A) = floor(n * 0.75) and size(B) is the rest, then, for every n > 0, array[array.size * 3/4] = min(B).
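
To make the sizing rule concrete, here is a small sketch of the arithmetic. The helper `heap_sizes` is hypothetical and assumes the 75% split is computed with integer division. Since A holds sorted positions 0 through size(A) - 1, the smallest element of B sits at sorted position size(A) = n * 3/4, which is the index read in the question.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper: heap sizes for n elements when A holds floor(n * 3/4).
// Returns {size(A), size(B)}.
std::pair<std::size_t, std::size_t> heap_sizes(std::size_t n) {
  const std::size_t a = n * 3 / 4;  // size(A) = floor(0.75 * n)
  return {a, n - a};                // size(B) is the rest
}
```

For n = 8 this gives size(A) = 6 and size(B) = 2, so min(B) is the element at index 6. For n = 1 heap A is empty and the single element is in B, which is why B is never empty for n > 0.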
