"在线" (迭代器)算法,估算统计位数,众数,偏度,峰度? [英] "On-line" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis?

查看:357
本文介绍了"在线" (迭代器)算法,估算统计位数,众数,偏度,峰度?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有一个算法来估算位数,众数,偏度,和/组值的或峰度,但这并不需要存储在内存中的所有值一次?

Is there an algorithm to estimate the median, mode, skewness, and/or kurtosis of set of values, but that does NOT require storing all the values in memory at once?

我想计算的基本统计信息:

I'd like to calculate the basic statistics:

  • 的意思是:算术平均值
  • 变化:从的均值偏差的平方
  • 标准差:方差的平方根
  • 位数:值从较小的半
  • 分隔更大数目的一半
  • 模式:在设置中最常见的值
  • 偏度:TL;博士
  • 峰度:TL;博士
  • mean: arithmetic average
  • variance: average of squared deviations from the mean
  • standard deviation: square root of the variance
  • median: value that separates larger half of the numbers from the smaller half
  • mode: most frequent value found in the set
  • skewness: tl; dr
  • kurtosis: tl; dr

基本公式计算任何这些是等级学校的算术,我也认识他们。有迹象表明,实现了他们很多的统计库,也是如此。

The basic formulas for calculating any of these is grade-school arithmetic, and I do know them. There are many stats libraries that implement them, as well.

我的问题是值的,我处理组的大量(十亿):使用Python,我不能只是做一个列表或哈希与数十亿的元素。即使我用C写了这个,数十亿单元阵列不太现实。

My problem is the large number (billions) of values in the sets I'm handling: Working in Python, I can't just make a list or hash with billions of elements. Even if I wrote this in C, billion-element arrays aren't too practical.

中的数据进行排序。它产生随机上即时,由其他进程。各组的大小是高度可变的,以及大小不会提前知道。

The data is not sorted. It's produced randomly, on-the-fly, by other processes. The size of each set is highly variable, and the sizes will not be known in advance.

我已经想通了如何处理的均值和方差pretty的同时,通过设置任意顺序每个值进行迭代。 (其实,在我的情况,我把他们在他们正在生成的顺序。)下面是我使用的算法,礼貌<一href="http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm">http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm:

I've already figured out how to handle the mean and variance pretty well, iterating through each value in the set in any order. (Actually, in my case, I take them in the order in which they're generated.) Here's the algorithm I'm using, courtesy http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm:

  • 在初始化三个变量:计数,总和,并sum_of_squares
  • 对于每一个值:
    • 增计数。
    • 添加值来概括。
    • 添加值的平方sum_of_squares。
    • Initialize three variables: count, sum, and sum_of_squares
    • For each value:
      • Increment count.
      • Add the value to sum.
      • Add the square of the value to sum_of_squares.

      本上线算法的弱点(如精度问题sum_of_squares快速增长不是整数的范围或浮动precision大),但它基本上给我我需要什么,而不必存储在每个每个值集。

      This "on-line" algorithm has weaknesses (e.g., accuracy problems as sum_of_squares quickly grows larger than integer range or float precision), but it basically gives me what I need, without having to store every value in each set.

      但我不知道类似的技术是否用于估计其他统计量(中位数,众偏度,峰度)存在。我可以忍受一个偏估计,甚至是一种方法,妥协的准确性在一定程度上,只要处理的N值所需要的内存比O(N)。

      But I don't know whether similar techniques exist for estimating the additional statistics (median, mode, skewness, kurtosis). I could live with a biased estimator, or even a method that compromises accuracy to a certain degree, as long as the memory required to process N values is substantially less than O(N).

      指点我到现有的统计库将帮助也一样,如果库中有函数来计算一个上线,这些或多个操作。

      Pointing me to an existing stats library will help, too, if the library has functions to calculate one or more of these operations "on-line".

      推荐答案

      偏度和峰度

      有关上线算法偏度和峰度(沿方差线),看到在同一个Wiki页面<一href="http://en.wikipedia.org/wiki/Algorithms%5Ffor%5Fcalculating%5Fvariance#Higher-order_statistics">here并行算法更高时刻的统计数据。

      For the on-line algorithms for Skewness and Kurtosis (along the lines of the variance), see in the same wiki page here the parallel algorithms for higher-moment statistics.

      中位数

      中位数是没有排序的数据坚韧。如果你知道,你有多少个数据点,理论上你只需要部分地排序,如通过使用选择算法。然而,这并不能帮助太多数十亿值。我会建议使用频率计数,请参见下一节。

      Median is tough without sorted data. If you know, how many data points you have, in theory you only have to partially sort, e.g. by using a selection algorithm. However, that doesn't help too much with billions of values. I would suggest using frequency counts, see the next section.

      中位数,众与频率计数

      如果是整数,我也要算 频率的,可能是切断超过一定价值的最高值和最低值的地方我相信它不再是相关的。对于花车(或太多的整数),我可能会创造桶/间隔,然后用同样的方法为整数。 (近似)模式和中位数计算比变得容易,基于频率表。

      If it is integers, I would count frequencies, probably cutting off the highest and lowest values beyond some value where I am sure that it is no longer relevant. For floats (or too many integers), I would probably create buckets / intervals, and then use the same approach as for integers. (Approximate) mode and median calculation than gets easy, based on the frequencies table.

      通常分布的随机变量

      如果它是正态分布的,我会用人口抽样意味着,的variance 峰度作为最大似然估计的一小部分。在(上线)算法来计算,现在这些,你已经。例如。读几十万或百万个数据点,直到你的估计误差变得足够小。只要确保你随机从您所设定的(例如,你不要被采摘的第一个100'000值引入偏差)挑。同样的方法也可以用于估计模式和中位数为正常的情况下(对于两个样品平均值估计器)。

      If it is normally distributed, I would use the population sample mean, variance, skewness, and kurtosis as maximum likelihood estimators for a small subset. The (on-line) algorithms to calculate those, you already now. E.g. read in a couple of hundred thousand or million datapoints, until your estimation error gets small enough. Just make sure that you pick randomly from your set (e.g. that you don't introduce a bias by picking the first 100'000 values). The same approach can also be used for estimating mode and median for the normal case (for both the sample mean is an estimator).

      进一步意见

      所有上述可以并行运行算法(包括许多排序和选择算法,如快速排序和QuickSelect),如果这有助于

      All the algorithms above can be run in parallel (including many sorting and selection algorithm, e.g. QuickSort and QuickSelect), if this helps.

      我一直认为(除正态分布的部分),我们所讲的样本矩,中位数和模式,而不是估计的给定一个已知分布理论的时刻。

      I have always assumed (with the exception of the section on the normal distribution) that we talk about sample moments, median, and mode, not estimators for theoretical moments given a known distribution.

      在一般情况下,采样数据(即只在看的一个子集)应该是pretty的成功给定的数据量,只要所有观察都是相同的随机变量的实现中(具有相同的分布)和时刻,模式和中位数实际上是对这种分布存在。最后需要注意的是不是无害的。例如,平均(和所有高阶矩)为柯西分布的不存在。在这种情况下,小的子集的样本均值可能从整个样本的样本均值是大规模脱

      In general, sampling the data (i.e. only looking at a sub-set) should be pretty successful given the amount of data, as long as all observations are realizations of the same random variable (have the same distributions) and the moments, mode and median actually exist for this distribution. The last caveat is not innocuous. For example, the mean (and all higher moments) for the Cauchy Distribution do not exist. In this case, the sample mean of a "small" sub-set might be massively off from the sample mean of the whole sample.

      这篇关于&QUOT;在线&QUOT; (迭代器)算法,估算统计位数,众数,偏度,峰度?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆