"On-line" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis?

Question

Is there an algorithm to estimate the median, mode, skewness, and/or kurtosis of a set of values that does NOT require storing all the values in memory at once?

I'd like to calculate the basic statistics:

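  • mean: arithmetic average
  • variance: average of the squared deviations from the mean
  • standard deviation: square root of the variance
  • median: value that separates the larger half of the numbers from the smaller half
  • mode: most frequent value found in the set
  • skewness: tl;dr
  • kurtosis: tl;dr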

The basic formulas for calculating any of these are grade-school arithmetic, and I do know them. There are many stats libraries that implement them, as well.

My problem is the large number (billions) of values in the sets I'm handling: working in Python, I can't just make a list or hash with billions of elements. Even if I wrote this in C, billion-element arrays aren't too practical.

The data is not sorted. It's produced randomly, on the fly, by other processes. The size of each set is highly variable, and the sizes will not be known in advance.

I've already figured out how to handle the mean and variance pretty well, iterating through each value in the set in any order. (Actually, in my case, I take them in the order in which they're generated.) Here's the algorithm I'm using, courtesy of http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm:

  • Initialize three variables: count, sum, and sum_of_squares
  • For each value:
    • Increment count.
    • Add the value to sum.
    • Add the square of the value to sum_of_squares.
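The steps above can be sketched in Python roughly as follows. This is only an illustration: the function name `running_mean_variance` is made up here, and the final step (mean of squares minus square of the mean, giving the population variance) is the standard identity rather than something spelled out in the list.

```python
def running_mean_variance(values):
    """Single pass over any iterable; only three numbers are kept in memory."""
    count = 0
    total = 0.0            # the list's `sum` (renamed to avoid shadowing the builtin)
    sum_of_squares = 0.0
    for x in values:
        count += 1
        total += x
        sum_of_squares += x * x
    if count == 0:
        raise ValueError("no values seen")
    mean = total / count
    # Assumed final step: population variance = E[x^2] - (E[x])^2
    variance = sum_of_squares / count - mean * mean
    return mean, variance
```

Because `values` can be a generator, the data never has to exist as a list, e.g. `running_mean_variance(float(line) for line in open(path))` for a hypothetical file `path` with one number per line.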

This "on-line" algorithm has weaknesses (e.g., accuracy problems as sum_of_squares quickly grows beyond integer range or float precision), but it basically gives me what I need, without having to store every value in each set.
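One common way around the accuracy weakness mentioned above is Welford's update, which is described on the same Wikipedia page. It is not what the question uses; the sketch below swaps it in for comparison, and the class name `WelfordStats` is invented for this example. It tracks the running mean and the sum of squared deviations from that mean instead of the raw sum of squares.

```python
class WelfordStats:
    """Running mean and variance via Welford's update (O(1) memory)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the current mean

    def push(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        # Uses the old delta and the updated mean; this keeps m2 non-negative.
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Population variance; NaN if nothing has been pushed yet."""
        if self.count == 0:
            return float("nan")
        return self.m2 / self.count
```

The intermediate quantities stay on the scale of the deviations rather than the squared raw values, which is why this variant tends to lose far less precision when the values are large relative to their spread.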

But I don't know whether similar techniques exist for estimating the additional statistics (median, mode, skewness, kurtosis). I could live with a biased estimator, or even a method that compromises accuracy to a certain degree, as long as the memory required to process N values is substantially less than O(N).

Pointing me to an existing stats library would help, too, if the library has functions to calculate one or more of these statistics "on-line".

Recommended answer

Skewness and Kurtosis

For on-line algorithms for skewness and kurtosis (along the lines of the variance algorithm), see the parallel algorithms for higher-moment statistics on the same wiki page.
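As a rough sketch of what such an on-line algorithm looks like, the class below keeps the running mean plus the second, third, and fourth central moment sums and updates them one value at a time, following the single-pass update formulas given on that Wikipedia page. The class and method names are invented for this example, and the results are the plain (biased) sample skewness and excess kurtosis.

```python
import math


class RunningMoments:
    """One-pass mean, variance, skewness, and excess kurtosis (O(1) memory)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of (x - mean)^2
        self.m3 = 0.0   # sum of (x - mean)^3
        self.m4 = 0.0   # sum of (x - mean)^4

    def push(self, x):
        n1 = self.n
        self.n += 1
        n = self.n
        delta = x - self.mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # Order matters: m4 needs the old m3 and m2, m3 needs the old m2.
        self.m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
                    + 6 * delta_n2 * self.m2
                    - 4 * delta_n * self.m3)
        self.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self.m2
        self.m2 += term1

    def skewness(self):
        if self.m2 == 0:
            return float("nan")
        return math.sqrt(self.n) * self.m3 / self.m2 ** 1.5

    def excess_kurtosis(self):
        if self.m2 == 0:
            return float("nan")
        return self.n * self.m4 / (self.m2 * self.m2) - 3.0
```

Each `push` costs a constant number of arithmetic operations, so billions of values can be streamed through without storing any of them.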

Median

The median is tough without sorted data. If you know how many data points you have, in theory you only have to sort partially, e.g. by using a selection algorithm. However, that doesn't help much with billions of values. I would suggest using frequency counts; see the next section.

Median and Mode with Frequency Counts

If the values are integers, I would count frequencies, probably cutting off the highest and lowest values beyond some threshold where I am sure they are no longer relevant. For floats (or too many distinct integers), I would probably create buckets / intervals, and then use the same approach as for integers. (Approximate) mode and median calculation then gets easy, based on the frequency table.
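A minimal sketch of the bucket idea, assuming fixed-width buckets whose width the caller picks; the function names are hypothetical, and the tail cut-off mentioned above is left out for brevity. Memory grows with the number of occupied buckets, not with the number of values.

```python
import math
from collections import Counter


def bucket_counts(values, width):
    """Count how many values fall into each bucket [i*width, (i+1)*width)."""
    counts = Counter()
    for x in values:
        counts[math.floor(x / width)] += 1
    return counts


def approx_mode(counts, width):
    """Midpoint of the fullest bucket."""
    idx = max(counts, key=counts.get)
    return (idx + 0.5) * width


def approx_median(counts, width):
    """Walk the buckets in order until half of the values have been seen."""
    half = sum(counts.values()) / 2
    seen = 0
    for idx in sorted(counts):
        c = counts[idx]
        if seen + c >= half:
            # Linear interpolation inside the bucket that contains the median.
            return (idx + (half - seen) / c) * width
        seen += c
    raise ValueError("no values counted")
```

The error of both estimates is bounded by the bucket width, so the width is the knob that trades memory for accuracy. For integer data, `width=1` gives exact counts per value, although the functions still report bucket midpoints and interpolated positions rather than the raw integers.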

Normally Distributed Random Variables

If the data is normally distributed, I would use the sample mean, variance, skewness, and kurtosis of a small subset as maximum likelihood estimators for the population. You already know the (on-line) algorithms to calculate those. E.g., read in a couple of hundred thousand or a million data points, until your estimation error gets small enough. Just make sure that you pick randomly from your set (e.g., that you don't introduce a bias by picking the first 100,000 values). The same approach can also be used for estimating the mode and median in the normal case (the sample mean is an estimator for both).
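Since the size of each set is not known in advance, one way to pick a uniformly random subset in a single pass is reservoir sampling (Algorithm R). The answer does not name this technique; it is offered here only as one possible way to do the random selection, and the function name and the `rng` parameter are choices made for this sketch.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)           # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                sample[j] = x          # keep x with probability k / (i + 1)
    return sample
```

The resulting sample fits in memory, so the usual library functions for mean, median, skewness, and so on can then be applied to it directly.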

Further comments

All the algorithms above can be run in parallel (including many sorting and selection algorithms, e.g. QuickSort and QuickSelect), if this helps.
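For the mean and variance, running in parallel can be as simple as summarizing each chunk independently and then merging the summaries with the pairwise combination formula from the Wikipedia page cited in the question. The sketch below assumes each worker returns a `(count, mean, m2)` tuple, where `m2` is the sum of squared deviations from the chunk mean; the function names are made up for this example.

```python
def summarize(chunk):
    """One-pass (count, mean, m2) summary of a single chunk."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in chunk:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2


def merge(a, b):
    """Combine two summaries as if their chunks had been processed together."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2
```

The population variance of the combined data is `m2 / n`. Frequency tables merge even more simply, since per-bucket counts from different workers just add up.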

I have assumed throughout (with the exception of the section on the normal distribution) that we are talking about sample moments, median, and mode, not estimators for theoretical moments given a known distribution.

In general, sampling the data (i.e. only looking at a subset) should be pretty successful given the amount of data, as long as all observations are realizations of the same random variable (have the same distribution) and the moments, mode, and median actually exist for this distribution. The last caveat is not innocuous. For example, the mean (and all higher moments) of the Cauchy distribution do not exist. In this case, the sample mean of a "small" subset might be massively off from the sample mean of the whole sample.
