Calculate quantiles for large data


Problem description

I have about 300 files, each containing 1000 time series realisations (~76 MB per file).

I want to calculate the quantiles (0.05, 0.50, 0.95) at each time step from the full set of 300,000 realisations.

I cannot merge the realisations into one file because it would become too large.

What is the most efficient way to do this?

Each matrix is generated by running a model, but here is a sample containing random numbers:

x <- matrix(rexp(10000000, rate=.1), nrow=1000)
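
For a single file the per-time-step computation is straightforward; a minimal sketch, assuming rows are realisations and columns are time steps (as in the sample above):

# Quantiles (0.05, 0.50, 0.95) for each time step (column) of one file's matrix
q_one_file <- apply(x, 2, quantile, probs = c(0.05, 0.50, 0.95))
str(q_one_file)  # 3 x 10000 matrix: one row per quantile, one column per time step

The difficulty is doing this across all 300 files without loading everything into memory at once.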

Recommended answer

There are at least three options:

  1. Are you sure it has to be from the full set? A 10% sample should be a very, very good approximation here.
  2. 300k elements isn't that big a vector, but a 300k x 100+ column matrix is big. Pull just the column you need into memory rather than the entire matrix (this can be repeated over every column if necessary; see the sketch after this list).
  3. Do it sequentially, possibly in conjunction with a smaller sample to get you started in the right ballpark. For the 5th percentile, you just need to know how many items are above the current guess and how many are below. So something like:
  1. Take a 1% sample and find its 5th percentile. Add some tolerance above and below, so that you can be sure the exact 5th percentile lies within that range.
  2. Read the matrix in chunks. For each chunk, count the number of observations above the range and below the range, and keep all observations that fall within the range.
  3. Once the last chunk has been read, you have three pieces of information (count above, count below, vector of observations within the range). One way to take a quantile is to sort the whole vector and find the nth observation, and you can do that with this information: sort the within-range observations and find the (n - count_below)th.
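
A minimal sketch of option 2, reading one file at a time and keeping only a single time step. The file names and the use of readRDS() are assumptions; adapt them to however the realisations are actually stored.

# Option 2 sketch: quantiles for one time step, pulling only that column
# from each of the 300 files (file names and readRDS() are assumed here).
files <- sprintf("realisations_%03d.rds", 1:300)  # hypothetical file names
time_step <- 42                                   # the time step of interest

column_values <- unlist(lapply(files, function(f) {
  m <- readRDS(f)     # read one 1000 x n_steps matrix at a time
  m[, time_step]      # keep only the column for this time step
}))                    # length 300 * 1000 = 300,000

quantile(column_values, probs = c(0.05, 0.50, 0.95))

Repeating this for every time step re-reads every file once per step; in practice you would read each file once and accumulate several columns at a time.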

Edit: An example of (3).

Note that I am not a champion algorithm designer and that someone has almost certainly designed a better algorithm for this. Also, this implementation is not particularly efficient. If speed matters to you, consider Rcpp, or even just more optimized R for this. Making a bunch of lists and then extracting values from them is not so smart, but it was easy to prototype this way so I went with it.

library(plyr)

set.seed(1)

# -- Configuration -- #
desiredQuantile <- .25

# -- Generate sample data -- #

# Use some algorithm (sampling, iteration, or something else) to come up with a range you're sure the true value lies within
guessedrange <- c( .2, .3 )
# Group the observations to correspond to the OP's files
dat <- data.frame( group = rep( seq(100), each=100 ), value = runif(10000) )

# -- Apply the algorithm -- #

# Count the number above/below and return the values within the range, by group
res <- dlply( dat, .( group ), function( x, guessedrange ) {
  above <- x$value > guessedrange[2]
  below <- x$value < guessedrange[1]
  list(
    aboveCount  = sum( above ),
    belowCount = sum( below ),
    withinValues = x$value[ !above & !below ]
  )
}, guessedrange = guessedrange )
# Extract the count of values below and the values within the range
belowCount <- sum( sapply( res, function(x) x$belowCount ) )
belowCount
withinValues <- do.call( c, sapply( res, function(x) x$withinValues ) )
str(withinValues)
# Count up until we find the within value we want
desiredQuantileCount <- floor( desiredQuantile * nrow(dat) ) #! Should fix this so it averages when there's a tie
sort(withinValues)[ desiredQuantileCount - belowCount + 1 ]
# Compare to exact value
quantile( dat$value, desiredQuantile )

In the end, the value is a little off from the exact version. I suspect I'm shifted over by one or some equally silly explanation, but maybe I'm missing something fundamental.
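
One plausible contributor to the small discrepancy: by default quantile() interpolates between order statistics (type = 7), whereas the counting approach above returns a single observation, so a closer comparison is against type = 1:

# quantile() interpolates between order statistics by default (type = 7);
# type = 1 returns a single order statistic, like the counting approach does.
quantile( dat$value, desiredQuantile, type = 1 )
sort( dat$value )[ ceiling( desiredQuantile * nrow(dat) ) ]   # the same order statistic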
