Speeding up MATLAB code for FDR estimation
Problem description
I have 2 input variables:
- a vector of p-values (p) with N elements (unsorted)
- and an N x M matrix of p-values obtained from random permutations (pr) over M iterations. N is quite large, 10K to 100K or more; M is, say, 100.
I'm estimating the False Discovery Rate (FDR) for each element of p: if the current p-value (from p) is used as the threshold, how many p-values from the random permutations would pass it, relative to the number of observed p-values that pass.
I wrote the function with ARRAYFUN, but it takes a lot of time for large N (2 min for N=20K), comparable to a for-loop.
function pfdr = fdr_from_random_permutations(p, pr)
%# ... skipping arguments checks
pfdr = arrayfun( @(x) mean(sum(pr<=x))./sum(p<=x), p);
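Written out, my reading of the one-liner above is the following estimate, evaluated at each threshold x taken from p. The symbols pr_{ij} (entry i of permutation j) and p_i are notation introduced here only for illustration:

```latex
\[
\widehat{\mathrm{FDR}}(x)
  = \frac{\dfrac{1}{M}\sum_{j=1}^{M} \#\{\, i : pr_{ij} \le x \,\}}
         {\#\{\, i : p_i \le x \,\}},
  \qquad x \in \{p_1, \dots, p_N\}.
\]
```

The numerator is the average number of permuted p-values passing the threshold per permutation, and the denominator is the number of observed p-values passing it.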
Any ideas how to make it faster?
Comments about statistical issues here are also welcome.
The test data can be generated as p = rand(N,1); pr = rand(N,M);
Well, the trick was indeed sorting the vectors; credit to @EgonGeerardyn for that. Also, there is no need to use mean: you can just divide everything by M afterwards. When p is sorted, the number of values less than or equal to the current x is just a running index. pr is the more interesting case: I used a running index called place to count how many of its elements are less than or equal to x.
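As a small check of the two claims above, here is a sketch on made-up numbers (the values of pr and x are invented purely for illustration). It shows that averaging the per-column counts equals the pooled count divided by M, and that on the sorted pooled vector the count is simply the position of the last element not exceeding x.

```matlab
%# Tiny made-up example: N = 3 observations, M = 2 permutations
pr = [0.1 0.7; 0.4 0.2; 0.9 0.3];
x  = 0.35;
M  = size(pr, 2);

a = mean(sum(pr <= x));        %# per-column counts are [1 2], their mean is 1.5
b = sum(pr(:) <= x) / M;       %# pooled count is 3, divided by M gives 1.5

%# On the sorted pooled values the same count is a running index
prs = sort(pr(:));             %# [0.1 0.2 0.3 0.4 0.7 0.9]'
place = 1;
while place <= numel(prs) && prs(place) <= x
    place = place + 1;
end
c = (place - 1) / M;           %# place stops at 4, so the count is 3
```

Because the thresholds are visited in increasing order, place never has to move backwards, which is why a single sweep over the sorted data is enough.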
Edit(2): Here is the fastest version I came up with:
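function Speedup2()
    N = 10000/4;
    M = 100/4;
    p = rand(N,1); pr = rand(N,M);

    %# Reference: the original ARRAYFUN version
    tic
    pfdr = arrayfun( @(x) mean(sum(pr<=x))./sum(p<=x), p);
    toc

    %# Fast version: sort once, then sweep both sorted arrays together
    tic
    out = zeros(numel(p),1);
    [p,sortIndex] = sort(p);
    pr = sort(pr(:));
    pr(end+1) = Inf;              %# sentinel so the while loop always stops
    place = 1;
    for i = 1:numel(p)
        x = p(i);
        while pr(place) <= x
            place = place+1;
        end
        exp1a = place-1;          %# number of permuted p-values <= x
        exp2 = i;                 %# number of observed p-values <= x (assumes no ties in p)
        out(i) = exp1a/exp2;
    end
    out(sortIndex) = out/M;       %# divide by M (replaces mean) and restore the input order
    toc

    disp(max(abs(pfdr-out)));     %# largest deviation from the reference result
end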
And the benchmark results for N = 10000/4 ; M = 100/4:
Elapsed time is 0.898689 seconds.
Elapsed time is 0.007697 seconds.
2.220446049250313e-016
and for N = 10000 ; M = 100:
Elapsed time is 39.730695 seconds.
Elapsed time is 0.088870 seconds.
2.220446049250313e-016
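The sweep above counts observed p-values with the loop index, which matches sum(p<=x) only when p has no ties. Below is a loop-free sketch of the same idea that also handles ties. The function name fdr_sorted is hypothetical, it relies on MATLAB's sort being stable, and it has not been benchmarked against the loop version.

```matlab
function pfdr = fdr_sorted(p, pr)
%# FDR_SORTED  Hypothetical loop-free variant of the sorted sweep (a sketch).
%# Returns a column vector with one estimate per element of p.
    M = size(pr, 2);
    n = numel(p);
    [ps, sortIndex] = sort(p(:));      %# observed p-values, ascending
    prs = sort(pr(:));                 %# all permuted p-values, pooled and sorted

    %# Merge both sorted lists. sort is stable, so for equal values the
    %# permuted entries (listed first) stay ahead of the observed ones.
    [~, order] = sort([prs; ps]);
    pos = find(order > numel(prs));    %# merged positions of ps(1), ..., ps(n)
    countPr = pos - (1:n)';            %# permuted p-values <= ps(k)

    %# Observed p-values <= ps(k): index of the last entry equal to ps(k),
    %# which stays correct when p contains ties.
    [~, ~, grp] = unique(ps);
    lastIdx = accumarray(grp, (1:n)', [], @max);
    countP = lastIdx(grp);

    pfdr = zeros(n, 1);
    pfdr(sortIndex) = countPr ./ countP / M;   %# restore the input order
end
```

Saved as fdr_sorted.m, it would be called as pfdr = fdr_sorted(p, pr); with the same inputs as the original function.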