What is the fastest way to count elements in an array?


Problem description


In my models, one of the most frequently repeated tasks is counting the occurrences of each element in an array. The elements come from a closed set, so I know there are X types of elements, and all or some of them populate the array, along with zeros that represent 'empty' cells. The array is not sorted in any way and can be quite long (about 1M elements), and this task is done thousands of times during one simulation (which is itself one of hundreds of simulations). The result should be a vector r of size X, such that r(k) is the number of occurrences of k in the array.

Example:

For X = 9, if I have the following input vector:

v = [0 7 8 3 0 4 4 5 3 4 4 8 3 0 6 8 5 5 0 3]

I would like to get this result:

r = [0 0 4 4 3 1 1 3 0]

Note that I don't want the count of zeros, and that elements that don't appear in the array (like 2) have a 0 in the corresponding position of the result vector (r(2) == 0).
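As a quick sanity check of the example above, the expected result can be reproduced with a plain loop. This is only a minimal sketch of the specification (zeros ignored, absent values counted as 0), not a suggestion for the fastest method:

```matlab
X = 9;                                         % number of element types
v = [0 7 8 3 0 4 4 5 3 4 4 8 3 0 6 8 5 5 0 3];
r = zeros(1, X);                               % one slot per element type
for k = 1:X
    r(k) = sum(v == k);                        % zeros in v are never counted
end
disp(r)
```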

What would be the fastest way to achieve this goal?

Solution

tl;dr: The fastest method depends on the size of the array. For arrays smaller than 2^14 elements, method 3 below (accumarray) is faster. For larger arrays, method 2 below (histcounts) is better.


Let's look at the available methods for this task. For the following examples we will assume X has n elements, from 1 to n, and our array of interest is M, a column array that can vary in size. Our result vector will be spp (see footnote 1), such that spp(k) is the number of occurrences of k in M. Although I write here about X, it is never implemented explicitly in the code below; I just define n = 500, and X is implicitly 1:500.

The naive for loop

The simplest and most straightforward way to handle this task is a for loop that iterates over the elements of X and counts the elements of M that are equal to each one:

function spp = loop(M,n)
spp = zeros(n,1);
for k = 1:size(spp,1);
    spp(k) = sum(M==k); 
end
end

This is of course not so smart, especially if only a small subset of the elements of X populates M, so we are better off first finding those that actually appear in M:

function spp = uloop(M,n)
u = unique(M); % finds which elements to count
spp = zeros(n,1);
for k = u(u>0).';
    spp(k) = sum(M==k); 
end
end


Usually in MATLAB it is advisable to take advantage of built-in functions as much as possible, since most of the time they are much faster. I thought of 5 options:

1. The function tabulate

The function tabulate returns a very convenient frequency table that at first sight seems to be the perfect solution for this task:

function tab = tabi(M)
tab = tabulate(M);
if tab(1)==0
    tab(1,:) = [];
end
end

The only fix to be done is to remove the first row of the table if it counts the 0 element (it could be that there are no zeros in M).

2. The function histcounts

Another option that can be tweaked quite easily to our needs is histcounts:

function spp = histci(M,n)
spp = histcounts(M,1:n+1);
end

Here, in order to count each element between 1 and n separately, we define the edges to be 1:n+1, so every element in X has its own bin. We could also write histcounts(M(M>0),'BinMethod','integers'), but I already tested it, and it takes more time (though it makes the function independent of n).
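To make the binning concrete, here is a small sketch that applies both variants to the example vector from the question. The variable names are illustrative only. Note that with 'BinMethod','integers' the bins only span the range of values actually present, so the output length can differ from n:

```matlab
n = 9;
v = [0 7 8 3 0 4 4 5 3 4 4 8 3 0 6 8 5 5 0 3];

% Explicit edges 1:n+1 give one bin per value 1..n;
% zeros lie below the first edge and are simply not counted.
counts = histcounts(v, 1:n+1);

% Integer binning does not need n, but its bins cover only the data range,
% so edges_int has to be inspected to map each count back to a value.
[counts_int, edges_int] = histcounts(v(v > 0), 'BinMethod', 'integers');

disp(counts)
```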

3. The function accumarray

The next option I'll bring here is the use of the function accumarray:

function spp = accumi(M)
spp = accumarray(M(M>0),1);
end

Here we give the function M(M>0) as input, to skip the zeros, and use 1 as the vals input, so that every occurrence adds 1 to the count of its element.
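One detail worth keeping in mind: without a size argument, accumarray returns a vector whose length is the largest value present in M, which may be shorter than n. The sketch below (with a small hypothetical M) shows how the optional third argument of accumarray can be used to always get an n-by-1 result; it is a possible variation, not the version that was timed in this answer:

```matlab
n = 9;
M = [3; 0; 4; 4; 1];                          % hypothetical column vector, max value is 4

short = accumarray(M(M > 0), 1);              % length max(M) = 4
full  = accumarray(M(M > 0), 1, [n 1]);       % padded with zeros up to length n

disp(numel(short))
disp(numel(full))
```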

4. The function bsxfun

We can even use the binary operation @eq (i.e. ==) to look for all elements of each type:

function spp = bsxi(M,n)
spp = bsxfun(@eq,M,1:n);
spp = sum(spp,1);
end

If we keep the first input M and the second input 1:n in different dimensions, so that one is a column vector and the other is a row vector, then the function compares each element in M with each element in 1:n and creates a length(M)-by-n logical matrix that we can sum over its rows to get the desired result.
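As a side note, since MATLAB R2016b the same comparison can be written with implicit expansion, without calling bsxfun. The following sketch, using a small hypothetical M, checks that the two forms agree; the implicit form was not part of the timing test in this answer:

```matlab
M = [3; 0; 4; 4; 1];                          % hypothetical column vector
n = 5;

spp_bsx = sum(bsxfun(@eq, M, 1:n), 1);        % explicit singleton expansion
spp_imp = sum(M == 1:n, 1);                   % implicit expansion (R2016b and later)

disp(isequal(spp_bsx, spp_imp))
```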

5. The function ndgrid

Another option, similar to the bsxfun, is to explicitly create the two matrices of all possibilities using the ndgrid function:

function spp = gridi(M,n)
[Mx,nx] = ndgrid(M,1:n);
spp = sum(Mx==nx);
end

Then we compare them and sum over columns to get the final result.
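Both of the last two methods build full length(M)-by-n matrices, so their memory use grows quickly. A rough back-of-the-envelope estimate, assuming M is stored as double (as returned by randi), 1 byte per logical and 8 bytes per double, for an array of 2^19 elements and n = 500:

```matlab
N = 2^19;                                     % array length
n = 500;                                      % number of element types

cells         = N * n;                        % entries in one length(M)-by-n matrix
bytes_logical = cells * 1;                    % logical matrix from the == comparison
bytes_ndgrid  = 2 * cells * 8;                % the two double matrices Mx and nx

fprintf('comparison matrix: %.0f MiB\n', bytes_logical / 2^20);
fprintf('ndgrid matrices:   %.1f GiB\n', bytes_ndgrid / 2^30);
```

Under these assumptions the ndgrid version needs about 16 times more memory for its inputs than the logical matrix alone, before the comparison result is even created.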

Benchmarking

I have done a little test to find the fastest of all the methods mentioned above. I defined n = 500 for all trials. For some methods (especially the naive for loop) n has a great impact on the execution time, but this is not the issue here, since we want to test the methods for a given n.

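Here are the results, plotted as log10 of the execution time (ms) against log2 of the array size, with one curve per method: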

We can notice several things:

  1. Interestingly, there is a shift in the fastest method. For arrays smaller than 2^14, accumarray is the fastest. For arrays larger than 2^14, histcounts is the fastest.
  2. As expected, the naive for loops, in both versions, are the slowest, but for arrays smaller than 2^8 the "unique & for" option is the slower of the two. ndgrid becomes the slowest for arrays bigger than 2^11, probably because of the need to store very large matrices in memory.
  3. There is some irregularity in the way tabulate behaves on arrays smaller than 2^9. This result was consistent (with some variation in the pattern) in all the trials I conducted.

(The bsxfun and ndgrid curves are truncated because larger sizes made my computer hang, and the trend is already quite clear.)

Also, notice that the y-axis is in log10, so a decrease of one unit (as for arrays of size 2^19, between accumarray and histcounts) means a 10-times faster operation.

I'll be glad to hear suggestions in the comments for improving this test, and if you have another, conceptually different method, you are most welcome to suggest it as an answer.

The code

Here are all the functions wrapped in a timing function:

function out = timing_hist(N,n)
M = randi([0 n],N,1);
func_times = {'for','unique & for','tabulate','histcounts','accumarray','bsxfun','ndgrid';
    timeit(@() loop(M,n)),...
    timeit(@() uloop(M,n)),...
    timeit(@() tabi(M)),...
    timeit(@() histci(M,n)),...
    timeit(@() accumi(M)),...
    timeit(@() bsxi(M,n)),...
    timeit(@() gridi(M,n))};
out = cell2mat(func_times(2,:));
end

function spp = loop(M,n)
spp = zeros(n,1);
for k = 1:size(spp,1);
    spp(k) = sum(M==k); 
end
end

function spp = uloop(M,n)
u = unique(M);
spp = zeros(n,1);
for k = u(u>0).';
    spp(k) = sum(M==k); 
end
end

function tab = tabi(M)
tab = tabulate(M);
if tab(1)==0
    tab(1,:) = [];
end
end

function spp = histci(M,n)
spp = histcounts(M,1:n+1);
end

function spp = accumi(M)
spp = accumarray(M(M>0),1);
end

function spp = bsxi(M,n)
spp = bsxfun(@eq,M,1:n);
spp = sum(spp,1);
end

function spp = gridi(M,n)
[Mx,nx] = ndgrid(M,1:n);
spp = sum(Mx==nx);
end
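Before comparing timings it can be worth checking that the methods actually return the same counts. The sketch below does this for four of them on a random input, using inline expressions instead of the functions above so that it runs on its own. It assumes the accumarray call is given an explicit size and that row outputs are transposed, since the timed versions differ in output length and orientation:

```matlab
n = 500;
N = 2^10;
M = randi([0 n], N, 1);                       % random test array, zeros included

ref = zeros(n, 1);                            % reference: the naive loop
for k = 1:n
    ref(k) = sum(M == k);
end

by_accum = accumarray(M(M > 0), 1, [n 1]);    % size argument pads up to n
by_hist  = histcounts(M, 1:n+1).';            % row output, transposed to a column
by_bsx   = sum(bsxfun(@eq, M, 1:n), 1).';     % row output, transposed to a column

all_agree = isequal(ref, by_accum) && isequal(ref, by_hist) && isequal(ref, by_bsx);
disp(all_agree)
```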

And here is the script to run this code and produce the graph:

N = 25; % it is not recommended to run this with N>19 for the `bsxfun` and `ndgrid` functions.
func_times = zeros(N,7); % one column per timed method (timing_hist returns 7 values)
for n = 1:N
    func_times(n,:) = timing_hist(2^n,500);
end
% plotting:
hold on
mark = 'xo*^dsp';
for k = 1:size(func_times,2)
    plot(1:size(func_times,1),log10(func_times(:,k).*1000),['-' mark(k)],...
        'MarkerEdgeColor','k','LineWidth',1.5);
end
hold off
xlabel('Log_2(Array size)','FontSize',16)
ylabel('Log_{10}(Execution time) (ms)','FontSize',16)
legend({'for','unique & for','tabulate','histcounts','accumarray','bsxfun','ndgrid'},...
    'Location','NorthWest','FontSize',14)
grid on


1 The reason for this weird name comes from my field, ecology. My models are cellular automata that typically simulate individual organisms in a virtual space (the M above). The individuals are of different species (hence spp) and all together form what is called an "ecological community". The "state" of the community is given by the number of individuals from each species, which is the spp vector in this answer. In these models, we first define a species pool (X above) for the individuals to be drawn from, and the community state takes into account all species in the species pool, not only those present in M.
