Index from accumarray with max/min

Question

I have a vector and a cell array of strings (with repeats) of the same size. The cell array defines the groups. I want to find the min/max value in the vector for each group.

For example:

value = randperm(5) %# just an example, non-unique in general
value =
     4     1     2     3     5
group = {'a','b','a','c','b'};
[grnum, grname] = grp2idx(group);

I use the ACCUMARRAY function for this:

grvalue = accumarray(grnum,value,[],@max);

So I now have a cell array of unique group names (grname) and a new vector (grvalue) holding the maximum of each group.

grname = 
    'a'
    'b'
    'c'
grvalue =
     4
     5
     3

But I also need the index, in the original vector, of each value that ended up in the new vector.

gridx = 1 5 4

Any ideas? It's not necessary to use accumarray, but I'm looking for a fast vectorized solution.

Recommended answer

The best vectorized answer I can see is:

gridx = arrayfun(@(grix)find((grnum(:)==grix) & (value(:)==grvalue(grix)),1),unique(grnum));

but I cannot call this a "fast" vectorized solution. arrayfun is really useful, but it is generally no faster than a loop.
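As one possible alternative that avoids arrayfun, the indices can be recovered with a single sort. This is a sketch and not part of the original answer. It assumes grnum holds positive integer group numbers as returned by grp2idx, and it has not been timed here. When a group's maximum occurs more than once, it may pick a different occurrence than find(...,1) does.

```matlab
% Small example from the question
value = [4 1 2 3 5];
grnum = [1; 2; 1; 3; 2];            % group numbers for {'a','b','a','c','b'}

% Sort rows by group number first, then by value (ascending)
[~, order] = sortrows([grnum(:), value(:)]);
sortedGroups = grnum(order);

% The last row of each group in the sorted order holds that group's maximum
isLast = [diff(sortedGroups) ~= 0; true];
gridx = order(isLast);              % positions in the original vector
grvalue = value(gridx);             % the per-group maxima themselves
```

For per-group minima, the same idea works with the first row of each group: use [true; diff(sortedGroups) ~= 0] as the mask.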

However, the fastest answer is not always vectorized. If I re-implement the code as you wrote it, but with a larger data set:

nValues = 1000000;
value = floor(rand(nValues,1)*100000);
group = num2cell(char(floor(rand(nValues,1)*4)+'a'));
tic;
[grnum, grname] = grp2idx(group);
grvalue = accumarray(grnum,value,[],@max);
toc;

My computer gives me a tic/toc time of 0.886 seconds. (Note: all tic/toc times are from the second run of a function defined in a file, to avoid one-time pcode generation.)

Adding the "vectorized" (really arrayfun) one-line gridx computation leads to a tic/toc time of 0.975 seconds. Not bad; additional investigation shows that most of the time is consumed in the grp2idx call.

If we reimplement this as a simple, non-vectorized loop, including the gridx computation, like this:

tic
[grnum, grname] = grp2idx(group);
grvalue = -inf*ones(size(grname));
gridx = zeros(size(grname));
for ixValue = 1:length(value)
    tmpGrIdx = grnum(ixValue);
    if value(ixValue) > grvalue(tmpGrIdx)
        grvalue(tmpGrIdx) = value(ixValue);
        gridx(tmpGrIdx) = ixValue;
    end
end
toc

the tic/toc time is about 0.847 seconds, slightly faster than the original code.
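Since the question asks for min as well as max, here is a sketch of the same loop adapted for minima, run on the small example from the question. The names grmin and gridxMin are made up for this illustration, and the group numbers are written out by hand instead of calling grp2idx.

```matlab
value = [4 1 2 3 5];
grnum = [1; 2; 1; 3; 2];            % group numbers for {'a','b','a','c','b'}

nGroups  = max(grnum);
grmin    = inf(nGroups, 1);         % start above any possible value
gridxMin = zeros(nGroups, 1);       % 0 means "no value seen yet"
for ixValue = 1:numel(value)
    tmpGrIdx = grnum(ixValue);
    if value(ixValue) < grmin(tmpGrIdx)   % strict '<' keeps the first occurrence on ties
        grmin(tmpGrIdx)    = value(ixValue);
        gridxMin(tmpGrIdx) = ixValue;
    end
end
```

Only the initial value (inf instead of -inf) and the comparison direction change relative to the max version.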

Taking this a bit further, most of the time appears to be lost in the cell-array memory access. For example:

tic; groupValues = double(cell2mat(group')); toc  %Requires 0.754 seconds
tic; dummy       =       (cell2mat(group')); toc  %Requires 0.718 seconds

If you initially define your group names as a numeric array (for example, I'll use groupValues as defined above), then the times decrease quite a bit, even using the same code:

groupValues = double(cell2mat(group'));  %I'm assuming this is precomputed
tic
[grnum, grname] = grp2idx(groupValues);
grname = num2cell(char(str2double(grname))); %Recapturing your original names
grvalue = -inf*ones(size(grname));
gridx = zeros(size(grname));
for ixValue = 1:length(value)
    tmpGrIdx = grnum(ixValue);
    if value(ixValue) > grvalue(tmpGrIdx)
        grvalue(tmpGrIdx) = value(ixValue);
        gridx(tmpGrIdx) = ixValue;
    end
end
toc

This produces a tic/toc time of 0.16 seconds.
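If grp2idx (a Statistics Toolbox function) is not available, the third output of the base MATLAB function unique can supply the group numbers instead. This is only a sketch of that substitution, not something the original answer measured, so no claim is made here about its speed relative to grp2idx.

```matlab
value = [4 1 2 3 5];
group = {'a','b','a','c','b'};

% unique returns the sorted distinct names and, as its third output,
% the position of each original entry within that list
[grname, ~, grnum] = unique(group);
grvalue = accumarray(grnum(:), value(:), [], @max);
```

Note that unique returns the names in sorted order, which matches the example here but need not match the order of first appearance.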
