Dataset array indexing is very slow with Statistics Toolbox

Question

Why is indexing into a dataset array so slow? A peek into the dataset.subsref function shows that all the columns of the dataset are stored in a cell array. However, cell indexing is much, much faster than dataset indexing, which is just indexing into a cell array under the hood. My guess is that this has to do with some overhead of MATLAB OOP. Any ideas on how to speed this up?

%% Using R2011a, PCWIN64
feature accel off;  % turn off JIT

dat = (1:1e6)';
dat2 = repmat({'abc'}, 1e6, 1);
celldat = {dat dat2};
ds = dataset(dat, dat2);
N = 1e2;

tic;
for j = 1:N
    tmp = celldat{2};
end
toc;

tic;
for j = 1:N
    tmp2 = ds.dat2; % 2.778sec spent on line 262 of dataset.subsref
end
toc;

feature accel on;  % turn JIT back on

Elapsed time is 0.000165 seconds.
Elapsed time is 2.778995 seconds.

EDIT: I've updated the example to be more like the problem I'm seeing. A huge amount of time is spent on line 262 of dataset.subsref: "b = a.data{varIndex};". This is very strange to me, since it is a simple cell dereference. I'm wondering if there is an OOP trick that will allow me to index into "a.data" without the strange overhead.

EDIT2: As per Andrew's suggestion, I've submitted this as a bug to MathWorks. Will update if I hear anything from them.

EDIT3: MathWorks responded and said they are now aware of the problem and will fix it in a future release. They noted that the problem is specific to cell arrays, and advised avoiding them where possible.

Recommended answer

Yes, you are most likely seeing the overhead of MATLAB OOP method calls. They are expensive compared to cell indexing, or to method calls in some other languages. Your .513872 seconds / 1e4 ~= 51 microseconds per call, which is the approximate cost of a few MCOS method calls; they're ~5-15 microseconds each on machines I've seen. So that looks like the method overhead of the subsref() call itself plus the other methods and property accesses it calls in turn.

For some details and discussion, see: Is MATLAB OOP slow or am I doing something wrong?

I don't know of a way to make this faster, aside from structuring your code to minimize calls to "ds.dat" or other methods. If possible, when working with the dataset, call "ds.dat" once, keep the result in a local variable and work with it there, and then push it back into the ds object.
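
The "read once, work locally, write back" pattern described above can be sketched roughly as follows. This is only an illustration, assuming the Statistics Toolbox dataset class and the variable names from the question; the upper() call is a placeholder for whatever per-element work the real loop does.

```matlab
% Build a dataset like the one in the question (requires Statistics Toolbox).
dat  = (1:1e6)';
dat2 = repmat({'abc'}, 1e6, 1);
ds   = dataset(dat, dat2);

% One dataset.subsref call: copy the column into a plain cell array.
col = ds.dat2;

% Work on the local variable; this is ordinary cell indexing and
% never goes through the dataset object's methods.
for j = 1:100
    col{j} = upper(col{j});   % placeholder for the real work
end

% One dataset.subsasgn call: push the modified column back.
ds.dat2 = col;
```

With this structure the object is indexed twice in total rather than once per loop iteration, so whatever the per-call cost turns out to be, it is paid only at the boundaries of the loop.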

Caveat: I don't know what "feature accel" does or how it could affect these timings.

I threw it in the profiler like Richie suggested. On my R2009b, looks like about half the time is method call overhead, and the rest in find(), strcmp(), and other operations inside subsref; subsref doesn't call any other methods in turn.
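
For anyone who wants to repeat that measurement, a minimal profiling sketch might look like the one below. It assumes the same dataset as in the question and uses only the built-in profile function; the loop count and the choice to print the top five entries are arbitrary, and the breakdown will differ between releases.

```matlab
% Same test data as in the question (requires Statistics Toolbox).
dat  = (1:1e6)';
dat2 = repmat({'abc'}, 1e6, 1);
ds   = dataset(dat, dat2);

profile on
for j = 1:100
    tmp = ds.dat2;            % the access being investigated
end
profile off

% Collect the results and list the most expensive functions.
stats = profile('info');
[~, order] = sort([stats.FunctionTable.TotalTime], 'descend');
top = stats.FunctionTable(order(1:min(5, numel(order))));
for k = 1:numel(top)
    fprintf('%-40s %8.3f s  %d calls\n', ...
        top(k).FunctionName, top(k).TotalTime, top(k).NumCalls);
end
% "profile viewer" opens the same data as an interactive report,
% including per-line timings inside dataset.subsref.
```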

Edit 2: The revised example is showing much higher timings. Method call overhead doesn't account for all that.
