Fundamental limitations of cell arrays, arrays of structs, and scalar structs?


Problem description

I've been using Matlab on and off for decades. I thought I had a good grip on arrays, structs, cell arrays, tables, an array of structs, and a struct in which each field is an array. For the latter two, I assumed that each field needed to be of uniform type. I'm finding that no such limitation exists:

Perhaps Matlab is becoming more flexible with the years (I'm using 2015b), but it does undermine my confidence in choosing the best type of variable for a task if I find that understanding of the limitations of each type is wrong. For the purpose of this question, I can't really articulate the needs of the task because the manner in which I break down a large to-do into tasks depends on my understanding of the data types at my disposal, and their advantages/limitations.

I can read (and have read) the online documentation ad nauseam, and while it walks you through code to illustrate what the data types are able to do, I haven't yet come across a succinct description of the comparative limitations between cell arrays, arrays of structs, and structs whose fields are themselves arrays -- to the point that I can use that knowledge to choose the best structure in a given situation. I do find the basic stuff, e.g., that the same field names will occur in each struct of a struct array (but as the above example shows, each field of each struct can contain highly heterogeneous data types and/or array sizes).

THE QUESTION

Can anyone point to such a comparison of limitations between cell arrays, arrays of structs, and scalar structs whose fields are themselves arrays? I'm looking for a treatment at a level that informs a coder in deciding on the best trade-off between (i) speed, (ii) memory, and (iii) readability, maintainability, and evolvability.

I've deliberately left out tables because, although I'm enamoured of their convenient access to, and subsetting of, data sets (and presentation thereof), they have proved rather slow for manipulation of data. They have their uses, and I use them liberally, but I'm not interested in them for the purpose of this comparison, which is under-the-hood algorithm coding.

Solution

I think your question eventually narrows down to these three "types" of data structures:

comparative limitations between cell arrays, arrays of structs, and structs whose fields are themselves arrays

[Note that "structs whose fields are themselves arrays" I translate as "scalar structs" here. An array of structs can also contain arbitrary arrays. My thinking becomes clear below, I hope.]

To me, these are not very different. All three are containers for heterogeneous data. (Heterogeneous data is non-uniform data, where each data element is potentially of a different type and size.) Each of these statements can return an array of any type, unrelated to the type of any other array in the container:

  • cell array: array{i,j}

  • struct array: array(i,j).value

  • scalar struct: array.value

So it all depends on how you want to index:

array(i,j).value
       ^     ^
       A     B

If you want to index using A only, use a cell array (though you then need curly braces, of course). If you want to index using B only, use a scalar struct. If you want both A and B, use a struct array.
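
As a rough illustration of the three indexing styles, each container below holds elements of unrelated types and sizes. The variable names (c, sa, s) and the stored values are made up for this sketch:

```matlab
% Cell array: index by position only (A), using curly braces.
c = {magic(3), 'text'; int8(5), {1, 2}};
a = c{1,2};                  % the char array 'text'

% Struct array: index by position (A) and by field name (B).
% Passing a 2x2 cell array to struct() creates a 2x2 struct array.
sa = struct('value', {magic(3), 'text'; int8(5), {1, 2}});
b = sa(2,1).value;           % the int8 scalar 5

% Scalar struct: index by field name only (B).
s = struct('first', magic(3), 'second', 'text');
d = s.second;                % the char array 'text'
```

In all three cases the element that comes back can be of any type, independent of its neighbours; only the way you address it differs.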

There is no difference in cost that I'm aware of. Each of the arrays contained in these containers takes up some space. The spatial overhead of the various containers is similar, and I have never noted a time overhead difference.

However, there is a huge difference between these two:

array(i).value   % s1
array.value(i)   % s2

I think that the question deals with this difference also. s1 has a lot more spatial overhead than s2:

>> s1=struct('value',num2cell(1:100))
s1 = 
  1×100 struct array with fields:
    value
>> s2=struct('value',1:100)
s2 = 
  struct with fields:
    value: [1×100 double]
>> whos
  Name      Size             Bytes  Class     Attributes
  s1        1x100            12064  struct              
  s2        1x1                976  struct              

The data needs 800 bytes, so s2 has 176 bytes of overhead, whereas s1 has 11264 (1408%)!

The reason is not the container, but the fact that we're storing one array with 100 elements in one case, and 100 arrays with one element each in the other. Each array has a header of a certain size that MATLAB uses to know what type of array it is and what size it has, and to manage its storage and the delayed copy mechanism. The fewer arrays one has, the less memory one uses.

So, don't use a heterogeneous container to store scalars! These things only make sense when you need to store larger arrays, or arrays of different type or size.
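
To make this concrete, here is a small sketch (not part of the original answer) that gathers the 100 scalars of s1 into a single array inside a scalar struct and prints the sizes that whos reports. The name packed is made up for the example, and the exact byte counts depend on the MATLAB release, so none are shown:

```matlab
s1 = struct('value', num2cell(1:100));   % 100 structs, one scalar each
s2 = struct('value', 1:100);             % 1 struct holding one 1x100 array

% The field of a struct array can be gathered into one array with [s1.value].
packed.value = [s1.value];
same = isequal(packed, s2);              % both hold the same 1x100 double

% Called with an output argument, whos returns a struct with a 'bytes' field.
info1 = whos('s1');
info2 = whos('s2');
fprintf('s1: %d bytes, s2: %d bytes\n', info1.bytes, info2.bytes);
```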


The heterogeneous container that is not explicitly asked about (and after the edit explicitly not asked about) is the table. A table is similar to a scalar struct in that each column of the table is a single array, and different columns can have different types. Note that it is possible to use a cell array as a column, allowing heterogeneous elements to be stored in a column, but tables make the most sense when this is not the case.

One difference with a scalar struct is that each column must have the same number of rows. Another difference is that indexing can look like that of a cell array, a scalar struct, or a struct array.

Thus, the table imposes some constraints on the contained data, which is very beneficial in some circumstances.
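
A minimal sketch of these two points, assuming a MATLAB release that has tables (R2013b or later). The table T, its column names, and the flag rejected are hypothetical:

```matlab
% Two columns of different types, three rows each.
T = table([1; 2; 3], {'a'; 'bb'; 'ccc'}, 'VariableNames', {'id', 'name'});

x   = T.id(2);       % struct-like: field access, then index into the column
y   = T{2, 'id'};    % cell-like: braces return the stored value
row = T(2, :);       % array-like: parentheses return a one-row table

% A new column with the wrong number of rows is expected to be rejected.
rejected = false;
try
    T.bad = [1; 2];  % only 2 rows offered to a 3-row table
catch
    rejected = true;
end
```

The try/catch is only there so the sketch runs to completion; in real code the error itself is the useful signal that the data are inconsistent.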

However, and as the OP noted, working with tables is slower than working with structs. This is because table is a custom class, not a native type like structs and cell arrays. If you type edit table in MATLAB, you'll see the source code, how it's implemented. It's a classdef file, just like something any of us could write. Consequently, it has the same speed limitations: the JIT is not optimized for it, indexing into a table implies running a function written as an M-file, etc.


One more thing: don't create cell arrays of structs, or scalar structs with cell arrays. This increases the levels of containers, which increases overhead (both in space and time), and makes the contents more difficult to use. I have seen questions here on Stack Overflow about difficulty accessing data, caused by this type of construct:

data{i,j}.value   % A cell array with structs. Don't do this!
data.value{i,j}   % A struct with cell arrays. Don't do this!

The first example is equal to a struct array (with a lot more overhead), except there is no control over the struct fields within each cell. That is, it is possible for one of the cells to not have a .value field.

The second example makes sense only if value is a different size than a second struct field. If all struct fields are (supposed to be) cell arrays of the same size like this, then use a struct array. Again, less overhead and more uniformity.
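
Both nested layouts can usually be flattened into a struct array. The following is a sketch under the assumption that all structs share the same fields and all cell-array fields have the same size; the variable names and values are made up:

```matlab
% Case 1: a cell array of structs (the discouraged layout).
data = {struct('value', 1), struct('value', 'abc'), struct('value', [4 5 6])};
s = [data{:}];               % same fields everywhere, so this is a 1x3 struct array
v = s(2).value;              % same content as data{2}.value

% Nothing stops a cell from holding a struct that lacks the field.
data{4} = struct('other', 0);
hasValue = cellfun(@(x) isfield(x, 'value'), data);   % false for the fourth cell

% Case 2: a scalar struct whose fields are same-sized cell arrays ...
t.name  = {'x', 'y'};
t.value = {1, [2 3]};
% ... maps directly onto a struct array, one element per cell.
u = struct('name', t.name, 'value', t.value);         % 1x2 struct array
```

After the fourth cell is added, [data{:}] would no longer work, which is exactly the lack of control over field names described above.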
