Matlab: Fundamental limitations of struct array?


Problem description

I've been using Matlab on and off for decades. I thought I had a good grip on arrays, structs, cell arrays, tables, an array of structs, and a struct in which each field is an array. For the latter two, I assumed that each field needed to be of uniform type. I'm finding that no such limitation exists:
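A minimal sketch of the kind of heterogeneity described, using hypothetical field names and values (this is an illustration, not the asker's original example):

```matlab
% Struct array: every element shares the (hypothetical) field name 'value',
% but each element's field can hold a different type and size.
sa(1).value = 1:5;               % 1x5 double
sa(2).value = 'some text';       % char row vector
sa(3).value = {pi, 'mixed'};     % 1x2 cell array

% Scalar struct: each field is an independent array of any type and size.
ss.numbers = magic(3);           % 3x3 double
ss.label   = 'abc';              % 1x3 char
ss.flags   = [true false];       % 1x2 logical

% Collect the class of each element's 'value' field.
classes = arrayfun(@(e) class(e.value), sa, 'UniformOutput', false);
```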

Perhaps Matlab is becoming more flexible with the years (I'm using 2015b), but it does undermine my confidence in choosing the best type of variable for a task if I find that my understanding of the limitations of each type is wrong. For the purpose of this question, I can't really articulate the needs of the task, because the manner in which I break down a large to-do into tasks depends on my understanding of the data types at my disposal and their advantages/limitations.

I can read (and have read) the online documentation ad nauseam. While it walks you through code that illustrates what the data types can do, I haven't yet come across a succinct description of the comparative limitations of cell arrays, arrays of structs, and structs whose fields are themselves arrays, at a level that would let me choose the best structure in a given situation. I do find the basics, e.g., that the same field names occur in each struct of a struct array (but as the above example shows, each field of each struct can contain highly heterogeneous data types and/or array sizes).

THE QUESTION

Can anyone point to such a comparison of limitations between cell arrays, arrays of structs, and scalar structs whose fields are themselves arrays? I'm looking for a treatment at a level that informs a coder in deciding on the best trade-off between (i) speed, (ii) memory, and (iii) readability, maintainability, and evolvability.

I've deliberately left out tables because, although I'm enamoured of their convenient access to, and subsetting of, data sets (and presentation thereof), they have proved rather slow for manipulation of data. They have their uses, and I use them liberally, but I'm not interested in them for the purpose of this comparison, which is under-the-hood algorithm coding.

Solution

I think your question eventually narrows down to these three "types" of data structures:

comparative limitations between cell arrays, arrays of structs, and structs whose fields are themselves arrays

[Note that "structs whose fields are themselves arrays" I translate as "scalar structs" here. An array of structs can also contain arbitrary arrays. My thinking becomes clear below, I hope.]

To me, these are not very different. All three are containers for heterogeneous data. (Heterogeneous data is non-uniform data: each data element is potentially of a different type and size.) Each of these statements can return an array of any type, unrelated to the type of any other array in the container:

  • cell array: array{i,j}

  • struct array: array(i,j).value

  • scalar struct: array.value

So it all depends on how you want to index:

array(i,j).value
       ^     ^
       A     B

If you want to index using A only, use a cell array (though you then need curly braces, of course). If you want to index using B only, use a scalar struct. If you want both A and B, use a struct array.
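The three indexing choices can be sketched as follows. This is a minimal illustration; the variable and field names are placeholders, not anything from the question:

```matlab
% Index by position only (A): cell array; curly braces return the contents.
c = cell(2, 2);
c{1, 2} = 'text';
c{2, 1} = rand(3);

% Index by name only (B): scalar struct.
s.value = 1:10;
s.label = 'text';

% Index by position and by name (A and B): struct array.
sa = struct('value', cell(2, 2));   % 2x2 struct array, every value is []
sa(1, 2).value = 'text';
sa(2, 1).value = rand(3);
```

In all three cases the stored arrays are independent of each other in type and size; only the way they are addressed differs.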

There is no difference in cost that I'm aware of. Each of the arrays contained in these containers takes up some space. The spatial overhead of the various containers is similar, and I have never noted a time overhead difference.

However, there is a huge difference between these two:

array(i).value   % s1
array.value(i)   % s2

I think that the question deals with this difference also. s1 has a lot more spatial overhead than s2:

>> s1=struct('value',num2cell(1:100))
s1 = 
  1×100 struct array with fields:
    value
>> s2=struct('value',1:100)
s2 = 
  struct with fields:
    value: [1×100 double]
>> whos
  Name      Size             Bytes  Class     Attributes
  s1        1x100            12064  struct              
  s2        1x1                976  struct              

The data needs 800 bytes, so s2 has 176 bytes of overhead, whereas s1 has 11264 (1408%)!

The reason is not the container, but the fact that we're storing one array with 100 elements in one, and 100 arrays with one element in the other. Each array has a header of a certain size that MATLAB uses to know what type of array it is, what sizes it has, to manage its storage and the delayed copy mechanism. The fewer arrays one has, the less memory one uses.

So, don't use a heterogeneous container to store scalars! These things only make sense when you need to store larger arrays, or arrays of different type or size.
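If data already sits in the s1 layout, it can be collapsed into the s2 layout. A minimal sketch, assuming every element's value is a scalar of the same type (exact byte counts vary between MATLAB versions):

```matlab
% Struct array holding one scalar per element (the s1 layout).
s1 = struct('value', num2cell(1:100));

% Collapse into a scalar struct holding one 1x100 array (the s2 layout).
% s1.value expands to a comma-separated list, which [] concatenates.
s2 = struct('value', [s1.value]);

% Compare the memory footprint of the two layouts.
info1 = whos('s1');
info2 = whos('s2');
fprintf('s1: %d bytes, s2: %d bytes\n', info1.bytes, info2.bytes);
```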


The heterogeneous container that is not explicitly asked about (and, after the edit, explicitly not asked about) is the table. A table is similar to a scalar struct in that each column of the table is a single array, and different columns can have different types. Note that it is possible to use a cell array as a column, allowing heterogeneous elements to be stored in that column, but tables make most sense if this is not the case.

One difference with a scalar struct is that each column must have the same number of rows. Another difference is that indexing can look like that of a cell array, a scalar struct, or a struct array.

Thus, the table imposes some constraints on the contained data, which is very beneficial in some circumstances.
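A minimal sketch of these table properties, with hypothetical variable names and values:

```matlab
% Each column (table variable) is one array; columns may differ in type.
name = {'a'; 'b'; 'c'};     % cell column of char vectors
age  = [31; 45; 27];        % numeric column
T = table(name, age, 'VariableNames', {'name', 'age'});

% Several indexing styles are available:
x1  = T.age(2);             % like a scalar struct: field, then element
x2  = T{2, 'age'};          % like a cell array: braces return the contents
row = T(2, :);              % parentheses return a 1x2 sub-table

% Columns with different numbers of rows are rejected by the constructor.
try
    table([1; 2; 3], [1; 2]);
    sameHeightEnforced = false;
catch
    sameHeightEnforced = true;
end
```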

However, and as the OP noted, working with tables is slower than working with structs. This is because table is a custom class, not a native type like structs and cell arrays. If you type edit table in MATLAB, you'll see the source code, how it's implemented. It's a classdef file, just like something any of us could write. Consequently, it has the same speed limitations: the JIT is not optimized for it, indexing into a table implies running a function written as an M-file, etc.
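One way to check this on your own installation is a rough timing sketch like the following. The loop size and variable names are arbitrary assumptions, and the measured times depend on the MATLAB version and machine, so no particular numbers are implied:

```matlab
n = 1e4;
s.x = rand(n, 1);                              % scalar struct with one array field
T = table(s.x, 'VariableNames', {'x'});        % table holding the same data

% Element-wise access through the struct field.
accStruct = 0;
tic;
for k = 1:n
    accStruct = accStruct + s.x(k);
end
tStruct = toc;

% The same access pattern through the table.
accTable = 0;
tic;
for k = 1:n
    accTable = accTable + T.x(k);
end
tTable = toc;

fprintf('struct: %.4f s, table: %.4f s\n', tStruct, tTable);
```

If the table turns out to be the bottleneck, a common workaround is to pull the column out once (for example x = T.x) and loop over the plain array.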


One more thing: Don't create cell arrays of structs, or scalar structs with cell arrays. This increases the levels of containers, which increases overhead (both in space and time), and makes the contents more difficult to use. I have seen questions here on SO related to difficulty accessing data, caused by this type of construct:

data{i,j}.value   % A cell array with structs. Don't do this!
data.value{i,j}   % A struct with cell arrays. Don't do this!

The first example is equivalent to a struct array (with a lot more overhead), except that there is no control over the struct fields within each cell. That is, it is possible for one of the cells to not have a .value field.

The second example makes sense only if value is a different size than a second struct field. If all struct fields are (supposed to be) cell arrays of the same size like this, then use a struct array. Again, less overhead and more uniformity.
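Both nested constructs can usually be flattened into a struct array. A minimal sketch, assuming all structs share the same fields and all cell arrays have the same size (the names and values are hypothetical):

```matlab
% Cell array of structs -> struct array (requires identical field names).
c = {struct('value', 1), struct('value', 'abc')};
sa1 = [c{:}];                                     % 1x2 struct array

% Scalar struct of equally sized cell arrays -> struct array.
d.value = {1, 'abc'};
d.name  = {'first', 'second'};
sa2 = struct('value', d.value, 'name', d.name);   % 1x2 struct array
```

When struct is given cell arrays as field values, it creates one struct element per cell, which is what makes the second conversion work.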
