Matlab Table / Dataset type optimization


Question

I am looking for an optimized data type for an "observations-variables" table in MATLAB that can be accessed quickly and easily both by column (variables) and by row (observations).

Here is a comparison of the existing MATLAB data types:

  1. Matrix is very fast; however, it has no built-in index labels/enumerations for its dimensions, and you can't always remember which variable name belongs to which column index.
  2. Table has very poor performance, especially when reading individual rows/columns in a for loop (I suppose it runs some slow conversion methods, and is designed to be more Excel-like).
  3. Scalar structure (structure of column arrays): fast column-wise access to variables as vectors, but slow row-wise conversion to observations.
  4. Nonscalar structure (array of structures): fast row-wise access to observations, but slow column-wise conversion to variables.

I wonder whether there is a simpler, optimized version of the Table data type that I could use if I just want to combine row-number indexing with column-variable indexing, either for numerical variables only or for any variable type.

Results of the test script:

----
TEST1 - reading individual observations
Matrix: 0.072519 sec
Table: 18.014 sec
Array of structures: 0.49896 sec
Structure of arrays: 4.3865 sec
----
TEST2 - reading individual variables
Matrix: 0.0047834 sec
Table: 0.0017972 sec
Array of structures: 2.2715 sec
Structure of arrays: 0.0010529 sec

Test script:

Nobs = 1e5; % number of observations-rows
varNames = {'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O'};
Nvar = numel(varNames); % number of variables-columns

M = randn(Nobs, Nvar); % matrix

T = array2table(M, 'VariableNames', varNames); % table

NS = struct; % nonscalar structure = array of structures
for i=1:Nobs
    for v=1:Nvar
        NS(i).(varNames{v}) = M(i,v);
    end
end

SS = struct; % scalar structure = structure of arrays
for v=1:Nvar
    SS.(varNames{v}) = M(:,v);
end

%% TEST 1 - reading individual observations (row-wise)
disp('----'); disp('TEST1 - reading individual observations');

tic; % matrix
for i=1:Nobs
    x = M(i,:);
end
disp(['Matrix: ', num2str(toc), ' sec']);

tic; % table
for i=1:Nobs
    x = T(i,:);
end
disp(['Table: ', num2str(toc), ' sec']);

tic; % nonscalar structure = array of structures
for i=1:Nobs
    x = NS(i);
end
disp(['Array of structures: ', num2str(toc), ' sec']);

tic; % scalar structure = structure of arrays
for i=1:Nobs
    for v=1:Nvar
        x.(varNames{v}) = SS.(varNames{v})(i);
    end
end
disp(['Structure of arrays: ', num2str(toc), ' sec']);

%% TEST 2 - reading individual variables (column-wise)
disp('----'); disp('TEST2 - reading individual variables');

tic; % matrix
for v=1:Nvar
    x = M(:,v);
end
disp(['Matrix: ', num2str(toc), ' sec']);

tic; % table
for v=1:Nvar
    x = T.(varNames{v});
end
disp(['Table: ', num2str(toc), ' sec']);

tic; % nonscalar structure = array of structures
for v=1:Nvar
    for i=1:Nobs
        x(i,1) = NS(i).(varNames{v});
    end
end
disp(['Array of structures: ', num2str(toc), ' sec']);

tic; % scalar structure = structure of arrays
for v=1:Nvar
    x = SS.(varNames{v});
end
disp(['Structure of arrays: ', num2str(toc), ' sec']);
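
As a cross-check on the row-read timings, here is a minimal sketch that uses timeit instead of tic/toc. It is not part of the original script: the smaller sizes, the row index i, and the extra T{i,:} case (brace indexing, which returns a numeric row rather than a one-row table) are assumptions added for illustration, and no timings are claimed for it.

```matlab
% Hypothetical cross-check of the row-read timings using timeit, which
% calls the function handle repeatedly and returns the median time.
Nobs = 1e3;                                    % smaller than the original 1e5
varNames = {'A','B','C'};                      % shortened name list
M = randn(Nobs, numel(varNames));              % matrix
T = array2table(M, 'VariableNames', varNames); % table

i = 10;                                        % arbitrary row to read
tMatrix     = timeit(@() M(i,:));              % numeric row from the matrix
tTableParen = timeit(@() T(i,:));              % one-row table, as in TEST 1
tTableBrace = timeit(@() T{i,:});              % numeric row from the table

fprintf('M(i,:): %g sec\n', tMatrix);
fprintf('T(i,:): %g sec\n', tTableParen);
fprintf('T{i,:}: %g sec\n', tTableBrace);
```

timeit may warn that very fast operations are measured inaccurately, so the numbers are best read as relative rather than absolute.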

Recommended answer

I would use matrices, since they're the fastest and most straightforward to use, and then create a set of enumerated column labels to make indexing columns easier. Here are a few ways to do this:


Use a containers.Map object: Given your variable names, and assuming they map in order to columns 1 through N, you can create a mapping like so:

varNames = {'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O'};
col = containers.Map(varNames, 1:numel(varNames));

Now you can use the map to access columns of your data by variable name. For example, to fetch the columns for variables A and C (i.e. the first and third) from a matrix named data, you would do this:

subData = data(:, [col('A') col('C')]);
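
When several names are looked up at once, the map can also be queried in a single call. The following is a minimal sketch under stated assumptions: the shortened name list and the placeholder matrix built with magic(4) are hypothetical and stand in for the real data.

```matlab
varNames = {'A','B','C','D'};                 % shortened, hypothetical name list
col = containers.Map(varNames, 1:numel(varNames));
data = magic(4);                              % placeholder 4-by-4 matrix

% values(map, keys) returns a cell array holding the mapped column indices
idx = cell2mat(values(col, {'A','C'}));
subData = data(:, idx);                       % columns for A and C

% isKey guards against misspelled variable names before indexing
if ~isKey(col, 'Z')
    disp('Z is not a known variable name');
end
```

Looking up a missing key with col('Z') raises an error, so the isKey check is one way to fail more gracefully.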


Use a struct: You can create a structure with the variable names as its fields and the corresponding column indices as their values like so:

enumData = [varNames; num2cell(1:numel(varNames))];
col = struct(enumData{:});

Here is what col contains:

struct with fields:

  A: 1
  B: 2
  C: 3
  D: 4
  E: 5
  F: 6
  G: 7
  H: 8
  I: 9
  J: 10
  K: 11
  L: 12
  M: 13
  N: 14
  O: 15

You would access columns A and C like this:

subData = data(:, [col.A col.C]);
% ...or with dynamic field names...
subData = data(:, [col.('A') col.('C')]);
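
The same struct can be combined with ordinary row indexing, which covers the "row number plus variable name" access the question asks about. This is a sketch only: the small data matrix is made up, and cell2struct is used here as an equivalent alternative to the struct(enumData{:}) construction above.

```matlab
varNames = {'A','B','C'};                     % shortened, hypothetical name list
% Alternative construction: one field per name, holding its column index
col = cell2struct(num2cell(1:numel(varNames)), varNames, 2);

data = [1 2 3; 4 5 6];                        % made-up 2-by-3 data matrix

obs = data(2, :);                             % whole observation (row 2)
val = data(2, col.C);                         % variable C of observation 2
data(:, col.B) = 0;                           % overwrite variable B for all rows
```

Because col only stores indices, every read and write still goes through plain matrix indexing.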


Create a variable for each column: You could just create a variable in your workspace for every column name and store the column index in it. This will pollute your workspace with more variables, but gives you a terse way to access column data. Here's an easy way to do it, using the much-maligned eval:

enumData = [varNames; num2cell(1:numel(varNames))];
eval(sprintf('%s=%d;', enumData{:}));

Accessing columns A and C is then as simple as:

subData = data(:, [A C]);
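
Since eval executes whatever string it is given, it may be worth validating the names first. The check below is an addition for illustration, not part of the original answer; it uses isvarname to reject names that are not valid MATLAB identifiers, and the shortened name list is hypothetical.

```matlab
varNames = {'A','B','C'};                     % shortened, hypothetical name list

% Refuse to eval anything that is not a valid MATLAB variable name
isValid = cellfun(@isvarname, varNames);
if ~all(isValid)
    error('Invalid variable names: %s', strjoin(varNames(~isValid), ', '));
end

enumData = [varNames; num2cell(1:numel(varNames))];
eval(sprintf('%s=%d;', enumData{:}));         % runs 'A=1;B=2;C=3;'
```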


Use an enumeration class: This is probably a good dose of overkill, but if you're going to use the same mapping of column labels and indices for many analyses, you could create an enumeration class, save it somewhere on your MATLAB path, and never have to worry about defining your column enumerations again. For example, here's a ColVar class with 15 enumerated values:

classdef ColVar < double
  enumeration
    A (1)
    B (2)
    C (3)
    D (4)
    E (5)
    F (6)
    G (7)
    H (8)
    I (9)
    J (10)
    K (11)
    L (12)
    M (13)
    N (14)
    O (15)
  end
end

You would access columns A and C like this:

subData = data(:, [ColVar.A ColVar.C]);
