Extract large Matlab dataset subsets

Question

Referencing and assigning a subset of a MATLAB dataset appears to be extremely inefficient, and possibly scales like rows^2.

Example:

alldata is a large dataset of mixed data - say 150,000 rows by 25 columns (integer, boolean and string).

The dataset's format is:

'format', '%s%u%u%u%u%u%s%s%s%s%s%s%s%u%u%u%u%s%u%s%s%u%s%s%s%s%u%s%u%s%s%s%u%s'
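
For reference, a dataset like this would typically be read from the .csv along the following lines. This is only a sketch: the file name 'alldata.csv' and the comma delimiter are assumptions, not details from the original post.

%# sketch: read the mixed-type .csv into a dataset array (Statistics Toolbox);
%# the file name and delimiter below are assumed
fmt = '%s%u%u%u%u%u%s%s%s%s%s%s%s%u%u%u%u%s%u%s%s%u%s%s%s%s%u%s%u%s%s%s%u%s';
alldata = dataset('File', 'alldata.csv', 'Delimiter', ',', 'Format', fmt);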

I then convert 2 integer columns to boolean type.
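
For illustration, that conversion might look like the following; the column names Flag1 and Flag2 are hypothetical placeholders, not the actual variable names.

%# hypothetical example: convert two integer columns to logical
%# ('Flag1' and 'Flag2' are assumed names)
alldata.Flag1 = logical(alldata.Flag1);
alldata.Flag2 = logical(alldata.Flag2);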

The following subset assignment:

somedata = alldata(1:m,:)

takes >7 seconds for m = 10,000, and a ridiculous amount of time for larger values of m. Plotting time vs. m shows an m^2-type relationship, which is strange given that copying alldata is nearly instantaneous, as is using functions like sortrows and find. In fact, reading in the original .csv data file is faster than the above assignment for large values of m.
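
A timing loop along the following lines reproduces the time-vs-m measurement (a sketch, not the exact code used):

%# sketch: time the subset extraction for increasing m and plot the result
ms = 1000:1000:10000;
t = zeros(size(ms));
for k = 1:numel(ms)
    tic;
    somedata = alldata(1:ms(k), :);  %# the slow subset assignment under test
    t(k) = toc;
end
plot(ms, t, 'o-');
xlabel('m (rows extracted)');
ylabel('time (s)');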

Using the profiler, it appears there is a function subsref that includes a very slow line that checks string comparisons to determine unique values within the dataset. Is this related to how the dataset type is stored (i.e. a reference table)? The dataset includes a large number of unique string values.

Are there any solutions to extracting a subset of a dataset in MATLAB? Such as preallocation (how?), or copying the dataset and deleting rows rather than assigning an extract/subset.
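
A sketch of the copy-and-delete alternative mentioned above (whether it actually avoids the slowdown would need to be measured):

%# sketch: copy the whole dataset, then delete the unwanted rows,
%# instead of indexing out the first m rows directly
somedata = alldata;
somedata(m+1:end, :) = [];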

I am using a dual-core machine with 1.5 GB of RAM, but Task Manager reports less than 1 GB of RAM in use.

Answer

I have previously worked with MATLAB's dataset arrays for large data, and unfortunately it is true that they suffer from performance issues. One thing I found that helps speed things up is to clear the observation names (ObsNames) property.

Try the following fix:

%# I assume you have a 'dataset' object
ds = dataset(...);

%# clear the observation names property (it is simply a label for each record)
ds.Properties.ObsNames = [];
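
As a rough usage check (not part of the original answer, and assuming the alldata dataset from the question), clearing ObsNames before extracting the subset should make the indexing much faster:

%# assumed usage: clear ObsNames on the question's dataset, then time the extraction
alldata.Properties.ObsNames = [];
tic
somedata = alldata(1:10000, :);
toc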
