Excessively large overhead in MATLAB .mat file

Question

I am parsing a large text file full of data and then saving it to disk as a *.mat file so that I can easily load in only parts of it (see here for more information on reading in the files, and here for the data). To do so, I read in one line at a time, parse the line, and then append it to the file. The problem is that the file itself is >3 orders of magnitude larger than the data contained therein!

Here is a stripped down version of my code:

database = which('01_hit12.par');
[directory,filename,~] = fileparts(database);
matObj = matfile(fullfile(directory,[filename '.mat']),'Writable',true);

fidr = fopen(database);
hitranTemp = fgetl(fidr);
k = 1;
while ischar(hitranTemp)
    if abs(hitranTemp(1)) == 32;
        hitranTemp(1) = '0';
    end

    hitran = textscan(hitranTemp,'%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6u%2u%2u%2u%2u%2u%2u%1c%7f%7f','delimiter','','whitespace','');

    matObj.moleculeNumber(1,k)      = uint8(hitran{1});
    matObj.isotopeologueNumber(1,k) = uint8(hitran{2});
    matObj.vacuumWavenumber(1,k)    = hitran{3};
    matObj.lineIntensity(1,k)       = hitran{4};
    matObj.airWidth(1,k)            = single(hitran{6});
    matObj.selfWidth(1,k)           = single(hitran{7});
    matObj.lowStateE(1,k)           = single(hitran{8});
    matObj.tempDependWidth(1,k)     = single(hitran{9});
    matObj.pressureShift(1,k)       = single(hitran{10});

    if rem(k,1e4) == 0
        % K is the total number of lines in the file; it is defined in the full script
        fprintf('line %u (%2.2f%%)\n',k,100*k/K);
    end
    hitranTemp = fgetl(fidr);
    k = k + 1;
end
fclose(fidr);

I stopped the code after 13,813 of the 224,515 lines had been parsed because it had been taking a very long time and the file size was getting huge, but the last printout indicated that I had only just cleared 10k lines. I cleared the memory, and then ran:

S = whos('-file','01_hit12.mat');
fileBytes = sum([S.bytes]);

T = dir(which('01_hit12.mat'));
diskBytes = T.bytes;

disp([fileBytes diskBytes diskBytes/fileBytes])

and got the output:

524894 896189009 1707.37141022759

What is taking up the extra 895,664,115 bytes? I know the help page says there should be a little extra overhead, but I feel that nearly a GB of descriptive header is a bit excessive!
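
To put those figures in perspective, here is a rough back-of-the-envelope sketch. The variable names are mine; the numbers are the ones quoted above, and the per-line payload assumes the nine variables and data types used in the loop.

```matlab
% Bytes of payload per parsed line: 2 x uint8, 2 x double, 5 x single
bytesPerLine = 2*1 + 2*8 + 5*4;

linesParsed = 13813;        % lines parsed before the run was stopped
fileBytes   = 524894;       % total reported by whos('-file',...)
diskBytes   = 896189009;    % size reported by dir

overhead         = diskBytes - fileBytes;   % bytes not accounted for by the data
overheadPerLine  = overhead / linesParsed;  % average file growth per parsed line
overheadPerWrite = overheadPerLine / 9;     % nine matObj assignments per line

fprintf('payload per line  : %d bytes\n', bytesPerLine);
fprintf('overhead per line : %.0f bytes\n', overheadPerLine);
fprintf('overhead per write: %.0f bytes\n', overheadPerWrite);
```

The payload works out to 38 bytes per line (38 x 13,813 = 524,894), while the file grew by roughly 65 kB per parsed line on average.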

New information:
I tried pre-allocating the file, thinking that perhaps MATLAB was doing the same thing it does when a matrix is grown in a loop, reallocating a chunk of disk space for the entire matrix on each write. That isn't it. Filling the file with zeros of the appropriate data types produces a file for which my short check script returns:

8531570 71467 0.00837677004349727

This makes more sense to me. MATLAB is saving the file sparsely, so the file on disk is much smaller than the full matrices in memory. Once real data start replacing the zeros, however, I get the same behavior as before and the file size skyrockets beyond all reasonable bounds.

New new information:
Tried this on a subset of the data, 100 lines long. To stream to disk, the data has to be in v7.3 format, so I ran the subset through my script, loaded it into memory, and then resaved as v7.0 format. Here are the results:

v7.3: 3800 8752 2.30
v7.0: 3800 2561 0.67

No wonder the v7.3 format isn't the default. Does anyone know a way around this? Is this a bug or a feature?
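
For anyone who wants to repeat the format comparison, a minimal sketch follows. It uses synthetic data rather than the HITRAN subset, and the struct `S`, the variable sizes, and the file names are placeholders of mine, so the numbers it prints will not match the ones above.

```matlab
% Synthetic stand-in for a 100-line subset (values are arbitrary)
n = 100;
S.moleculeNumber   = ones(1,n,'uint8');
S.vacuumWavenumber = rand(1,n);
S.airWidth         = rand(1,n,'single');

% Hypothetical file names in the temporary directory
f70 = fullfile(tempdir,'subset_v70.mat');
f73 = fullfile(tempdir,'subset_v73.mat');

save(f70,'-struct','S','-v7');      % version 7 MAT-file
save(f73,'-struct','S','-v7.3');    % version 7.3 (HDF5-based) MAT-file

% In-memory size of the stored variables versus size of each file on disk
info     = whos('-file',f70);
memBytes = sum([info.bytes]);
d70 = dir(f70);
d73 = dir(f73);
fprintf('v7.3: %d %d %.2f\n', memBytes, d73.bytes, d73.bytes/memBytes);
fprintf('v7.0: %d %d %.2f\n', memBytes, d70.bytes, d70.bytes/memBytes);
```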

Recommended answer

This seems like a bug to me. A workaround is to write in chunks to pre-allocated arrays.

First, pre-allocate:

fid = fopen('01_hit12.par', 'r');
data = fread(fid, inf, 'uint8');
nlines = nnz(data == 10) + 1;
fclose(fid);

matObj.moleculeNumber = zeros(1,nlines,'uint8');
matObj.isotopeologueNumber = zeros(1,nlines,'uint8');
matObj.vacuumWavenumber = zeros(1,nlines,'double');
matObj.lineIntensity = zeros(1,nlines,'double');
matObj.airWidth = zeros(1,nlines,'single');
matObj.selfWidth = zeros(1,nlines,'single');
matObj.lowStateE = zeros(1,nlines,'single');
matObj.tempDependWidth = zeros(1,nlines,'single');
matObj.pressureShift = zeros(1,nlines,'single');
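
One caveat on the line count: `nnz(data == 10) + 1` counts one line too many when the file ends with a newline, which pre-allocates one extra element per variable. A minimal sketch of a count that handles both cases, using a small hypothetical test file:

```matlab
% Write a three-line test file that ends with a newline
% (hypothetical file name, placed in the temporary directory)
fname = fullfile(tempdir,'linecount_demo.txt');
fid = fopen(fname,'w');
fprintf(fid,'line 1\nline 2\nline 3\n');
fclose(fid);

fid  = fopen(fname,'r');
data = fread(fid, inf, 'uint8=>uint8');   % keep the bytes as uint8 to save memory
fclose(fid);

nlines = nnz(data == 10);                 % count line-feed characters
if ~isempty(data) && data(end) ~= 10
    nlines = nlines + 1;                  % last line has no terminating newline
end
fprintf('%d lines\n', nlines);
```

This may be why the `whos` total further down is 38 bytes (one record) larger than the 8,531,570 bytes reported for the zero-filled file in the question, although I have not checked the data file itself.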

Then to write in chunks of 10000, I modified your code as follows:

... % your code plus pre-alloc first
bs = 10000;
while ischar(hitranTemp)
    % parse up to bs lines into a cell array held in memory
    for ii = 1:bs
        % replace a leading space with '0' on every line, not just the first of each chunk
        if abs(hitranTemp(1)) == 32
            hitranTemp(1) = '0';
        end
        hitran{ii} = textscan(hitranTemp,'%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6u%2u%2u%2u%2u%2u%2u%1c%7f%7f','delimiter','','whitespace','');
        hitranTemp = fgetl(fidr);
        if ~ischar(hitranTemp), bs = ii; break; end   % end of file: shrink the last chunk
    end

    % this part really ugly, sorry! trying to keep it compact...
    % builtin('_paren',x,1:bs) is x(1:bs); it drops stale entries on the last, shorter chunk
    % field 5 is skipped, matching the indices used in the question's code
    matObj.moleculeNumber(1,k:k+bs-1)      = uint8(builtin('_paren',cellfun(@(c)c{1},hitran),1:bs));
    matObj.isotopeologueNumber(1,k:k+bs-1) = uint8(builtin('_paren',cellfun(@(c)c{2},hitran),1:bs));
    matObj.vacuumWavenumber(1,k:k+bs-1)    = builtin('_paren',cellfun(@(c)c{3},hitran),1:bs);
    matObj.lineIntensity(1,k:k+bs-1)       = builtin('_paren',cellfun(@(c)c{4},hitran),1:bs);
    matObj.airWidth(1,k:k+bs-1)            = single(builtin('_paren',cellfun(@(c)c{6},hitran),1:bs));
    matObj.selfWidth(1,k:k+bs-1)           = single(builtin('_paren',cellfun(@(c)c{7},hitran),1:bs));
    matObj.lowStateE(1,k:k+bs-1)           = single(builtin('_paren',cellfun(@(c)c{8},hitran),1:bs));
    matObj.tempDependWidth(1,k:k+bs-1)     = single(builtin('_paren',cellfun(@(c)c{9},hitran),1:bs));
    matObj.pressureShift(1,k:k+bs-1)       = single(builtin('_paren',cellfun(@(c)c{10},hitran),1:bs));

    k = k + bs;
    fprintf('.');
end
fclose(fidr);

The final size on disk is 21,393,408 bytes. The usage breaks down as,

>> S = whos('-file','01_hit12.mat');
>> fileBytes = sum([S.bytes]);
>> T = dir(which('01_hit12.mat'));
>> diskBytes = T.bytes; ratio = diskBytes/fileBytes;
>> fprintf('%10d whos\n%10d disk\n%10.6f\n',fileBytes,diskBytes,ratio)
   8531608 whos
  21389582 disk
  2.507099

Still fairly inefficient, but not out of control.
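
For reference, the same pre-allocate-then-write-in-chunks pattern can be sketched in a self-contained form. This is only an illustration on synthetic data: `src`, `fname`, and the sizes are placeholders of mine, and the step that would parse text lines is replaced by slicing a random vector.

```matlab
% Synthetic stand-in for the parsed records (hypothetical names and sizes)
nlines = 25000;
bs     = 10000;                               % chunk size, as in the answer
src    = rand(1,nlines);                      % pretend these are parsed values

fname = fullfile(tempdir,'chunk_demo.mat');   % hypothetical output file
if exist(fname,'file'), delete(fname); end
m = matfile(fname,'Writable',true);
m.vacuumWavenumber = zeros(1,nlines,'double');   % pre-allocate on disk
m.airWidth         = zeros(1,nlines,'single');

for k = 1:bs:nlines
    idx = k:min(k+bs-1,nlines);               % the last chunk may be shorter
    % Fill plain in-memory buffers first (parsing would happen here) ...
    wn = src(idx);
    aw = single(src(idx));
    % ... then touch the MAT-file once per variable per chunk
    m.vacuumWavenumber(1,idx) = wn;
    m.airWidth(1,idx)         = aw;
end

% Read one variable back and compare it with the source data
back = m.vacuumWavenumber;
fprintf('max abs difference: %g\n', max(abs(back - src)));
```

With these sizes each variable is written three times instead of 25,000 times, which is the point of the workaround.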
