When writing a large array directly to disk in MATLAB, is there any need to preallocate?


Problem description


I need to write an array that is too large to fit into memory to a .mat binary file. This can be accomplished with the matfile function, which allows random access to a .mat file on disk.
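For concreteness, here is a minimal sketch of the kind of loop in question (the file name bigArray.mat, the variable A, and the sizes are illustrative, not taken from the original post):

    % Open (or create) a v7.3 MAT-file that can be written piecewise,
    % without ever holding the full array in memory.
    m = matfile('bigArray.mat', 'Writable', true);
    for k = 1:100
        % Each assignment grows the on-disk variable A by one column.
        m.A(1:1e6, k) = rand(1e6, 1);
    end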

Normally, the accepted advice is to preallocate arrays, because expanding them on every iteration of a loop is slow. However, when I was asking how to do this, it occurred to me that this may not be good advice when writing to disk rather than RAM.
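(For comparison, the usual in-memory demonstration of that advice, with sizes chosen purely for illustration:)

    n = 1e5;
    tic; a = [];          for k = 1:n, a(k) = k; end; toc   % grown element by element
    tic; b = zeros(1, n); for k = 1:n, b(k) = k; end; toc   % pre-allocated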

Will the same performance hit from growing the array apply, and if so, will it be significant when compared to the time it takes to write to disk anyway?

(Assume that the whole file will be written in one session, so the risk of serious file fragmentation is low.)

Solution

Q: Will the same performance hit from growing the array apply, and if so, will it be significant when compared to the time it takes to write to disk anyway?

A: Yes, performance will suffer if you significantly grow a file on disk without pre-allocating. The performance hit will be a consequence of fragmentation. As you mentioned, fragmentation is less of a risk if the file is written in one session, but will cause problems if the file grows significantly.

A related question was raised on the MathWorks website, and the accepted answer was to pre-allocate when possible.
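With matfile, pre-allocating means sizing the on-disk variable up front, typically by assigning to its last element; a sketch, reusing the illustrative names from the question above:

    m = matfile('bigArray.mat', 'Writable', true);
    m.A(1e6, 100) = 0;                % sizes A to 1e6-by-100 on disk up front
    for k = 1:100
        m.A(:, k) = rand(1e6, 1);     % fill in place; the file no longer grows
    end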

If you don't pre-allocate, then the extent of your performance problems will depend on:

  • your filesystem (how data are stored on disk, the cluster-size),
  • your hardware (HDD seek time, or SSD access times),
  • the size of your mat file (whether it moves into non-contiguous space),
  • and the current state of your storage (existing fragmentation / free space).

Let's pretend that you're running a recent Windows OS, and so are using the NTFS file-system. Let's further assume that it has been set up with the default 4 kB cluster size. So, space on disk gets allocated in 4 kB chunks, and the locations of these chunks are indexed in the Master File Table (MFT). If the file grows and contiguous space is not available, then there are only two choices:

  1. Re-write the entire file to a new part of the disk, where there is sufficient free space.
  2. Fragment the file, storing the additional data at a different physical location on disk.

The file system chooses to do the least-bad option, #2, and updates the MFT record to indicate where the new clusters will be on disk.
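To get a feel for the scale involved, a quick back-of-envelope count of clusters for the illustrative array used above:

    bytesPerDouble = 8;
    fileBytes      = 1e6 * 100 * bytesPerDouble;      % 800 MB of raw array data
    clusterBytes   = 4 * 1024;                        % default NTFS cluster size
    nClusters      = ceil(fileBytes / clusterBytes)   % ~195,313 clusters to keep contiguous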

Now, the hard disk needs to physically move the read head in order to read or write the new clusters, and this is a (relatively) slow process. In terms of moving the head and waiting for the right area of disk to spin underneath it, you're likely to be looking at a seek time of about 10 ms. So every time you hit a fragment, there is an additional ~10 ms delay whilst the HDD moves to access the new data. SSDs have much shorter access times (no moving parts). For the sake of simplicity, we're ignoring multi-platter systems and RAID arrays!
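As a rough illustration of what that costs (the fragment count here is an assumption, not a measurement):

    seekTime   = 10e-3;                  % ~10 ms per seek, as above
    nFragments = 500;                    % assumed fragmentation of a grown file
    extraDelay = nFragments * seekTime   % 5 seconds of pure head movement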

If you keep growing the file at different times, then you may experience a lot of fragmentation. This really depends on when and by how much the file grows, and on how else you are using the hard disk. The performance hit you experience will also depend on how often you read the file, and how frequently you encounter the fragments.

MATLAB stores data in column-major order, and from the comments it seems that you're interested in performing column-wise operations (sums, averages) on the dataset. If the columns become non-contiguous on disk then you're going to hit lots of fragments on every operation!
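A sketch of such a column-wise pass over the matfile (names as in the earlier sketches; only one column is ever held in memory at a time):

    m = matfile('bigArray.mat');
    [~, nCols] = size(m, 'A');            % query dimensions without reading the data
    colMeans = zeros(1, nCols);
    for k = 1:nCols
        colMeans(k) = mean(m.A(:, k));    % one contiguous read per column, if unfragmented
    end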

As mentioned in the comments, both read and write actions will be performed via a buffer. As @user3666197 points out, the OS can speculatively read ahead of the current data on disk, on the basis that you're likely to want that data next. This behaviour is especially useful if the hard disk would otherwise be sitting idle at times: keeping it operating at maximum capacity and working with small parts of the data in buffer memory can greatly improve read and write performance. However, from your question it sounds as though you want to perform large operations on a huge (too big for memory) .mat file. Given your use-case, the hard disk is going to be working at capacity anyway, and the data file is too big to fit in the buffer, so these particular tricks won't solve your problem.

So... yes, you should pre-allocate. Yes, a performance hit from growing the array on disk will apply. Yes, it will probably be significant (it depends on specifics like the amount of growth, fragmentation, etc.). And if you really want to get into the HPC spirit of things, then stop what you're doing, throw away MATLAB, shard your data and try something like Apache Spark! But that's another story.

Does that answer your question?

P.S. Corrections / amendments welcome! I was brought up on POSIX inodes, so sincere apologies if there are any inaccuracies in here...
