Is parallel file writing efficient?

Question

I would like to know if parallel file writing is efficient. After all, a hard disk has only one usable read/write head at a time, so it can handle only one task at a time. But the tests below (in Python) contradict what I expected:

The file to copy is around 1 GB.

Script 1 (// task: read and write the same file line by line, 10 times, in parallel):

#!/usr/bin/env python
from multiprocessing import Pool
def read_and_write( copy_filename ):
    with open( "/env/cns/bigtmp1/ERR000916_2.fastq", "r") as fori:
        with open( "/env/cns/bigtmp1/{}.fastq".format( copy_filename) , "w" ) as fout:
            for line in fori:
                fout.write( line )  # "line" already ends with a newline; appending "\n" would double it
    return copy_filename

def main():
    f_names = [ "test_jm_{}".format(i) for i in range( 0, 10 ) ]
    pool = Pool(processes=4)
    results = pool.map( read_and_write, f_names )

if __name__ == "__main__":
    main()

Script 2 (task: read and write the same file line by line, 10 times, sequentially):

#!/usr/bin/env python
def read_and_write( copy_filename ):
    with open( "/env/cns/bigtmp1/ERR000916_2.fastq", "r") as fori:
        with open( "/env/cns/bigtmp1/{}.fastq".format( copy_filename) , "w" ) as fout:
            for line in fori:
                fout.write( line )  # "line" already ends with a newline; appending "\n" would double it
    return copy_filename

def main():
    f_names = [ "test_jm_{}".format(i) for i in range( 0, 10 ) ]
    for n in f_names:
        result = read_and_write( n )

if __name__ == "__main__":
    main()

Script 3 (// task: copy the same file 10 times, in parallel):

#!/usr/bin/env python
from shutil import copyfile
from multiprocessing import Pool
def read_and_write( copy_filename ):
    copyfile( "/env/cns/bigtmp1/ERR000916_2.fastq", "/env/cns/bigtmp1/{}.fastq".format( copy_filename) )
    return copy_filename

def main():
    f_names = [ "test_jm_{}".format(i) for i in range( 0, 10 ) ]
    pool = Pool(processes=4)
    results = pool.map( read_and_write, f_names )

if __name__ == "__main__":
    main()

Script 4 (task: copy the same file 10 times, sequentially):

#!/usr/bin/env python
from shutil import copyfile
def read_and_write( copy_filename ):
    copyfile( "/env/cns/bigtmp1/ERR000916_2.fastq", "/env/cns/bigtmp1/{}.fastq".format( copy_filename) )
    return copy_filename

def main():
    f_names = [ "test_jm_{}".format(i) for i in range( 0, 10 ) ]
    for n in f_names:
        result = read_and_write( n )

if __name__ == "__main__":
    main()

Results:

$ # // task to read and write line by line 10 times a same file
$ time python read_write_1.py

real    1m46.484s
user    3m40.865s
sys 0m29.455s

$ rm test_jm*
$ # task to read and write line by line 10 times a same file
$ time python read_write_2.py

real    4m16.530s
user    3m41.303s
sys 0m24.032s

$ rm test_jm*
$ # // task to copy 10 times a same file
$ time python read_write_3.py

real    1m35.890s
user    0m10.615s
sys 0m36.361s


$ rm test_jm*
$ # task to copy 10 times a same file
$ time python read_write_4.py

real    1m40.660s
user    0m7.322s
sys 0m25.020s
$ rm test_jm*

These basic results seem to show that parallel (//) I/O reads and writes are more efficient.

Thanks for any insight.

Answer

I would like to know if parallel file writing is efficient.

Short answer: physically writing to the same disk from multiple threads at the same time will never be faster than writing to that disk from one thread (talking about normal hard disks here). In some cases it can even be a lot slower.

But, as always, it depends on a lot of factors:

  • OS disk caching: writes are usually kept in cache by the OS, and then written to the disk in chunks. So multiple threads can write to that cache simultaneously without a problem, and have a speed advantage doing so. Especially if the processing / preparing of the data takes longer than the writing speed of the disk.

In some cases, even when writing directly to the physical disk from multiple threads, the OS will optimize this and only write large blocks to each file.

In the worst-case scenario, however, smaller blocks could be written to disk each time, resulting in a hard-disk seek (± 10 ms on a normal HDD!) on every file switch (doing the same on an SSD wouldn't be as bad, because access is more direct and no seeks are needed).

So, in general, when writing to disk from multiple threads simultaneously, it might be a good idea to prepare (some) data in memory, and write the final data to disk in larger blocks using some kind of lock, or perhaps from one dedicated writer-thread. If the files are growing while being written to (i.e. no file size is set up front), writing the data in larger blocks could also prevent disk fragmentation (at least as much as possible).
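As an illustration of the dedicated writer-thread idea, here is a minimal Python sketch: workers enqueue prepared data, and a single thread batches it into large blocks before touching the disk. The output path `/tmp/out.bin`, the 8 MiB chunk size, and the 1 MiB payloads are arbitrary choices for this example:

```python
import threading
import queue

def writer_thread(out_path, q, chunk_size=8 * 1024 * 1024):
    """Single dedicated writer: drains the queue and writes large blocks."""
    buffer = bytearray()
    with open(out_path, "wb") as fout:
        while True:
            item = q.get()
            if item is None:                  # sentinel: flush what is left and stop
                if buffer:
                    fout.write(buffer)
                break
            buffer.extend(item)
            if len(buffer) >= chunk_size:     # only write once a big block is ready
                fout.write(buffer)
                buffer.clear()

q = queue.Queue(maxsize=16)
t = threading.Thread(target=writer_thread, args=("/tmp/out.bin", q))
t.start()
for _ in range(4):
    q.put(b"x" * (1024 * 1024))   # stand-in for data prepared by worker threads
q.put(None)                       # tell the writer we are done
t.join()
```

This way only one thread ever issues writes, so the disk sees a few large sequential requests instead of many interleaved small ones.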

On some systems there might be no difference at all, but on others it can make a big difference, and become a lot slower (or even on the same system with different hard disks).

To have a good test of the differences in writing speeds using a single thread vs multiple threads, total file sizes would have to be bigger than the available memory - or at least all buffers should be flushed to disk before measuring the end time. Measuring only the time it takes to write the data to the OS disk cache wouldn't make much sense here.
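For example, in Python a timing that includes flushing both the user-space buffer and (a request to flush) the OS cache could look like this. A sketch only; the path `/tmp/bench.bin` and the 64 MiB payload are arbitrary:

```python
import os
import time

def timed_write(path, data):
    """Time a write including a flush of Python's buffer and an fsync."""
    start = time.perf_counter()
    with open(path, "wb") as fout:
        fout.write(data)
        fout.flush()                 # flush Python's user-space buffer
        os.fsync(fout.fileno())      # ask the OS to push its cache to the device
    return time.perf_counter() - start

elapsed = timed_write("/tmp/bench.bin", b"\0" * (64 * 1024 * 1024))
print("wrote 64 MiB in {:.3f}s".format(elapsed))
```

Without `flush()` and `os.fsync()`, the timer may stop while most of the data still sits in the OS disk cache, which is exactly the measurement error described above.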

Ideally, the total time measured to write all data to disk should equal the physical hard disk writing speed. If writing to disk using one thread is slower than the disk write speed (which means processing of the data takes longer than writing it), obviously using more threads will speed things up. If writing from multiple threads becomes slower than the disk write speed, time will be lost in disk seeks caused by switching between the different files (or different blocks inside the same big file).

To get an idea of the loss in time when performing lots of disk seeks, let's look at some numbers:

Say we have an HDD with a write speed of 50 MB/s:

  • Writing one contiguous block of 50MB would take 1 second (in ideal circumstances).

Doing the same in blocks of 1 MB, with a file switch and the resulting disk seek in between, would give: 20 ms to write 1 MB + 10 ms seek time. Writing 50 MB would then take 1.5 seconds. That's a 50% increase in time, just for a quick seek in between (the same holds for reading from disk; the difference would be even bigger, considering the faster read speed).
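The arithmetic behind those numbers can be checked directly:

```python
write_speed_mb_s = 50     # HDD sequential write speed (MB/s), as above
seek_ms = 10              # one seek per file switch
block_mb = 1
total_mb = 50

ms_per_mb = 1000 // write_speed_mb_s                  # 20 ms to write 1 MB
contiguous_ms = total_mb * ms_per_mb                  # 1000 ms, no seeks
per_block_ms = block_mb * ms_per_mb + seek_ms         # 20 ms write + 10 ms seek
with_seeks_ms = (total_mb // block_mb) * per_block_ms # 50 blocks

print(with_seeks_ms)                                  # 1500 ms = 1.5 s
print(100 * (with_seeks_ms - contiguous_ms) // contiguous_ms)  # 50 (% increase)
```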

In reality it will be somewhere in between, depending on the system.

While we could hope the OS takes good care of all that (or by using IOCP, for example), this isn't always the case.
