Is it a good idea to read multiple files at the same time?


Question

One of our company's servers has 32 CPUs, and we have 1000+ very large files to process. I'm not sure whether it is a good idea to read 32 files at the same time so that all cores can perform independent calculations concurrently. Could anyone briefly explain how a hard disk works? If I read 32 files at the same time, would that slow down the reading speed? Thanks!

Recommended answer

The hard disk is traditionally a mechanical data storage device. I'm assuming the server uses mechanical disks, and not the newer SSD type, which has no moving parts. I'm also assuming that, with this much data and processing power, more than one hard disk is in use (RAID or NAS). These details can affect performance significantly, and could make much of the following inaccurate.

Hard disks, being mechanical devices, have spinning platters inside, like an old-fashioned record player or a CD. Each platter is coated with a material that can record and play back tiny magnetic pulses. A positionable "read-write" head flies just above the surface of each platter, usually on both sides of it, ready to move across the surface to locate, read, and write these magnetic pulses. Both the spinning and the head movement take time. The more "work" a disk is given to do, the longer it takes to finish, simply because it has to physically locate more microscopic areas on the platter surfaces.

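That said, imagine there are 29 employees assigned to read all 29 volumes of the Encyclopedia Britannica. (The other 3 are supervisors, of course.) Each volume is stored on its own hard disk, so there are 29 hard disks. There are two ways in which the whole thing can be read:

  1. Pick up the first volume and have the employees take turns reading it sequentially, one page at a time, then move on to the next volume, until all volumes are read. The supervisors collect, process, and resequence all of the pages, one volume at a time.
  2. Pick up all 29 volumes simultaneously and read pages in what is, in net effect, an essentially random order, until all volumes are read. The supervisors collect and resequence the pages from 29 random chapters while they are being processed.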

Option #1 seems "antiquated", but an important point about this method is that only one disk is in use at a time; the other 28 are idle. Hard disks are far better at reading data sequentially than randomly, because sequential reading avoids the delays caused by the read-write heads seeking back and forth.

Option #2 would work, and sounds reasonable, but it isn't ideal for two reasons: a) almost none of the reading is sequential, and b) all of the disks are in use at once. This uses more power and puts a bigger demand on the server, which has to run all of those disks concurrently.

So yes, if you try to process 32 huge files simultaneously, that is going to place a tremendous load on the disks, and they will probably slow to a crawl. It is more complicated, but likely a better solution, to have the 32 cores "take turns" on one of those huge files at a time until all of the files are processed. (By "take turns" I mean breaking each file up into smaller, more manageable chunks.) Again, the goal is to make the disks read as sequentially as possible and to avoid random seeking back and forth.
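To get a feel for why seeking dominates, here is a rough back-of-the-envelope model in Python. Every number in it is an illustrative assumption, not a measurement of the asker's server: a single mechanical disk, 10 ms per seek, 150 MB/s sustained transfer, 32 files of 1 GB each, and a worst case in which 32 concurrent readers force a seek after every 64 KiB block. The function name `estimated_read_seconds` is hypothetical.

```python
def estimated_read_seconds(total_bytes, bytes_per_seek,
                           seek_seconds=0.010, bytes_per_second=150e6):
    """Crude model: pure transfer time plus one seek per contiguous run.

    seek_seconds and bytes_per_second are assumed figures for a
    mechanical disk; real drives, caches, and I/O schedulers differ.
    """
    transfer = total_bytes / bytes_per_second
    seeks = total_bytes / bytes_per_seek
    return transfer + seeks * seek_seconds


GB = 10**9
TOTAL = 32 * GB  # assumed: 32 files of 1 GB each

# One file at a time: roughly one seek per file, then a long sequential run.
one_at_a_time = estimated_read_seconds(TOTAL, bytes_per_seek=1 * GB)

# 32 files interleaved: assume the head jumps after every 64 KiB block.
interleaved = estimated_read_seconds(TOTAL, bytes_per_seek=64 * 1024)

print(f"one file at a time: {one_at_a_time:8.0f} s")
print(f"32 files at once:   {interleaved:8.0f} s")
```

Under these assumptions the interleaved case comes out more than twenty times slower. In practice, operating-system read-ahead and the I/O scheduler soften the penalty, and RAID or SSD storage changes the picture entirely, so treat this as an illustration of the trend rather than a prediction.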

Software to accomplish this must be multi-threaded: the user starts just one program, and it creates 31 new "worker threads" for the other CPU cores. The main program reads the data sequentially and splits the incoming data into chunks for the other threads (cores) to process. The workers then "take turns" crunching small pieces of the data file until it is completely processed.
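A minimal sketch of that structure in Python follows. It is one possible arrangement, not the only one: the names `read_chunks`, `process_chunk`, and `process_stream`, the 4 MiB chunk size, and the byte-counting placeholder computation are all hypothetical, and it assumes the data can be cut at arbitrary byte boundaries (record-oriented files would need boundary handling).

```python
import io
from collections import deque
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4 * 1024 * 1024  # assumed chunk size; tune for the real workload


def read_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield consecutive chunks of a binary stream, front to back."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


def process_chunk(chunk):
    """Placeholder for the real per-chunk computation (here: count bytes)."""
    return len(chunk)


def process_stream(stream, workers=31, chunk_size=CHUNK_SIZE):
    """A single reader (the calling thread) feeds chunks to a worker pool.

    Results come back in the order the chunks were read.
    """
    max_pending = 2 * workers  # cap on chunks held in memory at once
    results = []
    pending = deque()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in read_chunks(stream, chunk_size):
            pending.append(pool.submit(process_chunk, chunk))
            # Do not read further ahead until the oldest chunk is done.
            if len(pending) >= max_pending:
                results.append(pending.popleft().result())
        while pending:
            results.append(pending.popleft().result())
    return results


# Small in-memory demonstration; a real run would pass open(path, "rb").
demo = process_stream(io.BytesIO(b"x" * 10), workers=4, chunk_size=4)
print(demo)
```

Only the calling thread touches the file, so the disk sees a single sequential stream no matter how many workers there are. One caveat: in CPython, threads do not run pure-Python CPU-bound code in parallel because of the global interpreter lock, so for heavy computation `ProcessPoolExecutor`, which offers the same `submit` interface, is likely the better fit.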
