Parallel Disk I/O


Problem Description

I have several logfiles that I would like to read. Without loss of generality, let's say the logfile processing is done as follows:

def process(infilepath):
    answer = 0
    with open(infilepath) as infile:
        for line in infile:
            if line.startswith(someStr):  # someStr: the prefix of interest, defined elsewhere
                answer += 1
    return answer

Since I have a lot of logfiles, I wanted to throw multiprocessing at this problem (my first mistake: I should probably have used multithreading; someone please tell me why).
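
A minimal sketch of what that multiprocessing attempt might have looked like (the file names are illustrative; Pool() defaults to one worker per CPU core, and process is the function above):

from multiprocessing import Pool

if __name__ == '__main__':
    logfiles = ['app1.log', 'app2.log', 'app3.log']  # hypothetical paths
    with Pool() as pool:                     # one worker per core by default
        answers = pool.map(process, logfiles)  # one file per worker task
    print(sum(answers))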

While doing so, it occurred to me that any form of parallel processing should be effectively useless here, since I'm constrained by the fact that there is only one read head on my HDD, and therefore only one file may be read at a time. In fact, under this reasoning, since lines from different files may be requested simultaneously, the read head may need to move significantly from time to time, causing the multiprocessing approach to be slower than a serial approach. So I decided to go back to a single process to read my logfiles.

Interestingly though, I noticed that I did get a speedup with small files (<= 40KB), and that it was only with large files (>= 445MB) that the expected slow-down was noticed.

This leads me to believe that Python may read files in chunks whose size exceeds the single line I request at a time.

Q1: So what is the file-reading mechanism under the hood?

Q2: What is the best way to optimize the reading of files from a conventional HDD?

Technical specs:

  • Python 3.3
  • 5400 rpm conventional HDD
  • Mac OS X 10.9.2 (Mavericks)

Solution

The observed behavior is a result of:

  1. BufferedIO
  2. a scheduling algorithm that decides the order in which the requisite sectors of the HDD are read

BufferedIO

Depending on the OS and the read block size, it is possible for the entire file to fit into one block, which is then read in a single read command. This is why the smaller files are read more easily.
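
This is straightforward to observe from Python itself: open() returns a TextIOWrapper over a BufferedReader, and io.DEFAULT_BUFFER_SIZE gives the block size used for raw reads (a quick check; the file path is hypothetical, and the buffer size is platform-dependent, commonly 8192 bytes):

import io

print(io.DEFAULT_BUFFER_SIZE)    # block size used by open(); commonly 8192 bytes
with open('example.log') as f:   # hypothetical file
    print(f.buffer)              # the BufferedReader beneath the TextIOWrapper
    print(f.buffer.raw)          # the raw, unbuffered FileIO underneath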

Scheduling Algorithm

Larger files (filesize > read block size) have to be read in block-size chunks. Thus, when a read is requested on each of several files (due to the multiprocessing), the needle has to move to different sectors of the HDD, corresponding to where the files live. This repetitive movement does two things:

  1. increases the time between successive reads of the same file
  2. throws off the read-sector predictor, as a file may span multiple sectors

The time between successive reads of the same file matters: if the computation performed on a chunk of lines completes before the read head can provide the next chunk of lines from the same file, the process simply waits until another chunk becomes available. This is one source of slowdown.

Throwing off the read predictor is bad for much the same reason that throwing off the branch predictor is bad: the speculative work (read-ahead in one case, speculatively executed instructions in the other) is wasted whenever the prediction misses.

With the combined effect of these two issues, processing many large files in parallel would be slower than processing them serially. Of course, this is especially true when processing blockSize many lines finishes before numProcesses * blockSize many lines can be read off the HDD.
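
One way to act on this reasoning (a sketch, not something prescribed by the answer): keep the disk reads sequential in the parent process, so the head services one file at a time, and hand only the CPU-bound matching to a pool of workers. The names chunked, count_in_chunk, prefix, and chunk_size below are all illustrative:

from functools import partial
from itertools import islice
from multiprocessing import Pool

def count_in_chunk(prefix, lines):
    # CPU-bound part: count lines starting with the given prefix
    return sum(1 for line in lines if line.startswith(prefix))

def chunked(infile, n):
    # Yield successive lists of up to n lines, read sequentially from one file
    while True:
        block = list(islice(infile, n))
        if not block:
            return
        yield block

def process_all(filepaths, prefix, chunk_size=100000):
    total = 0
    with Pool() as pool:
        for path in filepaths:  # one file at a time: few long seeks
            with open(path) as infile:
                counts = pool.imap(partial(count_in_chunk, prefix),
                                   chunked(infile, chunk_size))
                total += sum(counts)
    return total

Whether this actually wins depends on how expensive the per-line work is relative to the read itself; if the matching is as cheap as a startswith test, the plain serial version may still be fastest.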
