How should I optimize this filesystem I/O bound program?

Question

    I have a python program that does something like this:

    1. Read a row from a csv file.
    2. Do some transformations on it.
    3. Break it up into the actual rows as they would be written to the database.
    4. Write those rows to individual csv files.
    5. Go back to step 1 unless the file has been totally read.
    6. Run SQL*Loader and load those files into the database.

    Step 6 isn't really taking much time at all. It seems to be step 4 that's taking up most of the time. For the most part, I'd like to optimize this for handling a set of records in the low millions running on a quad-core server with a RAID setup of some kind.
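
    For reference, here's a minimal sketch of the pipeline (the file names and the transform/explode helpers are hypothetical stand-ins, and it collapses step 4's individual files into a single load file for brevity):

        import csv
        import subprocess

        def transform(row):
            # hypothetical stand-in for step 2's transformations
            return row

        def explode(row):
            # hypothetical stand-in for step 3's split into database rows
            return [row]

        with open("input.csv", newline="") as src, \
             open("load_file.csv", "w", newline="") as dst:
            reader, writer = csv.reader(src), csv.writer(dst)
            for row in reader:                          # steps 1 and 5
                for db_row in explode(transform(row)):  # steps 2 and 3
                    writer.writerow(db_row)             # step 4

        # step 6: hand the load file off to SQL*Loader (invocation schematic)
        subprocess.run(["sqlldr", "control=load.ctl"], check=True)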

    There are a few ideas that I have to solve this:

    1. Read the entire file from step one (or at least read it in very large chunks) and write the file to disk as a whole or in very large chunks. The idea being that the hard disk would spend less time going back and forth between files. Would this do anything that buffering wouldn't?
    2. Parallelize steps 1, 2&3, and 4 into separate processes. This would make steps 1, 2, and 3 not have to wait on 4 to complete.
    3. Break the load file up into separate chunks and process them in parallel. The rows don't need to be handled in any sequential order. This would likely need to be combined with idea 2 somehow; a sketch combining the two follows this list.
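
    A rough sketch of what ideas 2 and 3 might look like combined, using a multiprocessing.Pool so each worker handles an independent chunk (the chunk size and file naming are arbitrary, and the per-row transform work is elided):

        import csv
        from multiprocessing import Pool

        def process_chunk(numbered_chunk):
            # Worker: write one load file per chunk so SQL*Loader can pick
            # them all up later; real code would transform each row first.
            chunk_id, rows = numbered_chunk
            with open(f"load_file_{chunk_id}.csv", "w", newline="") as dst:
                writer = csv.writer(dst)
                for row in rows:
                    writer.writerow(row)

        def chunks(reader, size=10000):
            # Yield (chunk_id, rows) batches of at most `size` rows.
            batch, chunk_id = [], 0
            for row in reader:
                batch.append(row)
                if len(batch) == size:
                    yield chunk_id, batch
                    batch, chunk_id = [], chunk_id + 1
            if batch:
                yield chunk_id, batch

        if __name__ == "__main__":
            with open("input.csv", newline="") as src:
                with Pool() as pool:  # defaults to one worker per core
                    pool.map(process_chunk, chunks(csv.reader(src)))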

    Of course, the correct answer to this question is "do what you find to be the fastest by testing." However, I'm mainly trying to get an idea of where I should spend my time first. Does anyone with more experience in these matters have any advice?

Solution

    Python already does IO buffering, and the OS should handle both prefetching the input file and delaying writes until it needs the RAM for something else or just gets uneasy about having dirty data sitting in RAM for too long. That only changes if you force the OS to write immediately, for example by closing the file after each write or by opening the file in O_SYNC mode.
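
    To make the contrast concrete, a small sketch (the file names are illustrative, and os.O_SYNC is POSIX-only):

        import os

        # Normal case: let Python and the OS buffer, flushing lazily.
        out = open("load_file.csv", "w", buffering=1024 * 1024)  # 1MB buffer
        out.write("some,row,data\n")
        out.close()  # the data may only reach the disk around here; that's fine

        # The pattern to avoid: O_SYNC forces every write out to the disk.
        fd = os.open("sync.csv", os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
        os.write(fd, b"some,row,data\n")  # blocks until the data is on disk
        os.close(fd)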

    If the OS isn't doing the right thing, you can try raising the buffer size (the third parameter to open()). For some guidance on appropriate values: on an IO system with 100MB/s of bandwidth and 10ms of latency, a 1MB IO size will result in approximately 50% latency overhead, while a 10MB IO size will cut that to about 9%. If it's still IO bound, you probably just need more bandwidth; use your OS-specific tools to check what kind of bandwidth you are getting to/from the disks.
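
    Those overhead figures are plain arithmetic on the assumed numbers, a fixed 10ms of latency per IO against a 100MB/s transfer rate:

        def latency_overhead(io_bytes, bandwidth=100e6, latency=0.010):
            """Fraction of each IO spent on fixed latency, not transfer."""
            transfer = io_bytes / bandwidth  # seconds spent moving data
            return latency / (latency + transfer)

        print(latency_overhead(1 * 1024**2))   # ~0.49 -> roughly 50%
        print(latency_overhead(10 * 1024**2))  # ~0.09 -> roughly 9%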

    It's also useful to check whether step 4 spends its time executing or waiting on IO. If it's executing, you'll need to dig into which part is the culprit and optimize it, or split the work out into separate processes.
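
    One quick way to tell the two apart is to compare CPU time with wall-clock time around step 4; a sketch follows (the lambda stands in for the real step-4 call):

        import time

        def timed(step):
            # Report how much of `step`'s wall time was actual CPU work.
            wall0, cpu0 = time.perf_counter(), time.process_time()
            step()
            wall = time.perf_counter() - wall0
            cpu = time.process_time() - cpu0
            print(f"wall {wall:.2f}s, cpu {cpu:.2f}s "
                  f"({cpu / wall:.0%} executing, the rest waiting)")

        # A step that mostly sleeps (i.e. waits) shows a tiny CPU share:
        timed(lambda: time.sleep(0.5))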
