Is it possible to use threads to speed up file reading?


Question

I want to read a file (40k lines) as fast as possible.

Edit: Andres Jaan Tack suggested a solution based on one thread per file, and I want to be sure I understood it (and that it is therefore the fastest way):



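  • One thread computes the linear combination of each cell read by the input threads and stores the result in the output container (associated with the output file).

  • One thread writes the contents of the output container in blocks (every 4 kB of data, roughly 10 lines).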

Should I deduce that I must not use memory-mapped (mmap) files, because the program would sit on standby waiting for the data?

Thanks in advance.

Regards,

Mr. mystère

Recommended answer

Your question got a little deeper when you asked further, so I'll try to cover all your options.

Use one thread.

If you read straight through a file, front to back, from a single thread, the operating system will not fetch the file in small chunks the way you might expect. Instead, it prefetches the file ahead of you in huge (exponentially growing) chunks, so you almost never pay a penalty for going to disk. You might wait for the disk a handful of times, but in general it will be as if the file were already in memory, and this holds whether or not you use mmap.

The OS is very good at this kind of sequential file reading because it is predictable. When you read one file from multiple threads, you are essentially reading at random, which is far less predictable. Prefetchers tend to be much less effective with random reads, so in this case the extra threads would probably make the whole application slower instead of faster.
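As a rough illustration of the single-thread approach, here is a minimal Python sketch that reads a file front to back in large chunks and counts its lines. The function names `count_lines_sequential` and `make_sample_file`, the 1 MiB chunk size, and the sample record layout are assumptions made for this example; they do not come from the question.

```python
import os
import tempfile


def count_lines_sequential(path, chunk_size=1 << 20):
    """Read the file front to back from one thread, one chunk at a time.

    Sequential access like this is the pattern OS readahead is tuned for.
    The chunk size is an assumption for illustration, not a tuned value.
    """
    lines = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            lines += chunk.count(b"\n")
    return lines


def make_sample_file(n_lines=40_000):
    """Create a throwaway file with n_lines lines (hypothetical test data)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as f:
        for i in range(n_lines):
            f.write(b"%d,%d,%d\n" % (i, i + 1, i + 2))
    return path
```

Calling `count_lines_sequential(make_sample_file())` walks the whole file with a single reader. Timing it against a multi-threaded variant on your own data is the only reliable way to confirm which is faster on a given machine.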

Note: this is even before you add the cost of setting up the threads and everything that goes with them. That costs something too, but it is basically nothing compared with the cost of more blocking disk accesses.

Use as many threads as you have files (or some reasonable number).

File prefetching is done separately for each open file. Once you start reading multiple files, you should read from several of them in parallel. This works because the disk I/O scheduler will try to figure out the fastest order in which to read all of them. Often there is a disk scheduler both in the OS and on the hard drive itself. Meanwhile, the prefetcher can still do its job.

Reading several files in parallel is almost always better than reading them one by one. If you read them one at a time, your disk would idle between prefetches, and that is valuable time that could be spent reading more data into memory. The only way you can go wrong is if you have too little RAM to support many open files, which is no longer common.

A word of caution: if you are overzealous with your multiple file reads, reading one file will start kicking bits of other files out of memory, and you are back to a random-read situation.
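The one-reader-per-file idea, with a cap on concurrency to respect the caution above, might look like the following sketch. The names `read_whole_file` and `read_files_in_parallel` and the default of four workers are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def read_whole_file(path, chunk_size=1 << 20):
    """Read one file sequentially and return its size in bytes."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def read_files_in_parallel(paths, max_workers=4):
    """Give each file its own sequential reader, but cap the worker count.

    The cap (an assumed default of 4) keeps too many concurrent reads
    from evicting each other's prefetched data.
    """
    workers = max(1, min(max_workers, len(paths)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sizes = list(pool.map(read_whole_file, paths))
    return dict(zip(paths, sizes))
```

Each file is still read front to back by exactly one thread, so the per-file prefetching described above keeps working. Only the number of files in flight changes.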

Processing and producing output from multiple threads might work, but it depends on how you need to combine the results. In any case, you will have to be careful about how you synchronize the threads, though there are surely some relatively easy lock-free ways to do that.

One thing to watch out for, though: don't write the file in small (< 4K) blocks. Collect at least 4K of data before each call to write(). Also, since the kernel will lock the file while you write to it, don't call write() from all of your threads at once; they will all wait for each other instead of processing more data.
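One possible way to follow both pieces of advice is to have the processing threads hand their results to a queue, and let a single writer thread batch them into writes of at least 4K. The sketch below is an assumption-laden illustration: `process` is a hypothetical stand-in for the real computation, and the queue size and worker count are arbitrary. Note also that in CPython the GIL limits CPU-bound parallelism, so this shows the structure rather than a guaranteed speedup.

```python
import os
import queue
import threading

MIN_WRITE = 4096  # flush threshold in bytes, following the 4K advice
_DONE = object()  # sentinel telling the writer thread to stop


def _write_all(fd, data):
    """Call os.write() until every byte of data has been written."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def writer_loop(fd, q, min_write=MIN_WRITE):
    """The only thread that writes; returns how many batches it flushed."""
    buf = bytearray()
    batches = 0
    while True:
        item = q.get()
        if item is _DONE:
            break
        buf += item
        if len(buf) >= min_write:
            _write_all(fd, bytes(buf))
            batches += 1
            buf.clear()
    if buf:  # flush whatever is left, even if it is under 4K
        _write_all(fd, bytes(buf))
        batches += 1
    return batches


def process(record):
    """Hypothetical stand-in for the real per-record computation."""
    return b"%d\n" % (record * 2)


def worker(records, q):
    for r in records:
        q.put(process(r))


def run_pipeline(out_path, records, n_workers=4):
    """Process records in n_workers threads; write from one thread only."""
    q = queue.Queue(maxsize=1024)
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_BINARY", 0)
    fd = os.open(out_path, flags, 0o644)
    stats = {}

    def run_writer():
        stats["batches"] = writer_loop(fd, q)

    wt = threading.Thread(target=run_writer)
    wt.start()
    workers = [
        threading.Thread(target=worker, args=(records[i::n_workers], q))
        for i in range(n_workers)
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    q.put(_DONE)  # all producers are finished, so the writer can drain and exit
    wt.join()
    os.close(fd)
    return stats["batches"]
```

Because several workers feed one queue, the order of lines in the output is not deterministic. If the output order matters, each record would need to carry an index so the writer can reorder, which this sketch does not attempt.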
