Use sendfile() to copy a file with threads, or another efficient file-copy method
Question
I'm trying to use the Linux system call sendfile()
to copy a file using threads.
I'm interested in optimizing these parts of the code:
fseek(fin, size * (number) / MAX_THREADS, SEEK_SET);
fseek(fout, size * (number) / MAX_THREADS, SEEK_SET);
/* ... */
fwrite(buff, 1, len, fout);
Code:
void* FileOperate::FileCpThread::threadCp(void *param)
{
    Info *ft = (Info *)param;

    FILE *fin  = fopen(ft->fromfile, "r+");
    FILE *fout = fopen(ft->tofile, "w+");

    int size   = getFileSize(ft->fromfile);
    int number = ft->num;

    /* Each thread seeks to the start of its own slice of the file. */
    fseek(fin,  size * (number) / MAX_THREADS, SEEK_SET);
    fseek(fout, size * (number) / MAX_THREADS, SEEK_SET);

    char buff[1024] = {'\0'};
    int len = 0;
    int total = 0;

    while ((len = fread(buff, 1, sizeof(buff), fin)) > 0)
    {
        fwrite(buff, 1, len, fout);
        total += len;
        if (total > size / MAX_THREADS)
        {
            break;  /* this thread's slice is done */
        }
    }

    fclose(fin);
    fclose(fout);
    return nullptr;  /* a void* thread routine must return a value */
}
Answer
File copying is not CPU bound; if it were, you'd likely find that the limitation is at the kernel level, and nothing you can do at the user level would parallelize it.
Such "improvements" done on mechanical drives will in fact degrade throughput. You're wasting time seeking along the file instead of reading and writing it.
If the file is long and you don't expect to need the read or written data anytime soon, it might be tempting to use the O_DIRECT
flag on open. That's a bad idea, since the O_DIRECT
API is essentially broken by design.
Instead, you should use posix_fadvise
on both the source and destination files, with the POSIX_FADV_SEQUENTIAL and POSIX_FADV_NOREUSE flags. After the write (or sendfile) call has finished, advise that the data is no longer needed by passing POSIX_FADV_DONTNEED. That way the page cache will only be used to the extent needed to keep the data flowing, and the pages will be recycled as soon as the data has been consumed (written to disk).
sendfile
will not push the file data through user space, so it further relaxes some of the pressure on memory and the processor cache. That's about the only other sensible, non-device-specific improvement you can make for copying files.
Choosing a sensible chunk size is also desirable. Given that modern drives push over 100 MB/s, you might want to push a megabyte at a time, and always a multiple of the 4096-byte page size; thus (4096*256)
is a decent starting chunk size to handle in a single sendfile
or read
/write
call.
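Putting the pieces together, a single-threaded kernel-side copy with 1 MiB (4096*256) chunks might look like the sketch below. The function name copy_sendfile is illustrative, and error handling is deliberately minimal:

```cpp
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cassert>

// Copy src to dst using sendfile(); the data never enters user space.
// Returns the number of bytes copied, or -1 on error.
long long copy_sendfile(const char *src, const char *dst)
{
    const size_t CHUNK = 4096 * 256;  // 1 MiB, a multiple of the page size

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;

    struct stat st;
    if (fstat(in, &st) < 0) { close(in); return -1; }

    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, st.st_mode & 0777);
    if (out < 0) { close(in); return -1; }

    long long total = 0;
    off_t offset = 0;  // sendfile() advances this for us
    while (offset < st.st_size) {
        ssize_t n = sendfile(out, in, &offset, CHUNK);
        if (n <= 0) { total = -1; break; }
        total += n;
    }

    close(in);
    close(out);
    return total;
}
```

The posix_fadvise calls described earlier would slot in right after each sendfile() returns, using the just-copied offset range.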
Read parallelization, as you propose it, only makes sense on RAID 0 volumes, and only when both the input and output files straddle the physical disks. You can then have one thread per the lesser of the number of source and destination volume physical disks straddled by the file. That's only necessary if you're not using asynchronous file I/O. With async I/O you wouldn't need more than one thread anyway, especially if the chunk sizes are large (a megabyte or more) and the single-thread latency penalty is negligible.
There's no sense in parallelizing a single file copy on SSDs, unless you're on some very odd system indeed.