Optimizing parallel processing of many files

Question

I have a piece of a program that processes a lot of files. For each file, two things need to be done: first, some part of the file is read and processed, and then the resulting MyFileData gets stored. The first part can be parallelized, the second cannot.

Doing everything sequentially is very slow: the CPU has to wait for the disk, then it works a bit, then it issues another request and waits again...

So I did the following:

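class MyCallable implements Callable<MyFileData> {
    private final File file;

    MyCallable(File file) {
        this.file = file;
    }

    @Override
    public MyFileData call() {
        // The parallelizable part: read and process a piece of the file.
        return someSlowOperation(file);
    }
}

// Submit everything first, then consume the results in submission order.
List<Future<MyFileData>> futures = new ArrayList<>();
for (File f : files) futures.add(executorService.submit(new MyCallable(f)));
for (Future<MyFileData> f : futures) sequentialOperation(f.get());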

This helped a lot. However, I'd like to improve two things:


  • sequentialOperation gets executed in a fixed order instead of processing whichever result is available first. How can I change that?

  • There are thousands of files to process, and starting thousands of disk requests at once could lead to disk thrashing. By using Executors.newFixedThreadPool(10) I've limited this number, but I'm looking for something better. Ideally it should be self-tuning, so that it works optimally on different computers (e.g., issuing more requests when RAID and/or NCQ is available). I don't think it could be based on detecting the hardware configuration, but measuring the processing speed and optimizing based on that should somehow be possible. Any ideas?

Recommended answer


"The sequentialOperation gets executed in a fixed order instead of processing whatever result is available first. How can I change it?"

That's exactly what a CompletionService does: it runs the tasks in parallel and hands back each result as soon as its task completes, regardless of the submission order.

A simplified (untested) example:

int NUM_THREADS = Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(NUM_THREADS);
CompletionService<MyFileData> completionService = new ExecutorCompletionService<>(executor);

// The list is only used to know how many results to wait for.
List<Future<MyFileData>> futures = new ArrayList<>();
for (File f : files) futures.add(completionService.submit(new MyCallable(f)));

for (int i = 0; i < futures.size(); i++) {
    // take() blocks until some task has finished, whatever its submission order.
    Future<MyFileData> next = completionService.take();
    sequentialOperation(next.get());
}
executor.shutdown();
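
For something that can be run as-is, here is a minimal self-contained sketch of the same idea. The class name CompletionOrderSketch and the helpers processAsCompleted and delayed are made up for illustration, and the sleeping tasks are only stand-ins for someSlowOperation; the Consumer plays the role of sequentialOperation and is always called on the submitting thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

class CompletionOrderSketch {

    // Runs the tasks on a fixed pool and passes each result to the consumer
    // on the calling thread, in the order in which the tasks complete.
    static <T> void processAsCompleted(List<Callable<T>> tasks, int threads,
                                       Consumer<T> sequentialOperation)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            CompletionService<T> completionService = new ExecutorCompletionService<>(executor);
            for (Callable<T> task : tasks) {
                completionService.submit(task);
            }
            // Exactly one take() per submitted task; take() blocks until a task is done.
            for (int i = 0; i < tasks.size(); i++) {
                sequentialOperation.accept(completionService.take().get());
            }
        } finally {
            executor.shutdownNow();
        }
    }

    // Hypothetical stand-in for a slow task: sleeps, then returns its name.
    static Callable<String> delayed(String name, long millis) {
        return () -> {
            Thread.sleep(millis);
            return name;
        };
    }

    public static void main(String[] args) throws Exception {
        List<Callable<String>> tasks = new ArrayList<>();
        tasks.add(delayed("slow", 300));
        tasks.add(delayed("fast", 50));
        tasks.add(delayed("medium", 150));

        List<String> seen = new ArrayList<>();
        processAsCompleted(tasks, 3, seen::add);
        // Results arrive in completion order, which need not match submission order.
        System.out.println(seen);
    }
}
```

Because the consumer only ever runs on the calling thread, the sequential step needs no extra synchronization.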




"There are thousands of files to be processed and starting thousands of disk requests could lead to disk thrashing. By using Executors.newFixedThreadPool(10) I've limited this number, however I'm looking for something better."

I'm not 100% sure about that one. I suppose it depends on how many disks you have, but I would have thought that the disk access part should not be split across too many threads (one thread per disk would probably be sensible): if many threads access one disk at the same time, it will spend more time seeking than reading.
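
One way to act on that suggestion is to give the disk reads their own single-threaded executor and hand the bytes to a separate CPU pool. The following is only a sketch under assumptions not stated in the question: all files live on one disk, each file fits in memory, and the work splits cleanly into a read step and a process step. The names SplitIoAndCpuSketch, read, process and processAll are hypothetical, and the byte sum stands in for the real processing. It is not self-tuning, and nothing here limits how many file contents may be buffered while waiting for the CPU pool.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SplitIoAndCpuSketch {

    // Disk-bound step: runs only on the single I/O thread.
    static byte[] read(Path path) {
        try {
            return Files.readAllBytes(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical stand-in for the CPU-bound part of the work: sums the bytes.
    static int process(byte[] content) {
        int sum = 0;
        for (byte b : content) {
            sum += b;
        }
        return sum;
    }

    static List<Integer> processAll(List<Path> files) {
        // Assumption: one disk, hence one reader thread.
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        ExecutorService cpuExecutor =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<CompletableFuture<Integer>> futures = new ArrayList<>();
            for (Path file : files) {
                futures.add(CompletableFuture
                        .supplyAsync(() -> read(file), ioExecutor)
                        .thenApplyAsync(SplitIoAndCpuSketch::process, cpuExecutor));
            }
            List<Integer> results = new ArrayList<>();
            for (CompletableFuture<Integer> future : futures) {
                // The non-parallelizable storing step would go here, on the calling thread.
                results.add(future.join());
            }
            return results;
        } finally {
            ioExecutor.shutdown();
            cpuExecutor.shutdown();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sketch", ".bin");
        Files.write(file, new byte[] {1, 2, 3});
        System.out.println(processAll(Arrays.asList(file)));
        Files.delete(file);
    }
}
```

With several disks, the same structure could use one single-threaded executor per disk; whether that actually beats a small fixed pool is something to measure on the target machine.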

