Distributed Processing of Volumetric Image Data

Problem Description

For the development of an object recognition algorithm, I need to repeatedly run a detection program on a large set of volumetric image files (MR scans). The detection program is a command-line tool. Run on my local computer on a single file, single-threaded, it takes about 10 seconds; the results are written to a text file. A typical run would be:

  • 10,000 images at 300 MB each = 3 TB
  • 10 seconds per image on a single core = 100,000 seconds ≈ 28 hours

What can I do to get the results faster? I have access to a cluster of 20 servers with 24 (virtual) cores each (Xeon E5, 1 TB disk, CentOS Linux 7.2). In theory, 480 cores should need only about 3.5 minutes for the whole task.

I am considering Hadoop, but it is not designed for processing binary data, and it splits input files, which is not an option for these scans. I probably need some kind of distributed file system; I tested with NFS and the network became a serious bottleneck. Each server should only process its locally stored files. The alternative might be to buy a single high-end workstation and forget about distributed processing.

I am not certain whether we need data locality, i.e. each node holding part of the data on its local disk and processing only that local data.

Solution

You can use Hadoop. Yes, the default implementations of FileInputFormat and RecordReader split files into chunks and chunks into lines, but you can write your own FileInputFormat and RecordReader implementations. I created a custom FileInputFormat for another purpose and had the opposite problem (splitting the input more finely than the default), but there are good-looking recipes for exactly your problem: https://gist.github.com/sritchie/808035 and https://www.timofejew.com/hadoop-streaming-whole-files/
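
To make the idea concrete, here is a rough, untested sketch of what a non-splitting input format could look like against the org.apache.hadoop.mapreduce API. The class names are my own illustration, not code taken from the links above. Also note that buffering a whole 300 MB scan in a BytesWritable only makes sense if the mappers get enough heap; you may prefer to emit just the file path and let the mapper stream the data itself.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Hands each input file to a mapper as a single record instead of splitting it.
    public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;  // never split a scan across mappers
        }

        @Override
        public RecordReader<NullWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
            return new WholeFileRecordReader();
        }

        // Emits exactly one key/value pair: key = nothing, value = the complete file content.
        public static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
            private FileSplit fileSplit;
            private Configuration conf;
            private final BytesWritable value = new BytesWritable();
            private boolean processed = false;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context) {
                this.fileSplit = (FileSplit) split;
                this.conf = context.getConfiguration();
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (processed) {
                    return false;
                }
                byte[] contents = new byte[(int) fileSplit.getLength()];
                Path file = fileSplit.getPath();
                FileSystem fs = file.getFileSystem(conf);
                FSDataInputStream in = null;
                try {
                    in = fs.open(file);
                    IOUtils.readFully(in, contents, 0, contents.length);
                    value.set(contents, 0, contents.length);
                } finally {
                    IOUtils.closeStream(in);
                }
                processed = true;
                return true;
            }

            @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
            @Override public void close() { }
        }
    }

A job would then call job.setInputFormatClass(WholeFileInputFormat.class) and its mapper would receive one record per scan.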

On the other hand, Hadoop is a heavy beast. It has significant overhead for starting each mapper, so the optimal running time for a mapper is a few minutes, and your tasks are too short. Maybe it is possible to create a cleverer FileInputFormat that interprets a bunch of files as a single input and feeds the files as records to the same mapper; I'm not sure.
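
If you do go the batching route, one way to approximate it without writing a new input format is to keep the scans out of Hadoop's input path entirely: feed the job a plain text file listing the image paths, use NLineInputFormat so each mapper gets a batch of, say, 100 paths (roughly 15-20 minutes of work at 10 seconds per image), and have the mapper shell out to the existing detection tool. The sketch below is an untested illustration; DetectionDriver, DetectionMapper and /usr/local/bin/detect are made-up names standing in for your real job and tool, and it simply captures the tool's standard output rather than the result text file your tool writes.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DetectionDriver {

        // Each input record is one line of a path-list file: the location of one scan.
        // The mapper runs the external detection tool on it and emits "path <tab> exit code + output".
        public static class DetectionMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String imagePath = line.toString().trim();
                if (imagePath.isEmpty()) {
                    return;
                }
                // "/usr/local/bin/detect" is a placeholder for the real command-line detector.
                Process p = new ProcessBuilder("/usr/local/bin/detect", imagePath)
                        .redirectErrorStream(true)
                        .start();
                StringBuilder output = new StringBuilder();
                try (BufferedReader r = new BufferedReader(
                        new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                    String outLine;
                    while ((outLine = r.readLine()) != null) {
                        output.append(outLine).append(' ');
                    }
                }
                int exitCode = p.waitFor();
                context.write(new Text(imagePath), new Text(exitCode + "\t" + output.toString().trim()));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "volumetric-detection");
            job.setJarByClass(DetectionDriver.class);
            job.setMapperClass(DetectionMapper.class);
            job.setNumReduceTasks(0);                        // map-only job: results go straight to the output files
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 100);  // ~100 * 10 s of work per mapper amortizes the startup cost
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));   // args[0]: text file listing the image paths
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // args[1]: output directory for the results
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

One caveat: this schedules mappers by the blocks of the path-list file, not by where the scans live, so it gives up strict data locality for the images unless the path list is arranged so that mappers mostly hit local data.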

