Asynchronously read and process an image in Python


Problem description

Background

I often find myself in the following situation:

  • I have a list of image filenames to process
  • I read each image sequentially, e.g. with scipy.misc.imread
  • I then run some processing on each image and return a result
  • I save the result alongside the image filename into a Shelf

The problem is that simply reading the image takes a non-negligible amount of time, sometimes comparable to or even longer than the image processing.

Question

So I was thinking that ideally I could read image n + 1 while processing image n. Or, even better, process and read multiple images at once in an automagically determined optimal way?

I have read about multiprocessing, threads, Twisted, gevent and the like, but I can't figure out which one to use and how to implement this idea. Does anyone have a solution to this kind of issue?
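The "read image n + 1 while processing image n" idea can be sketched with a one-slot prefetch on a background thread using the standard library's concurrent.futures. This is a minimal sketch, not the accepted answer's approach; read_image and process below are hypothetical stand-ins for the real disk read and processing step:

```python
from concurrent.futures import ThreadPoolExecutor

def read_image(filename):
    # hypothetical stand-in for the real disk read (e.g. scipy.misc.imread)
    return [len(filename)] * 4

def process(im):
    # hypothetical stand-in for the real processing step
    return sum(im)

def pipeline(filenames):
    """Process image n while image n + 1 is being read in a background thread."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(read_image, filenames[0])   # prefetch the first image
        for name in filenames[1:]:
            im = future.result()                         # wait for the current image
            future = pool.submit(read_image, name)       # start reading the next one
            results.append(process(im))                  # overlaps with that read
        results.append(process(future.result()))         # last image
    return results

print(pipeline(["a.png", "bb.png", "ccc.png"]))
```

A queue-based producer/consumer generalizes this to several reader threads if a single prefetch slot is not enough to hide the I/O latency.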

Minimal example

import scipy.misc
import scipy.ndimage

# generate a list of images
scipy.misc.imsave("lena.png", scipy.misc.lena())
files = ['lena.png'] * 100

# a simple image processing task
def process_image(im, threshold=128):
    label, n = scipy.ndimage.label(im > threshold)
    return n

# my current main loop
for f in files:
    im = scipy.misc.imread(f)
    print(process_image(im))

Recommended answer

Philip's answer is good, but will only create a couple of processes (one reading, one computing), which will hardly max out a modern >2-core system. Here's an alternative using multiprocessing.Pool (specifically, its map method) which creates processes that do both the reading and the computation, but which should make better use of all the cores you have available (assuming there are more files than cores).

#!/usr/bin/env python

import multiprocessing
import scipy
import scipy.misc
import scipy.ndimage

class Processor:
    def __init__(self, threshold):
        self._threshold = threshold

    def __call__(self, filename):
        im = scipy.misc.imread(filename)
        label, n = scipy.ndimage.label(im > self._threshold)
        return n

def main():
    scipy.misc.imsave("lena.png", scipy.misc.lena())
    files = ['lena.png'] * 100

    proc = Processor(128)
    pool = multiprocessing.Pool()
    results = pool.map(proc, files)

    print(results)

if __name__ == "__main__":
    main()

If I increase the number of images to 500 and use the processes=N argument to Pool, then I get

Processes   Runtime
   1         6.2s
   2         3.2s
   4         1.8s
   8         1.5s

on my quad-core hyperthreaded i7.

If you get into more realistic use-cases (i.e. actually different images), your processes might spend more time waiting for the image data to load from storage (in my testing, they loaded virtually instantaneously from the disk cache), and then it might be worth explicitly creating more processes than cores to get more overlap of compute and load. Only your own scalability testing on a realistic load and hardware can tell you what's actually best for you, though.
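As a sketch of that oversubscription idea: the standard library's multiprocessing.pool.ThreadPool shares Pool's map API, so you can run more workers than cores either by swapping it in or by passing a larger processes= value to Pool itself. Here load_and_count is a hypothetical stand-in for the real imread-plus-label step, and the oversubscription factor of 2 is an assumption to be tuned, not a recommendation. Note that threads only help if the heavy lifting is I/O-bound or releases the GIL (as many NumPy/SciPy operations do); for pure-Python compute, stick with process-based Pool:

```python
import multiprocessing
from multiprocessing.pool import ThreadPool

def load_and_count(filename):
    # hypothetical stand-in for reading the file and counting labelled regions
    return filename.count("a")

def run(files, oversubscribe=2):
    # More workers than cores: while some workers block on I/O,
    # others can keep the CPUs busy.
    n_workers = multiprocessing.cpu_count() * oversubscribe
    pool = ThreadPool(processes=n_workers)
    try:
        return pool.map(load_and_count, files)
    finally:
        pool.close()
        pool.join()

print(run(["aaa.png", "ab.png", "b.png"]))
```

For true multi-core compute, `multiprocessing.Pool(processes=n_workers)` is the drop-in process-based equivalent.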

