PyTesseract call working very slow when used along with multiprocessing


Problem description

I have a function that takes in a list of images and produces the output, in a list, after applying OCR to each image. I have another function that controls the input to this function, using multiprocessing. So, when I have a single list (i.e. no multiprocessing), each image of the list took ~1s, but when I increased the number of lists to be processed in parallel to 4, each image took an astounding 13s.

To understand where the problem really is, I tried to create a minimal working example. Here I have two functions, eat25 and eat100, which open an image name and feed it to the OCR, which uses the pytesseract API. eat25 does this 25 times, and eat100 does it 100 times.

My aim here is to run eat100 without multiprocessing, and eat25 with multiprocessing (with 4 processes). Theoretically, this should take 4 times less time than eat100 if I have 4 separate processors (I have 2 cores with 2 threads per core, so CPU(s) = 4; correct me if I'm wrong here).

But all that theory fell apart when I saw that the code didn't even respond after printing "Processing 0" 4 times. The single-process function eat100 worked fine, though.

I had tested a simple range-cubing function, and it did work well with multiprocessing (a sketch of that control test follows the list below), so my processors certainly do work. The only culprits here could be:

  • pytesseract: See this
  • Bad code? Something I am not doing right.
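
For reference, here is a minimal sketch of the kind of control test I mean (the exact cube function and input range are illustrative, not my original test code):

```
from pathos.multiprocessing import ProcessingPool

def cube(n):
    return n ** 3

if __name__ == '__main__':
    pool = ProcessingPool(4)
    # tiny payloads and trivial work: this parallelizes without trouble
    print(pool.map(cube, range(10)))
    pool.close()
    pool.join()
```

The minimal working example itself: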

```
from pathos.multiprocessing import ProcessingPool
from time import time
from PIL import Image
import pytesseract as pt

def eat25(name):
    for i in range(25):
        print('Processing :' + str(i))
        pt.image_to_string(Image.open(name), lang='hin+eng', config='--psm 6')

def eat100(name):
    for i in range(100):
        print('Processing :' + str(i))
        pt.image_to_string(Image.open(name), lang='hin+eng', config='--psm 6')

st = time()
eat100('normalBox.tiff')
en = time()
print('Direct :' + str(en - st))

# Using pathos
def caller():
    pool = ProcessingPool()
    pool.map(eat25, ['normalBox.tiff', 'normalBox.tiff',
                     'normalBox.tiff', 'normalBox.tiff'])

if __name__ == '__main__':
    caller()
en2 = time()
print('Pathos :' + str(en2 - en))
```

So, where does the problem really lie? Any help is appreciated!

The image normalBox.tiff can be found here. I would be glad if people could reproduce the code and check whether the problem persists.

Recommended answer

I'm the pathos author. If your code takes 1s to run serially, then it's quite possible that it will take longer when run with naive process parallelism. There is overhead to working with naive process parallelism:

  1. a new python instance has to be spun up on each processor
  2. your function and its dependencies need to be serialized and sent to each processor
  3. your data needs to be serialized and sent to the processors
  4. the same goes for deserialization
  5. you can run into memory problems, either from long-lived pools or from serializing a lot of data
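
A rough sketch of how to see this overhead in isolation (the pool size and the do-nothing payload are illustrative assumptions):

```
from pathos.pools import ProcessPool
from time import time

if __name__ == '__main__':
    t0 = time()
    pool = ProcessPool(4)              # new python instances get spun up here
    pool.map(lambda x: x, range(4))    # trivial work, so elapsed time is mostly overhead
    t1 = time()
    pool.terminate()
    print('overhead ~ %.2fs' % (t1 - t0))
```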

I'd suggest checking a few simple things to see where your issues might be:

  • try the pathos.pools.ThreadPool to use thread parallelism instead of process parallelism. This can reduce some of the overhead of serialization and of spinning up the pool (see the sketch after this list).
  • try the pathos.pools._ProcessPool to change how pathos manages the pool. Without the underscore, pathos keeps the pool around as a singleton, and requires a 'terminate' to explicitly kill the pool. With the underscore, the pool dies when you delete the pool object. Note that your caller function does not close or join (or terminate) the pool.
  • you might want to check how much you are serializing by trying dill.dumps on one of the elements you are processing in parallel. Things like big numpy arrays can take a while to serialize. If what is being passed around is large, you might consider using a shared-memory array (i.e. a multiprocess.Array, or the equivalent for numpy arrays -- also see numpy.ctypeslib) to minimize what is passed between each process.
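
Here is a minimal sketch of the first and third suggestions, reusing eat25 and the image from your question (the pool size of 4 is an assumption):

```
from pathos.pools import ThreadPool
from PIL import Image
import pytesseract as pt
import dill

def eat25(name):                       # same function as in the question
    for i in range(25):
        print('Processing :' + str(i))
        pt.image_to_string(Image.open(name), lang='hin+eng', config='--psm 6')

if __name__ == '__main__':
    # check how much has to be serialized per work item; a filename
    # string is tiny, so payload size is not the bottleneck here
    print('payload size:', len(dill.dumps('normalBox.tiff')), 'bytes')

    # threads share one interpreter, so there is no per-worker spin-up
    # or serialization cost
    pool = ThreadPool(4)
    pool.map(eat25, ['normalBox.tiff'] * 4)
    pool.close()
    pool.join()
```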

The shared-memory approach is a bit more work, but can provide huge savings if you have a lot to serialize. There is no shared-memory pool, so if you need to go that route, you have to do a for loop over individual multiprocess.Process objects, as sketched below.
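
A minimal sketch of that route (the fill function, the chunking, and the array size are illustrative assumptions; multiprocess mirrors the standard multiprocessing API):

```
from multiprocess import Process, Array

def fill(shared, start, stop):
    # workers write results straight into shared memory instead of
    # returning them, so nothing large gets serialized back
    for i in range(start, stop):
        shared[i] = float(i * i)

if __name__ == '__main__':
    n = 16
    shared = Array('d', n)                        # shared array of doubles
    procs = [Process(target=fill, args=(shared, k * 4, (k + 1) * 4))
             for k in range(4)]                   # one Process per chunk
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(list(shared))
```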
