PyTesseract call working very slow when used along with multiprocessing
Problem Description
I have a function that takes in a list of images and produces the output, in a list, after applying OCR to each image. I have another function that controls the input to this function, using multiprocessing. So, when I have a single list (i.e. no multiprocessing), each image in the list took ~1s, but when I increased the number of lists to be processed in parallel to 4, each image took an astounding 13s.
To understand where the problem really is, I tried to create a minimal working example. Here I have two functions, `eat25` and `eat100`, which open an image `name` and feed it to the OCR, using the `pytesseract` API. `eat25` does this 25 times, and `eat100` does it 100 times.
My aim here is to run `eat100` without multiprocessing, and `eat25` with multiprocessing (with 4 processes). In theory, this should take a quarter of the time of `eat100` if I have 4 separate processors (I have 2 cores with 2 threads per core, thus CPU(s) = 4; correct me if I'm wrong here).
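As a sanity check on that processor count, Python can report the number of logical CPUs directly. Note that `os.cpu_count()` counts logical cores (hardware threads), not physical cores:

```python
import os
import multiprocessing

print(os.cpu_count())               # logical CPUs (hardware threads)
print(multiprocessing.cpu_count())  # same value via the older API
```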
But all that theory lay wasted when I saw that the code didn't even respond after printing "Processing 0" 4 times. The single-process function `eat100` worked fine, though.
I had tested a simple range-cubing function, and it did work well with multiprocessing, so my processors certainly work. The only culprits here could be:
- `pytesseract`: see this.
- Bad code? Something I am not doing right.
```python
from pathos.multiprocessing import ProcessingPool
from time import time
from PIL import Image
import pytesseract as pt

def eat25(name):
    for i in range(25):
        print('Processing :' + str(i))
        pt.image_to_string(Image.open(name), lang='hin+eng', config='--psm 6')

def eat100(name):
    for i in range(100):
        print('Processing :' + str(i))
        pt.image_to_string(Image.open(name), lang='hin+eng', config='--psm 6')

st = time()
eat100('normalBox.tiff')
en = time()
print('Direct :' + str(en - st))

# Using pathos
def caller():
    pool = ProcessingPool()
    pool.map(eat25, ['normalBox.tiff', 'normalBox.tiff', 'normalBox.tiff', 'normalBox.tiff'])

if __name__ == '__main__':
    caller()

en2 = time()
print('Pathos :' + str(en2 - en))
```
So, where is the problem, really? Any help is appreciated!
The image `normalBox.tiff` can be found here. I would be glad if people reproduced the code and checked whether the problem continues.
Answer
I'm the `pathos` author. If your code takes `1s` to run serially, then it's quite possible that it will take longer to run with naive process parallelism. There is overhead to working with naive process parallelism:
- a new Python instance has to be spun up on each processor
- your function and its dependencies need to be serialized and sent to each processor
- your data needs to be serialized and sent to the processors
- the same applies for deserialization
- you can run into memory problems, either from long-lived pools or from serializing large amounts of data
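To make the serialization overhead concrete, here is a minimal sketch, using the stdlib `pickle` rather than `dill` (they behave the same way for plain data), showing that the cost of shipping an argument to a worker grows with its size:

```python
import pickle
import time

small = b'x' * 1_000        # ~1 KB payload
large = b'x' * 10_000_000   # ~10 MB payload

# Every argument sent to a worker process pays this cost, both ways
for name, payload in [('small', small), ('large', large)]:
    t0 = time.perf_counter()
    blob = pickle.dumps(payload)
    t1 = time.perf_counter()
    print(f'{name}: {len(blob)} bytes serialized in {t1 - t0:.6f}s')
```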
I'd suggest checking a few simple things to see where your issues might be:
- Try the `pathos.pools.ThreadPool` to use thread parallelism instead of process parallelism. This can reduce some of the overhead for serialization and for spinning up the pool.
- Try the `pathos.pools._ProcessPool` to change how `pathos` manages the pool. Without the underscore, `pathos` keeps the pool around as a singleton, and requires a 'terminate' to explicitly kill the pool. With the underscore, the pool dies when you delete the pool object. Note that your `caller` function does not `close` or `join` (or `terminate`) the pool.
- You might want to check how much you are serializing by trying to `dill.dumps` one of the elements you are trying to process in parallel. Things like big `numpy` arrays can take a while to serialize. If the size of what is being passed around is large, you might consider using a shared-memory array (i.e. a `multiprocess.Array`, or the equivalent version for `numpy` arrays; also see `numpy.ctypeslib`) to minimize what is being passed between the processes.
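A minimal sketch of the first suggestion, plus the missing pool cleanup, using the stdlib `multiprocessing.pool.ThreadPool` (the `pathos.pools.ThreadPool` interface is essentially the same). Here `work` is a stand-in for the OCR call, not `pytesseract` itself:

```python
from multiprocessing.pool import ThreadPool

def work(name):
    # Stand-in for pt.image_to_string(Image.open(name), ...)
    return len(name)

pool = ThreadPool(4)
try:
    results = pool.map(work, ['a.tiff', 'b.tiff', 'c.tiff', 'd.tiff'])
finally:
    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for the workers to finish
print(results)
```

Threads avoid the per-process serialization cost entirely, which matters here because each task ships an image path and an OCR result.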
The latter is a bit more work, but can provide huge savings if you have a lot to serialize. There is no shared-memory pool, so if you need to go that route, you have to do a for loop over individual `multiprocess.Process` objects.