How to share data between all processes in Python multiprocessing?
Problem description
I want to search for a pre-defined list of keywords in a given article and increment the score by 1 whenever a keyword is found in the article. I want to use multiprocessing because the pre-defined list is very large - 10k keywords - and the number of articles is 100k.
I came across this question, but it does not address my problem.
I tried this implementation, but I am getting None as the result.
import multiprocessing as mp

keywords = ["threading", "package", "parallelize"]

def search_worker(keyword):
    score = 0
    article = """
The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    if keyword in article:
        score += 1
    return score
I tried the two methods below, but I am getting three None values as the result.
Method 1:
pool = mp.Pool(processes=4)
result = [pool.apply(search_worker, args=(keyword,)) for keyword in keywords]
Method 2:
result = pool.map(search_worker, keywords)
print(result)
Actual output: [None, None, None]
Expected output: 3
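In other words, the single-process computation I am trying to parallelize is roughly the following (a small sketch using the sample article from my code above):

keywords = ["threading", "package", "parallelize"]
article = ("The multiprocessing package also includes some APIs that are not in the "
           "threading module at all. For example, there is a neat Pool class that you "
           "can use to parallelize executing a function across multiple inputs.")
# Count how many of the keywords occur in the article text
score = sum(1 for keyword in keywords if keyword in article)
print(score)  # prints 3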
I thought of sending the worker the pre-defined list of keywords and the article all together, but I am not sure whether I am going in the right direction, as I don't have prior experience with multiprocessing.
Thank you.
Recommended answer
Here's a function using Pool. You can pass text and keyword_list and it will work. You could use Pool.starmap to pass tuples of (text, keyword), but you would need to deal with an iterable that held 10k references to text (a small sketch of that variant appears after the code below).
from functools import partial
from multiprocessing import Pool

def search_worker(text, keyword):
    return int(keyword in text)

def parallel_search_text(text, keyword_list):
    processes = 4
    chunk_size = 10
    total = 0
    func = partial(search_worker, text)
    with Pool(processes=processes) as pool:
        for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
            total += result
    return total

if __name__ == '__main__':
    texts = []     # a list of texts
    keywords = []  # a list of keywords
    for text in texts:
        print(parallel_search_text(text, keywords))
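For comparison, here is a minimal sketch of the Pool.starmap variant mentioned above; the helper name starmap_search_text is illustrative, and it reuses search_worker and Pool from the code above. The argument list holds one reference to text per keyword, which is the overhead described earlier:

def starmap_search_text(text, keyword_list):
    # Build one (text, keyword) tuple per keyword; for 10k keywords this
    # iterable carries 10k references to the same text object.
    args = [(text, keyword) for keyword in keyword_list]
    with Pool(processes=4) as pool:
        return sum(pool.starmap(search_worker, args, chunksize=10))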
There is overhead in creating a pool of workers. It might be worth timeit-testing this against a simple single-process text search function (a rough sketch of that comparison follows the code below). Repeated calls can be sped up by creating one instance of Pool and passing it into the function.
def parallel_search_text2(text, keyword_list, pool):
    chunk_size = 10
    results = 0
    func = partial(search_worker, text)
    for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
        results += result
    return results

if __name__ == '__main__':
    pool = Pool(processes=4)
    texts = []     # a list of texts
    keywords = []  # a list of keywords
    for text in texts:
        print(parallel_search_text2(text, keywords, pool))
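As a rough version of the timing comparison suggested above, one could compare a plain single-process search against the pooled version with timeit. This sketch builds on the code above (Pool and parallel_search_text2); the helper single_search_text and the sample inputs are only illustrative:

import timeit

def single_search_text(text, keyword_list):
    # Plain single-process baseline with no Pool overhead.
    return sum(int(keyword in text) for keyword in keyword_list)

if __name__ == '__main__':
    sample_text = "the threading module and the multiprocessing package"
    sample_keywords = ["threading", "package", "parallelize"] * 1000
    pool = Pool(processes=4)
    print(timeit.timeit(lambda: single_search_text(sample_text, sample_keywords), number=10))
    print(timeit.timeit(lambda: parallel_search_text2(sample_text, sample_keywords, pool), number=10))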