How to share data between all processes in Python multiprocessing?

Question

I want to search for a pre-defined list of keywords in a given article and increment the score by 1 if a keyword is found in the article. I want to use multiprocessing since the pre-defined list of keywords is very large - 10k keywords - and the number of articles is 100k.

I came across this question, but it does not address my problem.

I tried this implementation, but I am getting None as the result.

keywords = ["threading", "package", "parallelize"]

def search_worker(keyword):
    score = 0
    article = """
    The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""

    if keyword in article:
        score += 1
    return score

I tried the two methods below, but I am getting three Nones as the result.

Method 1:

import multiprocessing as mp

pool = mp.Pool(processes=4)
result = [pool.apply(search_worker, args=(keyword,)) for keyword in keywords]

Method 2:

result = pool.map(search_worker, keywords)
print(result)

Actual output: [None, None, None]

Expected output: 3

I thought of sending the worker the pre-defined list of keywords and the article all together, but I am not sure if I am going in the right direction, as I don't have prior experience with multiprocessing.

Thanks.

Answer

Here's a function using Pool. You can pass text and keyword_list and it will work. You could use Pool.starmap to pass tuples of (text, keyword), but then you would need to deal with an iterable holding 10k references to text.

from functools import partial
from multiprocessing import Pool

def search_worker(text, keyword):
    return int(keyword in text)

def parallel_search_text(text, keyword_list):
    processes = 4
    chunk_size = 10
    total = 0
    func = partial(search_worker, text)
    with Pool(processes=processes) as pool:
        for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
            total += result

    return total

if __name__ == '__main__':
    texts = []  # a list of texts
    keywords = []  # a list of keywords
    for text in texts:
        print(parallel_search_text(text, keywords))
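
For comparison, below is a minimal sketch of the Pool.starmap alternative mentioned above. It is an assumption-laden illustration, not part of the original answer: the function name parallel_search_text_starmap is made up, and the list comprehension makes explicit the 10k-references-to-text issue the answer warns about.

from multiprocessing import Pool

def search_worker(text, keyword):
    return int(keyword in text)

def parallel_search_text_starmap(text, keyword_list):
    # Build one (text, keyword) tuple per keyword; for 10k keywords this
    # iterable holds 10k references to the same text object.
    args = [(text, keyword) for keyword in keyword_list]
    with Pool(processes=4) as pool:
        return sum(pool.starmap(search_worker, args, chunksize=10))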

There is overhead in creating a pool of workers. It might be worth timeit-testing this against a simple single-process text search function. Repeated calls can be sped up by creating one instance of Pool and passing it into the function.

def parallel_search_text2(text, keyword_list, pool):
    chunk_size = 10
    results = 0
    func = partial(search_worker, text)

    for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
        results += result
    return results

if __name__ == '__main__':
    pool = Pool(processes=4)
    texts = []  # a list of texts
    keywords = []  # a list of keywords
    for text in texts:
        print(parallel_search_text2(text, keywords, pool))
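
As a rough way to run the timeit comparison suggested above, something like the following sketch could be used. It assumes the search_worker and parallel_search_text2 definitions from the previous block are in scope; simple_search_text and the placeholder data are illustrative, not from the original answer.

import timeit
from multiprocessing import Pool

def simple_search_text(text, keyword_list):
    # Single-process baseline: count how many keywords appear in the text.
    return sum(int(keyword in text) for keyword in keyword_list)

if __name__ == '__main__':
    text = "some long article text"                        # placeholder article
    keywords = ["threading", "package", "parallelize"]     # placeholder keywords

    pool = Pool(processes=4)
    # Time both versions over the same inputs.
    print(timeit.timeit(lambda: simple_search_text(text, keywords), number=10))
    print(timeit.timeit(lambda: parallel_search_text2(text, keywords, pool), number=10))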
