Python multiprocessing - Pool.map running only one task (instead of multiple)

Question

I have code that parses a large number of XML files (using the xml.sax library) to extract data for future machine learning. I want the parsing part to run in parallel (the server has 24 cores and also runs some web services, so I decided to use 20 of them). After the parsing I want to merge the results. The following code should do (and is doing) exactly what I expect, but there is a problem with the parallel part.

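from multiprocessing import Pool
from xml.sax import make_parser

# MyXMLHandler is the asker's own xml.sax ContentHandler subclass (not shown).

def runParse(fname):
    # Parse one XML file in a worker process and return the handler's result.
    parser = make_parser()
    handler = MyXMLHandler()
    parser.setContentHandler(handler)
    parser.parse(fname)
    return handler.getResult()

def makeData(flist, tasks=20):
    pool = Pool(processes=tasks)
    # map() blocks until every file has been parsed, then returns the full list.
    tmp = pool.map(runParse, flist)
    for result in tmp:
        pass  # and here the merging part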

When this part starts, it runs on 20 cores for a while and then drops to only one. This happens before the merging part (which will, of course, run on only one core).

Can anyone help solve this problem or suggest a way to speed up the program?

Thanks!

ppiikkaaa

Recommended answer

Why do you say it goes to only one before completing?

You're using .map(), which collects all the results and only then returns. So for a large dataset you're probably stuck in the collecting phase.

You can try using .imap(), which is the iterator version of .map(), or even .imap_unordered() if the order of analysis is not important (as it seems from your example).
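As a rough illustration of that suggestion, here is a minimal sketch that merges each result as soon as a worker delivers it. The names parse_one, merge_into and make_data are hypothetical stand-ins (parse_one replaces runParse so the example runs without any XML files), and the merge step is only a guess at what "the merging part" might look like.

```python
from multiprocessing import Pool


def parse_one(fname):
    # Hypothetical stand-in for runParse: returns a small dict per "file".
    return {"name": fname, "length": len(fname)}


def merge_into(merged, result):
    # Hypothetical merge step: fold one worker result into the combined dict.
    merged[result["name"]] = result["length"]
    return merged


def make_data(flist, tasks=4):
    merged = {}
    with Pool(processes=tasks) as pool:
        # imap_unordered yields each result as soon as it is available,
        # so merging overlaps with parsing instead of waiting for the full list.
        for result in pool.imap_unordered(parse_one, flist):
            merge_into(merged, result)
    return merged


if __name__ == "__main__":
    print(make_data(["a.xml", "bb.xml", "ccc.xml"]))
```

Because the results arrive in completion order rather than input order, this only fits when the merge does not depend on ordering, as the answer assumes.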

Here's the relevant documentation. Worth noting is the line:

"For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1."
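The chunksize argument is passed directly to .imap() or .imap_unordered(). The sketch below shows one way this might look. The pick_chunksize helper and its "a few chunks per worker" heuristic are assumptions made for illustration, not something the documentation prescribes, and parse_one is again a hypothetical stand-in for runParse.

```python
from multiprocessing import Pool


def pick_chunksize(n_items, n_workers, chunks_per_worker=4):
    # Hypothetical heuristic: aim for a few chunks per worker, never below 1.
    return max(1, n_items // (n_workers * chunks_per_worker))


def parse_one(fname):
    # Hypothetical stand-in for runParse.
    return len(fname)


def total_length(flist, tasks=4):
    chunk = pick_chunksize(len(flist), tasks)
    with Pool(processes=tasks) as pool:
        # Each task sent to a worker now carries `chunk` file names
        # instead of a single one, which reduces inter-process overhead.
        return sum(pool.imap_unordered(parse_one, flist, chunksize=chunk))


if __name__ == "__main__":
    names = ["file%d.xml" % i for i in range(1000)]
    print(total_length(names))
```

Larger chunks lower the per-task overhead but make load balancing coarser, so with files of very uneven size a smaller value may keep the cores busier toward the end of the run.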
