parallel file parsing, multiple CPU cores
Question
I asked a related but very general question earlier (see especially this response).
This question is very specific. Here is all the code I care about:
result = {}
for line in open('input.txt'):
    key, value = parse(line)
    result[key] = value
The function parse is completely self-contained (i.e., it doesn't use any shared resources).
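The question does not show parse itself. For concreteness, a self-contained stand-in could look like this (the tab-separated "key\tvalue" format is an assumption, not part of the question):

```python
def parse(line):
    """Hypothetical stand-in: split one tab-separated 'key\\tvalue' line."""
    key, value = line.rstrip('\n').split('\t', 1)
    return key, value
```

Any pure function of one line with this shape would work the same way in the code below.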
I have an Intel i7-920 CPU (4 cores, 8 threads; I think the threads are more relevant, but I'm not sure).
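As a side note, the standard library reports the number of logical CPUs (i.e., the 8 hyper-threads on an i7-920, not the 4 physical cores), which is the usual default for sizing a worker pool:

```python
import os

# os.cpu_count() returns the number of logical CPUs, e.g. 8 on an i7-920
n = os.cpu_count()
print(n)
```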
What can I do to make my program use all the parallel capabilities of this CPU?
I assume I can open this file for reading in 8 different threads without much performance penalty, since disk access time is small relative to the total time.
Answer
CPython does not easily provide the threading model you are looking for: because of the Global Interpreter Lock, only one thread executes Python bytecode at a time, so threads won't speed up CPU-bound parsing. You can get something similar using the multiprocessing module and a process pool.
Such a solution could look something like this:
def worker(lines):
    """Build a dict from the parsed lines this worker is given."""
    result = {}
    for line in lines:  # lines is already a list of strings, no split needed
        k, v = parse(line)
        result[k] = v
    return result
import multiprocessing

if __name__ == '__main__':
    # configurable options; different values may work better
    numworkers = 8
    numlines = 100
    with open('input.txt') as f:
        lines = f.readlines()
    # create the process pool
    pool = multiprocessing.Pool(processes=numworkers)
    # map batches of lines to a list of per-batch result dicts
    result_list = pool.map(worker,
        [lines[i:i + numlines] for i in range(0, len(lines), numlines)])
    # reduce the per-batch dicts into a single dict
    result = {}
    for partial in result_list:
        result.update(partial)
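On current Python 3, the same idea can be written without slicing the line list by hand: Pool.map accepts a chunksize argument that batches items per task for you. A sketch under the same assumptions (parse must be a pure, top-level function; the tab-separated format here is again a hypothetical stand-in):

```python
import multiprocessing


def parse(line):
    """Hypothetical stand-in: split one tab-separated 'key\\tvalue' line."""
    key, value = line.rstrip('\n').split('\t', 1)
    return key, value


def build_dict(path, processes=8):
    """Parse every line of the file at `path` across a pool of processes."""
    with open(path) as f:
        lines = f.readlines()
    with multiprocessing.Pool(processes=processes) as pool:
        # chunksize batches lines per task, amortizing the IPC overhead
        pairs = pool.map(parse, lines, chunksize=100)
    return dict(pairs)
```

The IPC cost of shipping each line to a worker and the result back is real; whether this beats the single-process loop depends on how expensive parse is, so it is worth measuring on the actual workload.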