Parallelize this nested for loop in python

Question
I'm struggling again to improve the execution time of this piece of code. Since the calculations are really time-consuming I think that the best solution would be to parallelize the code.
I was first working with maps as explained in this question, but then I tried a simpler approach, thinking that I could find a better solution. However, I haven't come up with anything yet, so since it's a different problem I decided to post it as a new question.
I am working on a Windows platform, using Python 3.4.
Here's the code:
similarity_matrix = [[0 for x in range(word_count)] for x in range(word_count)]
for i in range(0, word_count):
    for j in range(0, word_count):
        if i > j:
            similarity = calculate_similarity(t_matrix[i], t_matrix[j])
            similarity_matrix[i][j] = similarity
            similarity_matrix[j][i] = similarity
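As a side note, the i > j guard means each unordered pair of rows is computed exactly once; the same pairs can be enumerated with itertools.combinations. A minimal check, using a made-up toy word_count just for illustration:

```python
from itertools import combinations

word_count = 4  # toy size, for illustration only

# The pairs visited by the nested loop with the i > j guard
pairs_nested = [(i, j) for i in range(word_count) for j in range(word_count) if i > j]

# The same unordered pairs via combinations (swapped so i > j)
pairs_comb = [(j, i) for i, j in combinations(range(word_count), 2)]

print(sorted(pairs_nested) == sorted(pairs_comb))  # True
```

This is the iteration scheme the answer below relies on.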
And this is the calculate_similarity function:
def calculate_similarity(array_word1, array_word2):
    denominator = sum([array_word1[i] + array_word2[i] for i in range(word_count)])
    if denominator == 0:
        return 0
    numerator = sum([2 * min(array_word1[i], array_word2[i]) for i in range(word_count)])
    return numerator / denominator
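For a quick sanity check of the metric, here is a self-contained variant of the function (taking the length from the input lists rather than the global word_count, which is a small change for this sketch) applied to toy rows: identical rows score 1.0, disjoint rows score 0.0.

```python
def calculate_similarity(array_word1, array_word2):
    # Self-contained variant: derive the length locally instead of
    # using the global word_count from the question.
    word_count = len(array_word1)
    denominator = sum(array_word1[i] + array_word2[i] for i in range(word_count))
    if denominator == 0:
        return 0
    numerator = sum(2 * min(array_word1[i], array_word2[i]) for i in range(word_count))
    return numerator / denominator

print(calculate_similarity([1.0, 0.0], [1.0, 0.0]))  # identical rows -> 1.0
print(calculate_similarity([1.0, 0.0], [0.0, 1.0]))  # disjoint rows -> 0.0
print(calculate_similarity([0.0, 0.0], [0.0, 0.0]))  # all-zero rows -> 0
```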
And an explanation of the code:

- word_count is the total number of unique words stored in a list
- t_matrix is a matrix containing a value for each pair of words
- the output should be similarity_matrix, whose dimension is word_count x word_count, also containing a similarity value for each pair of words
- it's ok to keep both matrices in memory
- after these computations I can easily find the most similar word for each word (or the top three similar words, as the task may require)
- calculate_similarity takes two float lists, each for a separate word (each is a row in t_matrix)
I work with a list of 13k words, and if I calculated correctly the execution time on my system would be a few days. So, anything that will do the job in one day would be wonderful!
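Before reaching for multiple processes, it may be worth noting that the per-pair arithmetic itself can be vectorized with NumPy, assuming t_matrix fits in a 2-D array (an assumption not stated in the question). A sketch that processes one row at a time to keep memory bounded, shown on toy data:

```python
import numpy as np

# Toy stand-in for t_matrix (3 "words", 3 features each)
A = np.array([[1.0, 0.0, 2.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 2.0]])
n = A.shape[0]
row_sums = A.sum(axis=1)

sim = np.zeros((n, n))
for i in range(n):
    # Vectorized over all j at once: 2 * sum(min) / (sum_i + sum_j)
    numer = 2 * np.minimum(A[i], A).sum(axis=1)
    denom = row_sums[i] + row_sums
    # where= guards the all-zero case, matching the question's return 0
    np.divide(numer, denom, out=sim[i], where=denom != 0)

print(sim[0, 1])  # 0.4 for these toy rows
```

This removes the inner Python-level loop entirely; the remaining loop over i can still be parallelized as in the answer below.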
Maybe only parallelizing the calculation of numerator and denominator in calculate_similarity would make a significant improvement.
Answer
from concurrent.futures import ProcessPoolExecutor, wait
from itertools import combinations
from functools import partial

similarity_matrix = [[0] * word_count for _ in range(word_count)]

def callback(i, j, future):
    # Fill both symmetric entries once the worker finishes
    similarity = future.result()
    similarity_matrix[i][j] = similarity
    similarity_matrix[j][i] = similarity

with ProcessPoolExecutor(max_workers=4) as executor:
    fs = []
    for i, j in combinations(range(word_count), 2):
        future = executor.submit(
            calculate_similarity,
            t_matrix[i],
            t_matrix[j])
        future.add_done_callback(partial(callback, i, j))
        fs.append(future)
    wait(fs)
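To see the callback pattern end to end on toy data, here is a runnable sketch. It uses ThreadPoolExecutor instead of ProcessPoolExecutor purely so the snippet is self-contained (both share the same Executor API, and a process pool would require calculate_similarity to be importable at module level for pickling); the data and sizes are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor, wait
from itertools import combinations
from functools import partial

# Toy stand-in for t_matrix
t_matrix = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
word_count = len(t_matrix)

def calculate_similarity(a, b):
    denominator = sum(x + y for x, y in zip(a, b))
    if denominator == 0:
        return 0
    return sum(2 * min(x, y) for x, y in zip(a, b)) / denominator

similarity_matrix = [[0] * word_count for _ in range(word_count)]

def callback(i, j, future):
    # Mirror the result into both symmetric entries
    similarity_matrix[i][j] = similarity_matrix[j][i] = future.result()

with ThreadPoolExecutor(max_workers=4) as executor:
    fs = []
    for i, j in combinations(range(word_count), 2):
        future = executor.submit(calculate_similarity, t_matrix[i], t_matrix[j])
        future.add_done_callback(partial(callback, i, j))
        fs.append(future)
    wait(fs)

print(similarity_matrix[0][1] == similarity_matrix[1][0])  # True
```

Note that, as in the answer, the diagonal entries are never submitted and stay 0; with 13k words this scheme submits about 84 million futures, so in practice batching pairs per task may be worth considering.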