Parallelize a nested for loop in python for finding the max value
Problem description
I've been struggling for some time to improve the execution time of this piece of code. Since the calculations are really time-consuming, I think the best solution is to parallelize the code. The output could also be stored in memory and written to a file afterwards.
I am new to both Python and parallelism, so I find it difficult to apply the concepts explained here and here. I also found this question, but I couldn't figure out how to implement the same thing for my situation. I am working on a Windows platform, using Python 3.4.
for i in range(0, len(unique_words)):
    max_similarity = 0
    max_similarity_word = ""
    for j in range(0, len(unique_words)):
        if not i == j:
            similarity = calculate_similarity(global_map[unique_words[i]], global_map[unique_words[j]])
            if similarity > max_similarity:
                max_similarity = similarity
                max_similarity_word = unique_words[j]
    file_co_occurring.write(
        unique_words[i] + "\t" + max_similarity_word + "\t" + str(max_similarity) + "\n")
If you need an explanation of the code:

- unique_words is a list of words (strings)
- global_map is a dictionary whose keys are words (global_map.keys() contains the same elements as unique_words) and whose values are dictionaries of the format {word: value}, where the words are a subset of the values in unique_words
- for each word, I look for the most similar word based on its value in global_map. I would prefer not to store every similarity in memory, since the maps already take up too much
- calculate_similarity returns a value from 0 to 1
- the result should contain the most similar word for each of the words in unique_words (the most similar word should be different from the word itself, which is why I added the condition if not i == j, but this could also be done by checking whether max_similarity is different from 1)
- if the max_similarity for a word is 0, it's OK for the most similar word to be the empty string
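For concreteness, the setup described above can be sketched with toy data. The words, the co-occurrence maps, and the overlap-based calculate_similarity below are illustrative stand-ins, not the real data or the real similarity function:

```python
# Hypothetical stand-ins for the structures described above.
unique_words = ["cat", "dog", "fish"]

# global_map: word -> {co-occurring word: count}; its keys match unique_words,
# and the inner keys are themselves a subset of unique_words.
global_map = {
    "cat":  {"dog": 3, "fish": 1},
    "dog":  {"cat": 3, "fish": 2},
    "fish": {"cat": 1, "dog": 2},
}

def calculate_similarity(map1, map2):
    """Toy similarity: fraction of shared co-occurring words, in 0..1."""
    shared = set(map1) & set(map2)
    total = set(map1) | set(map2)
    return len(shared) / len(total) if total else 0.0
```

These stand-ins can be plugged directly into the nested loop above to experiment with it on a small scale.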
Answer
Here is a solution that should work for you. I ended up changing a lot of your code, so please ask if you have any questions.
This is far from the only way to accomplish this, and in particular it is not a memory-efficient solution.
You will need to set max_workers to something that works for you. Usually the number of logical processors in your machine is a good starting point.
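One way to pick that starting point programmatically (this snippet is an addition for illustration, not part of the original answer) is to ask the standard library for the logical processor count:

```python
import os

# os.cpu_count() returns the number of logical processors,
# or None if it cannot be determined; fall back to 4 in that case.
max_workers = os.cpu_count() or 4
```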
from concurrent.futures import ThreadPoolExecutor, Future
from itertools import permutations
from collections import namedtuple, defaultdict

Result = namedtuple('Result', ('value', 'word'))

def new_calculate_similarity(word1, word2):
    return Result(
        calculate_similarity(global_map[word1], global_map[word2]),
        word2)

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = defaultdict(list)
    for word1, word2 in permutations(unique_words, r=2):
        futures[word1].append(
            executor.submit(new_calculate_similarity, word1, word2))

    for word in futures:
        # this will block until all calculations have completed for 'word'
        results = map(Future.result, futures[word])
        max_result = max(results, key=lambda r: r.value)
        print(word, max_result.word, max_result.value,
              sep='\t',
              file=file_co_occurring)
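For reference, here is a self-contained run of the same pattern. The toy words, maps, and calculate_similarity are illustrative assumptions, and file_co_occurring is an in-memory io.StringIO buffer, matching the question's note that the output could be stored in memory and written to a file afterwards:

```python
import io
from collections import defaultdict, namedtuple
from concurrent.futures import ThreadPoolExecutor, Future
from itertools import permutations

Result = namedtuple('Result', ('value', 'word'))

# Toy inputs standing in for the real data.
unique_words = ["cat", "dog", "fish"]
global_map = {
    "cat":  {"dog": 3, "fish": 1},
    "dog":  {"cat": 3, "fish": 2},
    "fish": {"cat": 1, "dog": 2},
}

def calculate_similarity(map1, map2):
    # Toy similarity: fraction of shared co-occurring words, in 0..1.
    shared = set(map1) & set(map2)
    total = set(map1) | set(map2)
    return len(shared) / len(total) if total else 0.0

def new_calculate_similarity(word1, word2):
    return Result(calculate_similarity(global_map[word1], global_map[word2]), word2)

file_co_occurring = io.StringIO()  # write to memory instead of a file

with ThreadPoolExecutor(max_workers=4) as executor:
    # Submit one task per ordered pair of distinct words.
    futures = defaultdict(list)
    for word1, word2 in permutations(unique_words, r=2):
        futures[word1].append(executor.submit(new_calculate_similarity, word1, word2))

    # For each word, wait for its tasks and keep the best-scoring partner.
    for word in futures:
        results = map(Future.result, futures[word])
        max_result = max(results, key=lambda r: r.value)
        print(word, max_result.word, max_result.value,
              sep='\t', file=file_co_occurring)

print(file_co_occurring.getvalue())
```

The in-memory buffer holds one tab-separated line per word; calling file_co_occurring.getvalue() at the end retrieves everything for writing to disk in one go.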
Here are the docs for the libraries I used:
- Futures
- collections
- itertools