Parallelize a nested for loop in Python for finding the max value


Problem description

I'm struggling for some time to improve the execution time of this piece of code. Since the calculations are really time-consuming I think that the best solution would be to parallelize the code. The output could be also stored in memory, and written to a file afterwards.

I am new to both Python and parallelism, so I find it difficult to apply the concepts explained here and here. I also found this question, but I couldn't manage to figure out how to implement the same for my situation. I am working on a Windows platform, using Python 3.4.

for i in range(0, len(unique_words)):
    max_similarity = 0        
    max_similarity_word = ""
    for j in range(0, len(unique_words)):
        if not i == j:
            similarity = calculate_similarity(global_map[unique_words[i]], global_map[unique_words[j]])
        if similarity > max_similarity:
            max_similarity = similarity
            max_similarity_word = unique_words[j]
    file_co_occurring.write(
        unique_words[i] + "\t" + max_similarity_word + "\t" + str(max_similarity) + "\n")

A few notes on the code:

  • unique_words is a list of words (strings)
  • global_map is a dictionary whose keys are words (global_map.keys() contains the same elements as unique_words) and whose values are dictionaries of the form {word: value}, where the words are a subset of unique_words
  • for each word, I look for the most similar word based on its value in global_map. I'd prefer not to store every similarity in memory, since the maps already take up too much.
  • calculate_similarity returns a value from 0 to 1
  • the result should contain the most similar word for each of the words in unique_words (the most similar word should be different from the word itself, which is why I added the condition if not i == j; alternatively, this could be handled by checking whether max_similarity equals 1)
  • if the max_similarity for a word is 0, it's OK for the most similar word to be the empty string
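The question never shows calculate_similarity. For concreteness, here is a minimal sketch of one plausible implementation (cosine similarity over the sparse {word: value} dictionaries) — an assumption for illustration, not the asker's actual function:

```python
import math

def calculate_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse {word: value} dicts.

    Hypothetical stand-in for the asker's function; for non-negative
    values it returns a result in [0, 1], matching the stated contract.
    """
    # Dot product over the keys the two vectors share.
    dot = sum(v * vec_b[k] for k, v in vec_a.items() if k in vec_b)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Whatever the real function looks like, what matters for the parallelization below is only that it is pure (no shared mutable state) and CPU-bound.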

Recommended answer

Here is a solution that should work for you. I ended up changing a lot of your code so please ask if you have any questions.

This is far from the only way to accomplish this, and in particular this is not a memory efficient solution.

You will need to set max_workers to something that works for you. Usually the number of logical processors in your machine is a good starting point.
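For example, that starting point can be derived at runtime rather than hard-coded (a small sketch using the standard library's os module):

```python
import os

# os.cpu_count() returns the number of logical processors, or None
# if it cannot be determined, so fall back to a small constant.
max_workers = os.cpu_count() or 4
```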

from concurrent.futures import ThreadPoolExecutor, Future
from itertools import permutations
from collections import namedtuple, defaultdict

Result = namedtuple('Result', ('value', 'word'))

def new_calculate_similarity(word1, word2):
    return Result(
        calculate_similarity(global_map[word1], global_map[word2]),
        word2)

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = defaultdict(list)
    for word1, word2 in permutations(unique_words, r=2):
        futures[word1].append(
            executor.submit(new_calculate_similarity, word1, word2))

    for word in futures:
        # this will block until all calculations have completed for 'word'
        results = map(Future.result, futures[word])
        max_result = max(results, key=lambda r: r.value)
        print(word, max_result.word, max_result.value,
              sep='\t',
              file=file_co_occurring)

Here are the docs for the libraries I used:

  • Futures
  • collections
  • itertools
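One caveat worth noting: if calculate_similarity is pure-Python and CPU-bound, CPython's GIL limits how much speedup a ThreadPoolExecutor can deliver. ProcessPoolExecutor exposes the same submit/result interface but runs workers in separate processes; a minimal sketch (the pow call is just a placeholder for any picklable, module-level callable):

```python
from concurrent.futures import ProcessPoolExecutor

# On Windows (the asker's platform) new processes re-import the main
# module, so the entry point must be guarded by __main__.
if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=4) as executor:
        future = executor.submit(pow, 2, 10)
        print(future.result())  # prints 1024
```

The trade-off is that arguments and results are pickled between processes, so the submitted function and its inputs must be picklable, and very cheap tasks can be dominated by that overhead.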

