Iterate over 36 million items in a list of tuples in Python efficiently and faster


Problem description

Firstly, before anyone marks it as a duplicate, please read below. I am unsure if the delay in the iteration is due to the huge size or my logic. I have a use case where I have to iterate over 36 million items in a list of tuples. My main requirement is speed and efficiency. Sample list:

[
    ('how are you', 'I am fine'),
    ('how are you', 'I am not fine'),
    ...36 million items...
]

What I have done so far:

from ast import literal_eval
from operator import itemgetter

import numpy as np
from nltk.tokenize import word_tokenize
from scipy.spatial import distance

for query_question in combined:
    # word_tokenize returns a list of tokens; str.format turns it into its
    # string representation, which literal_eval parses back below
    query = "{}".format(word_tokenize(query_question[0]))
    question = "{}".format(word_tokenize(query_question[1]))

    # a naive doc2vec extension of GLOVE word vectors: average the
    # vectors of all in-vocabulary tokens
    vec1 = np.mean([
        word_vector_dict[word]
        for word in literal_eval(query)
        if word in word_vector_dict
    ], axis=0)

    vec2 = np.mean([
        word_vector_dict[word]
        for word in literal_eval(question)
        if word in word_vector_dict
    ], axis=0)

    similarity_score = 1 - distance.cosine(vec1, vec2)
    # list.append mutates in place and returns None, so do not reassign
    store_question_score.append((query_question[1], similarity_score))
    count += 1

    if count == len(data_list):
        # list.sort also returns None; use sorted() to get a new list
        store_question_score_descending = sorted(
            store_question_score, key=itemgetter(1), reverse=True
        )
        result_dict[query_question[0]] = store_question_score_descending[:5]
        store_question_score = []
        count = 1

The above logic aims to calculate similarity scores between questions as part of a text similarity algorithm. I suspect the delay in the iteration comes from the calculation of vec1 and vec2. If so, how can I do this better? I am looking for ways to speed up the process.

There are plenty of other questions about iterating over huge lists, but I could not find one that solved my problem.

I really appreciate any help you can provide.

Recommended answer

Try caching:

from functools import lru_cache

# memoize on the tokenized-sentence string, so each unique sentence is
# vectorized only once; maxsize=None makes the cache unbounded
@lru_cache(maxsize=None)
def compute_vector(s):
    return np.mean([
        word_vector_dict[word]
        for word in literal_eval(s)
        if word in word_vector_dict
    ], axis=0)

Then use it instead:

vec1 = compute_vector(query)
vec2 = compute_vector(question)
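
The cache pays off because the 36 million pairs contain far fewer unique sentences (the answer below assumes about 370000 + 100 unique keys), so most calls return a memoized vector. As a quick sanity check, lru_cache exposes hit/miss counters; this is a standard functools feature, not part of the original answer:

print(compute_vector.cache_info())
# CacheInfo(hits=..., misses=..., maxsize=None, currsize=...)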


If the size of the vectors is fixed, you can do even better by caching to a numpy array of shape (num_unique_keys, len(vec1)), where in your case num_unique_keys = 370000 + 100:

class VectorCache:
    def __init__(self, func, num_keys, item_size):
        self.func = func
        # preallocate one row per unique key
        self.cache = np.empty((num_keys, item_size), dtype=float)
        self.keys = {}  # maps key -> row index in self.cache

    def __getitem__(self, key):
        if key in self.keys:
            return self.cache[self.keys[key]]
        self.keys[key] = len(self.keys)
        item = self.func(key)
        self.cache[self.keys[key]] = item
        return item


# same helper as before, but without lru_cache: VectorCache now does the memoization
def compute_vector(s):
    return np.mean([
        word_vector_dict[word]
        for word in literal_eval(s)
        if word in word_vector_dict
    ], axis=0)


# num_keys and item_size must be set for your data,
# e.g. num_keys = 370000 + 100 and item_size = len(vec1)
vector_cache = VectorCache(compute_vector, num_keys, item_size)

Then:

vec1 = vector_cache[query]
vec2 = vector_cache[question]
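
Compared with lru_cache, the preallocated array stores every vector contiguously in one numpy block instead of as many small Python objects. A minimal self-contained sketch of the lookup flow, using a toy 2-dimensional word_vector_dict with made-up values (illustrative only):

word_vector_dict = {
    'how': np.array([0.1, 0.2]),
    'are': np.array([0.3, 0.4]),
    'you': np.array([0.5, 0.6]),
}

vector_cache = VectorCache(compute_vector, num_keys=10, item_size=2)

# keys are the string form of a token list, as produced by
# "{}".format(word_tokenize(...)) in the question's loop
vec = vector_cache["['how', 'are', 'you']"]   # computed and stored
vec = vector_cache["['how', 'are', 'you']"]   # served from the cache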


Using a similar technique, you can also cache the cosine distances:

@lru_cache(maxsize=None)
def cosine_distance(query, question):
    # memoized on the (query, question) pair of strings
    return distance.cosine(vector_cache[query], vector_cache[question])
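
With both caches in place, the body of the question's loop reduces to something like the sketch below (variable names follow the original code):

for query_question in combined:
    query = "{}".format(word_tokenize(query_question[0]))
    question = "{}".format(word_tokenize(query_question[1]))
    similarity_score = 1 - cosine_distance(query, question)
    store_question_score.append((query_question[1], similarity_score))
    # ...the counting, sorting, and top-5 bookkeeping is unchanged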
