如何有效地计算两个字典的内积 [英] How to efficiently compute the inner product of two dictionaries

查看:119
本文介绍了如何有效地计算两个字典的内积的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我使用字典表示特征向量(为什么?因为我知道特征稀疏,但是稍后会更多).

Suppose I represent a feature vector using a dictionary (why? because I know the features are sparse, but, more on that later).

我应该如何实现两个这样的字典(表示为A,B)的内积

How should I implement the inner product of two such dictionaries (denoted, A, B)

我尝试过幼稚的方法:

for k in A:
  if k in B:
    sum += A[k] * B[k]

但事实证明它很慢.

更多详细信息:

  • 我之所以使用字典来表示要素,是因为

  • I'm using a dictionary to represent features because

  1. 功能键是字符串
  2. 大约有20K个密钥
  3. 每个向量都是稀疏的(例如,约有1000个非零元素).

  • 我真的很想计算N = 2000个不同字典(即它们的线性核)的成对内积.

  • I'm really interested in computing the pairwise inner product of N=2000 different dictionaries (that is, their linear kernel).

    推荐答案

    这是我的答案(根据@ valentin-clement的建议):

    Here's my answer (Following on a suggestion by @valentin-clement):

    首先,我包装了一个scipy.sparse dok_matrix. 这个想法是给每个可能的特征分配一个索引.

    First I wrap a scipy.sparse dok_matrix. The idea is to assign each of the possible features an index.

    import scipy.sparse as sps
    import numpy as np
    
    class MSK:
        # DD is a dict of dict, whose values are of type float.
        # features - the set of possible features keys
        def __init__(self, DD, features):
            self.features = {k: j for (j, k) in enumerate(features)}
            self.strings = DD.keys()
            n = len(self.strings)
            d = len(self.features)
            self.M = sps.dok_matrix((n, d), dtype=np.float64)
            for (i, s) in enumerate(self.strings):
                v = DD[s]
                for k in v:
                    j = self.features[k]
                    self.M[i, j] = v[k]
    

    我们使用以下代码进行测试,其中元素数为800,维数也为800,但稀疏度为200(恰好200个元素为非零).

    And we test using the following code, where the number of elements is 800, the dimensionality is also 800, but the sparsity is 200 (exactly 200 elements are non-zero).

    np.random.seed(1)
    N = 800
    DD = dict()
    R = range(N)
    for i in xrange(N):
        DD[i] = dict()
        S = np.random.permutation(R)
        S = S[:N/4]
        for j in S:
            DD[i][j] = np.random.randn(1)[0]
    
    K = MSK(DD, R)
    import cProfile
    cProfile.runctx("A = K.M * K.M.T", globals(), locals())
    print A.todense()
    

    输出为:

    2080520 function calls (2080519 primitive calls) in 2.884 seconds
    

    让我们说3秒. 我的幼稚实现(使用@Joowani的if语句)大约需要19秒.

    Lets say 3 seconds. My naive implementation (that uses @Joowani's if-statement) takes about 19 seconds.

    (MSK代表MatrixSparseKeys)

    (MSK stands for MatrixSparseKeys)

    这篇关于如何有效地计算两个字典的内积的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

  • 查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆