如何有效地计算两个字典的内积 [英] How to efficiently compute the inner product of two dictionaries
问题描述
假设我使用字典表示特征向量(为什么?因为我知道特征稀疏,但是稍后会更多).
Suppose I represent a feature vector using a dictionary (why? because I know the features are sparse, but, more on that later).
我应该如何实现两个这样的字典(表示为A,B)的内积
How should I implement the inner product of two such dictionaries (denoted, A, B)
我尝试过幼稚的方法:
for k in A:
if k in B:
sum += A[k] * B[k]
但事实证明它很慢.
更多详细信息:
-
我之所以使用字典来表示要素,是因为
I'm using a dictionary to represent features because
- 功能键是字符串
- 大约有20K个密钥
- 每个向量都是稀疏的(例如,约有1000个非零元素).
我真的很想计算N = 2000个不同字典(即它们的线性核)的成对内积.
I'm really interested in computing the pairwise inner product of N=2000 different dictionaries (that is, their linear kernel).
推荐答案
这是我的答案(根据@ valentin-clement的建议):
Here's my answer (Following on a suggestion by @valentin-clement):
首先,我包装了一个scipy.sparse dok_matrix. 这个想法是给每个可能的特征分配一个索引.
First I wrap a scipy.sparse dok_matrix. The idea is to assign each of the possible features an index.
import scipy.sparse as sps
import numpy as np
class MSK:
# DD is a dict of dict, whose values are of type float.
# features - the set of possible features keys
def __init__(self, DD, features):
self.features = {k: j for (j, k) in enumerate(features)}
self.strings = DD.keys()
n = len(self.strings)
d = len(self.features)
self.M = sps.dok_matrix((n, d), dtype=np.float64)
for (i, s) in enumerate(self.strings):
v = DD[s]
for k in v:
j = self.features[k]
self.M[i, j] = v[k]
我们使用以下代码进行测试,其中元素数为800,维数也为800,但稀疏度为200(恰好200个元素为非零).
And we test using the following code, where the number of elements is 800, the dimensionality is also 800, but the sparsity is 200 (exactly 200 elements are non-zero).
np.random.seed(1)
N = 800
DD = dict()
R = range(N)
for i in xrange(N):
DD[i] = dict()
S = np.random.permutation(R)
S = S[:N/4]
for j in S:
DD[i][j] = np.random.randn(1)[0]
K = MSK(DD, R)
import cProfile
cProfile.runctx("A = K.M * K.M.T", globals(), locals())
print A.todense()
输出为:
2080520 function calls (2080519 primitive calls) in 2.884 seconds
让我们说3秒. 我的幼稚实现(使用@Joowani的if语句)大约需要19秒.
Lets say 3 seconds. My naive implementation (that uses @Joowani's if-statement) takes about 19 seconds.
(MSK代表MatrixSparseKeys)
(MSK stands for MatrixSparseKeys)
这篇关于如何有效地计算两个字典的内积的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!