Calculating distance between word/document vectors from a nested dictionary
Problem description
I have a nested dictionary as such:
myDict = {'a': {1:2, 2:163, 3:12, 4:67, 5:84}, 'about': {1:27, 2:45, 3:21, 4:10, 5:15}, 'apple': {1:0, 2: 5, 3:0, 4:10, 5:0}, 'anticipate': {1:1, 2:5, 3:0, 4:8, 5:7}, 'an': {1:3, 2:15, 3:1, 4:312, 5:100}}
- The outer key is a word,
- the inner keys are file/document ids
- the values are the number of times the word (the outer key) occurs in that file/document
How do I calculate the sum of the squared values for each inner key? For example, for inner key `1`, I should get:
2^2 + 27^2 + 0^2 + 1^2 + 3^2
because, in the document with inner key `1`, 'a' appears 2 times, 'about' 27 times, 'apple' 0 times, 'anticipate' 1 time and 'an' 3 times.
Given the nested dictionary object, how do I find the distance between a pair of files/documents?
For example, the distance between file/document ids `1` and `2` would be calculated from:
doc1 = {'a':2, 'about':27, 'apple':0, 'anticipate':1, 'an':3} # (i.e. inner key `1`)
doc2 = {'a':163, 'about':45, 'apple':5, 'anticipate':5, 'an':15} # (i.e. inner key `2`)
I want to know how different/similar the documents are, so how do I get a single floating-point number as a distance score for the two documents?
How do I calculate the dot product across these two documents?
I've tried calculating a single value for each document by considering:
((2*0) + (27*0) + (3*1) + (1*1) + (0*1)) / (magnitude of file vector * magnitude of search phrase vector)
Using my code as such:
vecDist = {}
for word in search:
    for fileNum in myDict.iteritems():
        vecDist[fileNum] = "dotproduct" / magnitudeFileVec[fileNum] * magnitudeSearchVec
Firstly, your dictionary of dictionaries is a nice start for what you're doing, but it's too convoluted; try using a numpy array instead:
import numpy as np
vocabulary = ['a', 'about', 'apple', 'anticipate', 'an']
matrix = [[2,27, 0, 1, 3], [163, 45, 5, 5, 15], [12, 21, 0, 0, 1], [67, 10, 10, 8, 312], [84, 15, 0, 7, 100]]
matrix = np.array(matrix)
print(matrix)
[out]:
[[ 2 27 0 1 3]
[163 45 5 5 15]
[ 12 21 0 0 1]
[ 67 10 10 8 312]
[ 84 15 0 7 100]]
Now you can clearly see that your rows are documents and your columns are words, with each cell holding a word count.
To access the term/word vector (i.e. the column):
for i, term in enumerate(vocabulary):
    vector = matrix[:, i]  # column i holds the counts of this term in every document
    print(term, vector, vector.sum())
[out]:
a [ 2 163 12 67 84] 328
about [27 45 21 10 15] 118
apple [ 0 5 0 10 0] 15
anticipate [1 5 0 8 7] 21
an [ 3 15 1 312 100] 431
To access the document vector (i.e. the row):
for i, document in enumerate(matrix):
    print(i, document)
[out]:
0 [ 2 27 0 1 3]
1 [163 45 5 5 15]
2 [12 21 0 0 1]
3 [ 67 10 10 8 312]
4 [ 84 15 0 7 100]
To access an individual row:
doc1 = matrix[0, :]
doc2 = matrix[1, :]
print(doc1)
print(doc2)
[out]:
[ 2 27 0 1 3]
[163 45 5 5 15]
To calculate the sum of squared values in a vector:
print(np.sum(doc1 ** 2))
[out]:
743
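The same calculation extends to every document at once. This is a minimal sketch, assuming the matrix layout used above (rows are documents, columns are words); the name `sum_sq_per_doc` is only illustrative:

```python
import numpy as np

# Rows are documents, columns are word counts (same data as above).
matrix = np.array([[2, 27, 0, 1, 3],
                   [163, 45, 5, 5, 15],
                   [12, 21, 0, 0, 1],
                   [67, 10, 10, 8, 312],
                   [84, 15, 0, 7, 100]])

# Square every count, then sum along each row (axis=1): one value per document.
sum_sq_per_doc = (matrix ** 2).sum(axis=1)
print(sum_sq_per_doc)
```

The first entry is the 743 obtained for doc1 above.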
To calculate the dot product between two vectors, simply:
print(np.dot(doc1, doc2))
[out]:
1591
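To turn the dot product into the single floating-point score asked about in the question, one common choice is cosine similarity: the dot product divided by the product of the two vector magnitudes, which matches the formula attempted in the question. Euclidean distance is another option. The sketch below assumes the doc1 and doc2 vectors shown above; `cosine_similarity` is an illustrative helper name, not a numpy function:

```python
import numpy as np

doc1 = np.array([2, 27, 0, 1, 3])
doc2 = np.array([163, 45, 5, 5, 15])

def cosine_similarity(u, v):
    """Dot product divided by the product of the magnitudes.

    Assumes neither vector is all zeros (otherwise this divides by zero).
    """
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# For count vectors: 1.0 means same direction, 0.0 means no shared words.
similarity = cosine_similarity(doc1, doc2)
cosine_distance = 1.0 - similarity

# Straight-line distance between the two count vectors.
euclidean_distance = np.linalg.norm(doc1 - doc2)

print(similarity, cosine_distance, euclidean_distance)
```

Cosine similarity ignores document length, while Euclidean distance is sensitive to it, so which one fits depends on whether longer documents should count as "further away".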
If you're totally stuck with the nested dictionaries, here's how to convert it into a numpy array:
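import numpy as np

myDict = {'a': {1:2, 2:163, 3:12, 4:67, 5:84},
          'about': {1:27, 2:45, 3:21, 4:10, 5:15},
          'apple': {1:0, 2: 5, 3:0, 4:10, 5:0},
          'anticipate': {1:1, 2:5, 3:0, 4:8, 5:7},
          'an': {1:3, 2:15, 3:1, 4:312, 5:100}}

# Fix the word and document order explicitly rather than relying on dict iteration order.
# Assumes every word has an entry for every document id.
vocabulary = list(myDict)
doc_ids = sorted(myDict[vocabulary[0]])
# Build one row per word, one column per document id ...
matrix = np.array([[myDict[word][doc_id] for doc_id in doc_ids] for word in vocabulary])
# ... then transpose so that rows are documents and columns are words.
matrix = np.transpose(matrix)
print(matrix)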
[out]:
[[ 2 27 0 1 3]
[163 45 5 5 15]
[ 12 21 0 0 1]
[ 67 10 10 8 312]
[ 84 15 0 7 100]]
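If pulling in numpy is not an option, the sum of squares per document can also be computed straight from the nested dictionary. This is a minimal sketch, assuming the `myDict` layout above; `sum_sq` is an illustrative name:

```python
from collections import defaultdict

myDict = {'a': {1: 2, 2: 163, 3: 12, 4: 67, 5: 84},
          'about': {1: 27, 2: 45, 3: 21, 4: 10, 5: 15},
          'apple': {1: 0, 2: 5, 3: 0, 4: 10, 5: 0},
          'anticipate': {1: 1, 2: 5, 3: 0, 4: 8, 5: 7},
          'an': {1: 3, 2: 15, 3: 1, 4: 312, 5: 100}}

# Accumulate count**2 under each document id (the inner key).
sum_sq = defaultdict(int)
for word, counts in myDict.items():
    for doc_id, count in counts.items():
        sum_sq[doc_id] += count ** 2

print(dict(sum_sq))
```

The square root of each value is the magnitude of that document's vector, which is the denominator term needed for the cosine formula in the question.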