Calculating distance between word/document vectors from a nested dictionary

Views: 190 | Published: 2017/5/21 22:58:28 | Tags: python dictionary vector nlp text-processing

Problem description

I have a nested dictionary as such:

    myDict = {'a': {1: 2, 2: 163, 3: 12, 4: 67, 5: 84},
              'about': {1: 27, 2: 45, 3: 21, 4: 10, 5: 15},
              'apple': {1: 0, 2: 5, 3: 0, 4: 10, 5: 0},
              'anticipate': {1: 1, 2: 5, 3: 0, 4: 8, 5: 7},
              'an': {1: 3, 2: 15, 3: 1, 4: 312, 5: 100}}

The outer keys are words,

the inner keys are file/document ids,

and the values are the number of times the word (the outer key) occurs in that document.

How do I calculate the sum of the squared values for each inner key? For example, for inner key 1, I should get:

    2^2 + 27^2 + 0^2 + 1^2 + 3^2

because in document 1 the word 'a' appears 2 times, 'about' 27 times, 'apple' 0 times, 'anticipate' 1 time and 'an' 3 times.

Given the nested dictionary, how do I find the distance between a pair of files/documents?

For example, the distance between file/document ids 1 and 2 would be calculated from:

    doc1 = {'a': 2, 'about': 27, 'apple': 0, 'anticipate': 1, 'an': 3}     # (i.e. inner key `1`)
    doc2 = {'a': 163, 'about': 45, 'apple': 5, 'anticipate': 5, 'an': 15}  # (i.e. inner key `2`)

I want to know how different/similar the documents are, so how do I get a single floating-point number as a distance score for the two documents?

How do I calculate the dot product across these two documents?

I've tried calculating a single value for each document by considering:
((2*0) + (27*0) + (3*1) + (1*1) + (0*1)) / (magnitude of file vector * magnitude of search phrase vector)
Using my code as such:

    vecDist = {}
    for word in search:
        for fileNum in myDict.iteritems():
            vecDist[fileNum] = "dotproduct" / magnitudeFileVec[fileNum] * magnitudeSearchVec

Solution
Firstly, your dictionary of dictionaries is a nice start for what you're doing, but it's too convoluted. Try using a numpy array instead:

    import numpy as np

    vocabulary = ['a', 'about', 'apple', 'anticipate', 'an']

    # One row per document, one column per word in `vocabulary`
    matrix = [[2, 27, 0, 1, 3],
              [163, 45, 5, 5, 15],
              [12, 21, 0, 0, 1],
              [67, 10, 10, 8, 312],
              [84, 15, 0, 7, 100]]

    matrix = np.array(matrix)
    print(matrix)

[out]:

    [[  2  27   0   1   3]
     [163  45   5   5  15]
     [ 12  21   0   0   1]
     [ 67  10  10   8 312]
     [ 84  15   0   7 100]]

Now you can clearly see that your rows are documents and your columns are word counts.

To access the term/word vector (i.e. the column):

    for i, term in enumerate(vocabulary):
        vector = matrix[:, i]
        print(term, vector, vector.sum())

[out]:

    a [  2 163  12  67  84] 328
    about [27 45 21 10 15] 118
    apple [ 0  5  0 10  0] 15
    anticipate [1 5 0 8 7] 21
    an [  3  15   1 312 100] 431

To access the document vector (i.e. the row):

    for i, document in enumerate(matrix):
        print(i, document)

[out]:

    0 [ 2 27  0  1  3]
    1 [163  45   5   5  15]
    2 [12 21  0  0  1]
    3 [ 67  10  10   8 312]
    4 [ 84  15   0   7 100]

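The same row loop can be used to score every document against a search phrase, which is what the question's `vecDist` attempt seems to be aiming for. The following is only a sketch under stated assumptions: the search phrase is taken to contain 'apple', 'anticipate' and 'an' once each (read off the 0/1 factors in the question's formula), and `search_counts`, `search_vec`, `cosine_similarity` and `vec_dist` are illustrative names that do not appear in the original post. Note that the denominator is the product of the two magnitudes, so it needs parentheses.

```python
import numpy as np

vocabulary = ['a', 'about', 'apple', 'anticipate', 'an']
matrix = np.array([[2, 27, 0, 1, 3],
                   [163, 45, 5, 5, 15],
                   [12, 21, 0, 0, 1],
                   [67, 10, 10, 8, 312],
                   [84, 15, 0, 7, 100]])

# Assumption: the search phrase holds these three words once each.
search_counts = {'apple': 1, 'anticipate': 1, 'an': 1}

# Lay the search phrase out in the same column order as the matrix.
search_vec = np.array([search_counts.get(term, 0) for term in vocabulary])


def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector magnitudes."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # An all-zero vector has no direction; report 0.0 instead of dividing by zero.
    return float(np.dot(u, v) / denom) if denom else 0.0


# Document ids in the question start at 1, row indices at 0.
vec_dist = {}
for i, document in enumerate(matrix):
    vec_dist[i + 1] = cosine_similarity(document, search_vec)

for doc_id, score in sorted(vec_dist.items()):
    print(doc_id, round(score, 4))
```

A score closer to 1 means the document's word counts point in nearly the same direction as the search phrase; a score of 0 means they share no words.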
To access an individual row:

    doc1 = matrix[0, :]
    doc2 = matrix[1, :]

    print(doc1)
    print(doc2)

[out]:

    [ 2 27  0  1  3]
    [163  45   5   5  15]

To calculate the sum of the squared values in a vector:

    print(np.sum(doc1**2))

[out]:

    743

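If the sum of squares is needed for every document rather than just `doc1`, it can also be computed straight from the nested dictionary, without building the matrix first. This is a minimal sketch; `doc_ids` and `sum_squares` are names introduced here for illustration, and a word with no entry for a document is assumed to count as 0.

```python
myDict = {'a': {1: 2, 2: 163, 3: 12, 4: 67, 5: 84},
          'about': {1: 27, 2: 45, 3: 21, 4: 10, 5: 15},
          'apple': {1: 0, 2: 5, 3: 0, 4: 10, 5: 0},
          'anticipate': {1: 1, 2: 5, 3: 0, 4: 8, 5: 7},
          'an': {1: 3, 2: 15, 3: 1, 4: 312, 5: 100}}

# Collect every document id that appears under any word.
doc_ids = sorted({doc_id for counts in myDict.values() for doc_id in counts})

# For each document, square each word's count and add the squares up.
# Assumption: a missing (word, document) entry means a count of 0.
sum_squares = {
    doc_id: sum(counts.get(doc_id, 0) ** 2 for counts in myDict.values())
    for doc_id in doc_ids
}

print(sum_squares[1])  # 2^2 + 27^2 + 0^2 + 1^2 + 3^2 = 743
```

With the numpy matrix from the answer, the equivalent one-liner would be `(matrix ** 2).sum(axis=1)`, which returns one sum per row.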
To calculate the dot product between two vectors, simply:

    print(np.dot(doc1, doc2))

[out]:

    1591

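The dot product and the sum of squares are the two ingredients needed for a single floating-point score between two documents. The sketch below shows two common choices, cosine similarity (with cosine distance as one minus it) and Euclidean distance. Which one suits the data is a judgment call that the original answer does not make, and the variable names are illustrative.

```python
import numpy as np

doc1 = np.array([2, 27, 0, 1, 3])
doc2 = np.array([163, 45, 5, 5, 15])

# Cosine similarity: dot product divided by the product of the magnitudes.
# np.linalg.norm(v) is the square root of the sum of squares of v.
dot = np.dot(doc1, doc2)
cosine_sim = dot / (np.linalg.norm(doc1) * np.linalg.norm(doc2))

# Cosine distance: 0 means same direction, larger means more different.
cosine_dist = 1.0 - cosine_sim

# Euclidean distance: length of the difference vector. Unlike cosine
# distance, it is sensitive to document length (total word count).
euclidean_dist = np.linalg.norm(doc1 - doc2)

print(cosine_sim, cosine_dist, euclidean_dist)
```

Cosine-based scores compare the proportions of words and ignore how long each document is, which is often preferable when documents differ a lot in size, as documents 1 and 2 do here.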
If you're totally stuck with the nested dictionaries, here's how to convert them into a numpy array:

    import numpy as np

    myDict = {'a': {1: 2, 2: 163, 3: 12, 4: 67, 5: 84},
              'about': {1: 27, 2: 45, 3: 21, 4: 10, 5: 15},
              'apple': {1: 0, 2: 5, 3: 0, 4: 10, 5: 0},
              'anticipate': {1: 1, 2: 5, 3: 0, 4: 8, 5: 7},
              'an': {1: 3, 2: 15, 3: 1, 4: 312, 5: 100}}

    # Column labels, in the dictionary's iteration order
    vocabulary = list(myDict.keys())

    # One row per word first, then transpose so that rows are documents
    matrix = [[myDict[i][j] for j in myDict[i]] for i in myDict]
    matrix = np.array(matrix)
    matrix = np.transpose(matrix)
    print(matrix)

[out]:

    [[  2  27   0   1   3]
     [163  45   5   5  15]
     [ 12  21   0   0   1]
     [ 67  10  10   8 312]
     [ 84  15   0   7 100]]
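The conversion above relies on the dictionary's iteration order, which the language guarantees to be insertion order only from Python 3.7 onward, and on every word having an entry for every document. A more defensive variant, sketched here with illustrative names (`doc_ids` is not in the original answer), sorts both axes explicitly and assumes a missing entry means a count of 0.

```python
import numpy as np

myDict = {'a': {1: 2, 2: 163, 3: 12, 4: 67, 5: 84},
          'about': {1: 27, 2: 45, 3: 21, 4: 10, 5: 15},
          'apple': {1: 0, 2: 5, 3: 0, 4: 10, 5: 0},
          'anticipate': {1: 1, 2: 5, 3: 0, 4: 8, 5: 7},
          'an': {1: 3, 2: 15, 3: 1, 4: 312, 5: 100}}

# Fix the column order (words) and row order (document ids) explicitly.
vocabulary = sorted(myDict)
doc_ids = sorted({doc_id for counts in myDict.values() for doc_id in counts})

# Build rows as documents directly, so no transpose is needed.
# Assumption: a missing (word, document) entry means a count of 0.
matrix = np.array([[myDict[term].get(doc_id, 0) for term in vocabulary]
                   for doc_id in doc_ids])

print(vocabulary)
print(matrix)
```

Because the vocabulary is sorted alphabetically here, the columns come out as 'a', 'about', 'an', 'anticipate', 'apple', so they are in a different order from the hand-written matrix earlier in the answer. Distances and dot products between rows are unaffected by column order as long as every row uses the same one.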
