Calculating distance between word/document vectors from a nested dictionary


Problem description

I have a nested dictionary as such:

myDict = {'a': {1:2, 2:163, 3:12, 4:67, 5:84}, 
          'about': {1:27, 2:45, 3:21, 4:10, 5:15}, 
          'apple': {1:0, 2: 5, 3:0, 4:10, 5:0}, 
          'anticipate': {1:1, 2:5, 3:0, 4:8, 5:7}, 
          'an': {1:3, 2:15, 3:1, 4:312, 5:100}}

  • The outer key is a word,
  • the inner keys are file/document ids
  • the values are the number of times the word (outer key occurs)

How do I calculate the sum of the squared values for a given inner key? For example, for inner key 1, I should get:

2^2 + 27^2 + 0^2 + 1^2 + 3^2

because document 1 contains 'a' 2 times, 'about' 27 times, 'apple' 0 times, 'anticipate' 1 time and 'an' 3 times.


Given the nested dictionary object how do I find the distance between a pair of files/documents?

For example, the distance between file/document ids 1 and 2 would be calculated from:

doc1 =  {'a':2, 'about':27, 'apple':0, 'anticipate':1, 'an':3} # (i.e. inner key `1`)
doc2 =  {'a':163, 'about':45, 'apple':5, 'anticipate':5, 'an':15} # (i.e. inner key `2`)

I want to know how different/similar the documents are, so how do I get a single floating number as a distance score for the two documents?

How do I calculate the dot product across these two documents?

I've tried calculating a single value for each document by considering:

((2*0) + (27*0) + (3*1) + (1*1) + (0*1)) / (magnitude of file vector * magnitude of search phrase vector)

Using my code as such:

vecDist = {}
for word in search:
    for fileNum in myDict.iteritems():
        vecDist[fileNum] = "dotproduct" / magnitudeFileVec[fileNum] * magnitudeSearchVec

Solution

Firstly, your dictionary of dictionaries is a nice start for what you're doing, but it's too convoluted. Try using a numpy array instead:

import numpy as np

vocabulary = ['a', 'about', 'apple', 'anticipate', 'an']
matrix = [[2,27, 0, 1, 3], [163, 45, 5, 5, 15], [12, 21, 0, 0, 1], [67, 10, 10, 8, 312], [84, 15, 0, 7, 100]]

matrix = np.array(matrix)

print(matrix)

[out]:

[[  2  27   0   1   3]
 [163  45   5   5  15]
 [ 12  21   0   0   1]
 [ 67  10  10   8 312]
 [ 84  15   0   7 100]]

Now you can clearly see that your rows are documents and your columns are word counts.

To access the term/word vector (i.e. the column):

for i, term in enumerate(vocabulary):
    vector = matrix[:,i]
    print(term, vector, vector.sum())

[out]:

a [  2 163  12  67  84] 328
about [27 45 21 10 15] 118
apple [ 0  5  0 10  0] 15
anticipate [1 5 0 8 7] 21
an [  3  15   1 312 100] 431

To access the document vector (i.e. the row):

for i, document in enumerate(matrix):
    print(i, document)

[out]:

0 [ 2 27  0  1  3]
1 [163  45   5   5  15]
2 [12 21  0  0  1]
3 [ 67  10  10   8 312]
4 [ 84  15   0   7 100]

To access an individual row:

doc1 = matrix[0,:]
doc2 = matrix[1,:]

print(doc1)
print(doc2)

[out]:

[ 2 27  0  1  3]
[163  45   5   5  15]

To calculate the sum of the squared values in a vector:

print(np.sum(doc1**2))

[out]:

743
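
The same idea extends to every document at once. The following is a minimal sketch, assuming the `matrix` layout used above (rows are documents, columns are words); the names `sum_squares` and `magnitudes` are illustrative and not part of the original answer.

```python
import numpy as np

# Same document-by-word matrix as above: one row per document.
matrix = np.array([[2, 27, 0, 1, 3],
                   [163, 45, 5, 5, 15],
                   [12, 21, 0, 0, 1],
                   [67, 10, 10, 8, 312],
                   [84, 15, 0, 7, 100]])

# Square element-wise, then sum along each row (axis=1) to get one value per document.
sum_squares = (matrix ** 2).sum(axis=1)

# The magnitude (Euclidean length) of each document vector is the square root of that sum.
magnitudes = np.sqrt(sum_squares)

print(sum_squares)
print(magnitudes)
```

The first entry of `sum_squares` is the 743 computed for `doc1` above.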

To calculate the dot product between two vectors, simply:

print(np.dot(doc1, doc2))

[out]:

1591
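
To turn the dot product into the single floating-point score the question asks for, one common choice is cosine similarity: the dot product divided by the product of the two magnitudes. The sketch below assumes that is the measure wanted; the helper name `cosine_similarity` and the convention of returning 0.0 for an all-zero vector are illustrative choices, not something the original answer specifies. Note that the whole product of magnitudes has to sit in the denominator, so it needs parentheses, unlike the `dotproduct / magnitudeFileVec[fileNum] * magnitudeSearchVec` expression in the question.

```python
import numpy as np

doc1 = np.array([2, 27, 0, 1, 3])
doc2 = np.array([163, 45, 5, 5, 15])

def cosine_similarity(u, v):
    """Return dot(u, v) / (|u| * |v|), or 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

similarity = cosine_similarity(doc1, doc2)  # 1.0 means same direction, 0.0 means no shared words
distance = 1.0 - similarity                 # a simple "cosine distance"

print(similarity, distance)
```

Because word counts are never negative, the similarity here falls between 0 and 1, so `1.0 - similarity` can be read as a distance score in the same range.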

If you're totally stuck with the nested dictionaries, here's how to convert them into a numpy array:

import numpy as np

myDict = {'a': {1:2, 2:163, 3:12, 4:67, 5:84}, 
          'about': {1:27, 2:45, 3:21, 4:10, 5:15}, 
          'apple': {1:0, 2: 5, 3:0, 4:10, 5:0}, 
          'anticipate': {1:1, 2:5, 3:0, 4:8, 5:7}, 
          'an': {1:3, 2:15, 3:1, 4:312, 5:100}}

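vocabulary = list(myDict)  # word order follows dict insertion order (Python 3.7+)
# One list per word, with counts ordered by document id.
matrix = [[myDict[word][doc_id] for doc_id in sorted(myDict[word])] for word in vocabulary]
matrix = np.array(matrix)
matrix = np.transpose(matrix)  # rows become documents, columns become words

print(matrix)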

[out]:

[[  2  27   0   1   3]
 [163  45   5   5  15]
 [ 12  21   0   0   1]
 [ 67  10  10   8 312]
 [ 84  15   0   7 100]]
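
Putting the pieces together, here is a hedged sketch that goes from the nested dictionary to a similarity score for every pair of documents. It assumes every word lists the same document ids and that no document is all zeros (otherwise the division would produce NaN); `doc_ids`, `norms` and `similarity` are names introduced for illustration.

```python
import numpy as np

myDict = {'a': {1: 2, 2: 163, 3: 12, 4: 67, 5: 84},
          'about': {1: 27, 2: 45, 3: 21, 4: 10, 5: 15},
          'apple': {1: 0, 2: 5, 3: 0, 4: 10, 5: 0},
          'anticipate': {1: 1, 2: 5, 3: 0, 4: 8, 5: 7},
          'an': {1: 3, 2: 15, 3: 1, 4: 312, 5: 100}}

vocabulary = list(myDict)
# Assumption: all words share the same document ids, so take them from the first word.
doc_ids = sorted(next(iter(myDict.values())))

# Build the document-by-word matrix directly: one row per document id.
matrix = np.array([[myDict[word][doc_id] for word in vocabulary]
                   for doc_id in doc_ids])

# Pairwise dot products divided by the pairwise products of magnitudes.
norms = np.linalg.norm(matrix, axis=1)
similarity = (matrix @ matrix.T) / np.outer(norms, norms)

print(np.round(similarity, 3))
```

Entry `similarity[i, j]` is the cosine similarity between documents `doc_ids[i]` and `doc_ids[j]`; the diagonal is 1 because every document is identical to itself.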
