How to set values for calculating Euclidean distance and correlation


Question

This is my word vector:

google
test
stackoverflow
yahoo

I have assigned a value to each of these words as follows:

google : 1
test : 2
stackoverflow : 3
yahoo : 4

Here are some sample users and their words:

user1   google, test, stackoverflow
user2   test, google
user3   test, yahoo
user4   stackoverflow, yahoo
user5   stackoverflow, google
user6

To cater for users that do not have a value contained in the word vector, I assign '0'.

Based on this, the users correspond to:

user1   1, 2, 3
user2   2, 1, 0
user3   2, 4, 0
user4   3, 4, 0
user5   3, 1, 0
user6   0, 0, 0

I am unsure whether these are the correct values, or even whether this is the correct approach for assigning a value to each word so that I can apply 'Euclidean distance' and 'correlation'. I'm basing this on a snippet from the book 'Programming Collective Intelligence':

"Collecting Preferences: The first thing you need is a way to represent different people and their preferences. If you were building a shopping site, you might use a value of 1 to indicate that someone had bought an item in the past and a value of 0 to indicate that they had not."

For my dataset I do not have preference values, so I am just using a unique numerical value to represent whether a user has a given word from the word vector.

Are these the correct values to set for my word vector? How should I determine what these values should be?

Recommended answer

To make distance and similarity metrics work out, you need one column per word in the vocabulary, then fill those columns with the boolean values zero and one according to whether the corresponding word occurs in each sample. E.g.

                                 G   T   SO  Y!
google, test, stackoverflow  =>  1,  1,  1,  0
test, google                 =>  1,  1,  0,  0
stackoverflow, yahoo         =>  0,  0,  1,  1
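
As a minimal sketch of this encoding in Python (the names `vocabulary` and `encode` are illustrative and not part of the original answer), each sample becomes a list with one 0/1 entry per vocabulary word:

```python
# The vocabulary order fixes the column order: G, T, SO, Y!
vocabulary = ["google", "test", "stackoverflow", "yahoo"]


def encode(words, vocab=vocabulary):
    """Return one 0/1 entry per vocabulary word (1 if the word occurs)."""
    present = set(words)
    return [1 if word in present else 0 for word in vocab]


samples = [
    ["google", "test", "stackoverflow"],
    ["test", "google"],
    ["stackoverflow", "yahoo"],
]

vectors = [encode(sample) for sample in samples]
for sample, vector in zip(samples, vectors):
    print(", ".join(sample), "=>", vector)
```

With this representation a user with no words, such as user6 in the question, is encoded as all zeros, and the order in which a user's words are listed no longer affects the vector.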

The squared Euclidean distance between the first two vectors is now

(1 - 1)² + (1 - 1)² + (1 - 0)² + (0 - 0)² = 1

which makes intuitive sense, as the vectors differ in exactly one position. Similarly, the squared distance between the final two vectors is four, which is the maximal squared distance in this space.
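
The two distances above can be checked with a short Python sketch. The helper names `squared_euclidean` and `pearson` are illustrative, and the second helper assumes that the 'correlation' in the question means the Pearson correlation coefficient, which the answer itself does not compute:

```python
from math import sqrt


def squared_euclidean(a, b):
    """Sum of squared coordinate differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pearson(a, b):
    """Pearson correlation (assumed meaning of 'correlation' in the question).

    Returns None when either vector is constant, since the coefficient
    is undefined in that case.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return None
    return cov / sqrt(var_a * var_b)


first = [1, 1, 1, 0]   # google, test, stackoverflow
second = [1, 1, 0, 0]  # test, google
third = [0, 0, 1, 1]   # stackoverflow, yahoo

print(squared_euclidean(first, second))  # 1
print(squared_euclidean(second, third))  # 4
print(pearson(second, third))            # -1.0
```

Note that an all-zero vector (a user with no words) is constant, so this sketch returns None for its correlation rather than a number; how to treat such users is a design choice left open here.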

This encoding is an extension of the "one-hot" or "one-of-K" coding, and it's a staple of machine learning on text (although few textbooks care to spell it out).
