How to compute the similarity between lists of features?

Question

I have users and resources. Each resource is described by a set of features, and each user is related to a different set of resources. In my particular case, the resources are web pages, and the features are information about the location of the visit, the time of the visit, the number of visits, etc., which are tied to a specific user each time.

I want to get a similarity measure between my users regarding those features but I can't find a way to aggregate the resource features together. I've done it with text features, as it is possible to add the documents together and then extract features (say TF-IDF), but I don't know how to proceed with this configuration.

To be as clear as possible, here is what I have:

>>> len(user_features)
13 # that's my number of users
>>> user_features[0].shape
(2374, 17) # 2374 documents for this user, and 17 features

I'm able to get a similarity matrix of the documents using euclidean distances for instance:

>>> euclidean_distance(user_features[0], user_features[0])

But I don't know how to compare the users against each other. I should somehow aggregate the features together to end up with an N_Users x N_Features matrix, but I don't know how.

Any hints on how to proceed?

Some more information about the features I'm using:

The features I have here are not completely fixed. What I've got so far is 13 different features, already aggregated from "views". What I have is the standard deviation, mean, etc. for each of the views, in order to have something "flat" that can be compared. One of the features I have is: has the location changed since the last view? And what about one hour ago? Two hours ago?

Recommended answer

If each user is represented as a set of document-interaction vectors, you can define the similarity of a pair of users as the similarity of the pair of document-interaction vector sets that represent the users.

You say you can get a similarity matrix of the documents. Then assume that user U1 visited documents D1, D2, D3, and user U2 visited documents D1, D3, D4. You would have two sets of vectors: S1 = {U1(D1), U1(D2), U1(D3)} for user U1 and S2 = {U2(D1), U2(D3), U2(D4)} for user U2. Note that because each user's interaction with a document is different, the vectors are indexed by both user and document. If I understand correctly, the elements of these sets should correspond to the respective rows in the matrix of each user.

The similarity between these two sets can be computed in many different ways. One option is the average pair-wise similarity: You iterate over all pairings of the elements from each set, compute the document similarity of the pair, and average over all pairs.
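
As a rough illustration of the average pair-wise approach, here is a minimal sketch using NumPy. It assumes `user_features` is a list of 2-D arrays (one per user, rows are document-interaction vectors), as in the question. The helper names (`pairwise_euclidean`, `average_pairwise_similarity`, `user_similarity_matrix`) and the distance-to-similarity mapping `1 / (1 + d)` are illustrative choices, not something specified in the question or the answer, and the data is random.

```python
import numpy as np


def pairwise_euclidean(a, b):
    """Return the (len(a), len(b)) matrix of Euclidean distances between rows."""
    # Broadcasting builds every row-to-row difference at once.
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def average_pairwise_similarity(a, b):
    """Average similarity over all pairings of a row of `a` with a row of `b`."""
    distances = pairwise_euclidean(a, b)
    # Assumption: turn a distance into a similarity in (0, 1] via 1 / (1 + d).
    similarities = 1.0 / (1.0 + distances)
    return similarities.mean()


def user_similarity_matrix(user_features):
    """Build the symmetric N_Users x N_Users similarity matrix."""
    n_users = len(user_features)
    sim = np.zeros((n_users, n_users))
    for i in range(n_users):
        for j in range(i, n_users):
            value = average_pairwise_similarity(user_features[i], user_features[j])
            sim[i, j] = sim[j, i] = value
    return sim


# Hypothetical data: 3 users with 5, 8 and 3 documents, 17 features each.
rng = np.random.default_rng(0)
user_features = [rng.normal(size=(n_docs, 17)) for n_docs in (5, 8, 3)]

similarity = user_similarity_matrix(user_features)
print(similarity.shape)  # (3, 3)
```

Note that the diagonal is generally below 1 with this definition, since a user's own documents are also compared against each other. With thousands of documents per user, the broadcast intermediate array becomes large, so computing the distances in chunks or with a library routine would probably be preferable. Features on different scales may also need standardizing before Euclidean distances are meaningful.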
