Computing similarity between two lists

Question

EDIT: As everyone is getting confused, I want to simplify my question. I have two ordered lists, and I just want to compute how similar one list is to the other.

For example:

1,7,4,5,8,9
1,7,5,4,9,6

What is a good measure of similarity between these two lists that takes order into account? For example, should the similarity be penalized because 4 and 5 are swapped between the two lists?

I have 2 systems: one state of the art system and one system that I implemented. Given a query, both systems return a ranked list of documents. I want to compare the similarity between my system and the "state of the art system" in order to measure the correctness of my system. Please note that the order of documents is important, as we are talking about a ranked system. Does anyone know of any measures that can help me find the similarity between these two lists?

Recommended answer

DCG [Discounted Cumulative Gain] and nDCG [normalized DCG] (http://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG) are usually good measures for ranked lists.

It gives the full gain for a relevant document if it is ranked first, and the gain decreases the further down the list the document appears.
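For reference, one common form of DCG over the first p results can be written as follows. It appears to be the form used here, since it reproduces the numbers in the example in this answer. The symbols p (list length) and rel_i (relevance of the document at rank i, counting from 1) are notation chosen for this sketch.

```latex
\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i + 1)}
```

The document at rank 1 is divided by \log_2 2 = 1 and so keeps its full gain, while every later rank is divided by a larger value.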

Using DCG/nDCG to evaluate a system against the state of the art baseline:

Note: If you treat all results returned by the "state of the art system" as relevant, then your system matches the state of the art when both receive the same score under DCG/nDCG.

Thus, a possible evaluation could be: DCG(your_system)/DCG(state_of_the_art_system)
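The ratio above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: it assumes the common DCG form (relevance divided by log2 of rank + 1, ranks starting at 1) and binary relevance, where every document returned by the reference system counts as relevant. The names `dcg` and `similarity_binary` are hypothetical helpers introduced for illustration.

```python
import math

def dcg(ranking, relevance):
    """Sum of relevance / log2(rank + 1), with ranks starting at 1."""
    return sum(relevance.get(doc, 0.0) / math.log2(rank + 1)
               for rank, doc in enumerate(ranking, start=1))

def similarity_binary(my_system, state_of_the_art):
    # Assumption: every document in the reference list is relevant (1.0);
    # any document not in that list gets relevance 0.0.
    relevance = {doc: 1.0 for doc in state_of_the_art}
    return dcg(my_system, relevance) / dcg(state_of_the_art, relevance)

state_of_the_art = [1, 2, 4, 5, 6, 9]
print(similarity_binary([1, 2, 4, 5, 6, 9], state_of_the_art))  # identical lists
print(similarity_binary([1, 2, 5, 4, 6, 9], state_of_the_art))  # 4 and 5 swapped
print(similarity_binary([1, 2, 5, 4, 6, 7], state_of_the_art))  # swap, and 9 replaced by 7
```

With binary relevance, swapping two relevant documents leaves the score unchanged. Only a missing or displaced relevant document lowers it.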

To further enhance it, you can give each document a relevance grade [relevance will no longer be binary], determined by how the document was ranked by the state of the art system. For example, rel_i = 1/log2(1+i) for the document at rank i (starting at 1) in the state of the art list. The base-2 logarithm is the one that matches the scores in the example below.

With this evaluation function, the closer the resulting value is to 1, the more similar your system is to the baseline.

Example:

mySystem = [1,2,5,4,6,7]
stateOfTheArt = [1,2,4,5,6,9]

First, give a score to each document according to its rank in the state of the art system [using the formula above]:

doc1 = 1.0
doc2 = 0.6309297535714574
doc3 = 0.0
doc4 = 0.5
doc5 = 0.43067655807339306
doc6 = 0.38685280723454163
doc7 = 0
doc8 = 0
doc9 = 0.3562071871080222

Now calculate DCG(stateOfTheArt) using the relevance scores above [note that relevance is not binary here] and get DCG(stateOfTheArt) = 2.1100933062283396.
Next, calculate it for your system using the same relevance weights and get DCG(mySystem) = 1.9784040064803783.

Thus, the evaluation is DCG(mySystem)/DCG(stateOfTheArt) = 1.9784040064803783 / 2.1100933062283396 = 0.9375907693942939
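The worked example can be reproduced with a short Python sketch. It assumes the graded relevance rel_i = 1/log2(1+i) taken from the reference ranking, and DCG as the sum of relevance / log2(1 + rank). The function names `graded_relevance` and `dcg` are hypothetical helpers chosen for illustration.

```python
import math

def graded_relevance(reference):
    # rel_i = 1 / log2(1 + i), where i is the 1-based rank in the reference list.
    # Documents that do not appear in the reference list get 0 (see dcg below).
    return {doc: 1.0 / math.log2(1 + i)
            for i, doc in enumerate(reference, start=1)}

def dcg(ranking, relevance):
    # Each document's relevance is discounted by log2(1 + its rank in `ranking`).
    return sum(relevance.get(doc, 0.0) / math.log2(1 + i)
               for i, doc in enumerate(ranking, start=1))

my_system = [1, 2, 5, 4, 6, 7]
state_of_the_art = [1, 2, 4, 5, 6, 9]

relevance = graded_relevance(state_of_the_art)
dcg_ref = dcg(state_of_the_art, relevance)
dcg_mine = dcg(my_system, relevance)
score = dcg_mine / dcg_ref

print(dcg_ref, dcg_mine, score)
```

Run as written, this should print values of approximately 2.1101, 1.9784 and 0.9376, in line with the figures in the example. Unlike binary relevance, the graded variant does penalize the swap of documents 4 and 5, because each document's gain depends on where the reference system ranked it.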
