Pairwise comparisons within a dataset


Problem description

My data is 18 vectors, each with up to 200 numbers, although some have only 5 or so. They are organised as:

[2, 3, 35, 63, 64, 298, 523, 624, 625, 626, 823, 824]
[2, 752, 753, 808, 843]
[2, 752, 753, 843]
[2, 752, 753, 808, 843]
[3, 36, 37, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, ...]

I would like to find the pair that is the most similar in this group of lists. The numbers themselves are not important; they may as well be strings. A 2 in one list and a 3 in another list are not comparable.

I am checking whether the elements are the same. For example, the second list is exactly the same as the 4th list, but differs from list 3 by only one element.

Additionally, it would be nice to find the most similar triplet, or the n lists that are the most similar, but pairwise is the first and most important task.

I hope I have laid out this problem clearly enough, but I am very happy to supply any more information that anyone might need!

I have a feeling it involves numpy or scipy norm/cosine calculations, but I can't quite work out how to do it, or whether this is the best method.

Any help would be greatly appreciated!

Recommended answer

You can use itertools to generate your pairwise comparisons. If you just want the items which are shared between two lists, you can use a set intersection. Using your example:

import itertools

a = [2, 3, 35, 63, 64, 298, 523, 624, 625, 626, 823, 824]
b = [2, 752, 753, 808, 843]
c = [2, 752, 753, 843]
d = [2, 752, 753, 808, 843]
e = [3, 36, 37, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112]

data = [a, b, c, d, e]

def number_same(x, y):
    # Return the set of items that appear in both lists
    return set(x).intersection(y)

# combinations() yields each unordered pair of indexes exactly once,
# and range(len(data)) covers every list, including the last one
for i, j in itertools.combinations(range(len(data)), 2):
    print("Indexes:", (i, j), len(number_same(data[i], data[j])))

Output:

Indexes: (0, 1) 1
Indexes: (0, 2) 1
Indexes: (0, 3) 1
Indexes: (0, 4) 1
Indexes: (1, 2) 4
Indexes: (1, 3) 5
Indexes: (1, 4) 0
Indexes: (2, 3) 4
Indexes: (2, 4) 0
Indexes: (3, 4) 0

This gives the number of items shared between each pair of lists. You could use that count to decide which two lists are the best pair.
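A raw shared-item count does not penalise items that appear in only one of the lists, so long lists can score well even when they differ a lot. One possible refinement, which is an assumption on my part and not something the question specifies, is the Jaccard index: the size of the intersection divided by the size of the union. The sketch below uses hypothetical helper names (`jaccard`, `most_similar`) and also covers the triplet or n-list case by taking the intersection and union across all n lists.

```python
from itertools import combinations

a = [2, 3, 35, 63, 64, 298, 523, 624, 625, 626, 823, 824]
b = [2, 752, 753, 808, 843]
c = [2, 752, 753, 843]
d = [2, 752, 753, 808, 843]
e = [3, 36, 37] + list(range(74, 113))  # same values as list e in the example
data = [a, b, c, d, e]

def jaccard(*groups):
    """Items common to every list divided by items found in any list."""
    sets = [set(g) for g in groups]
    union = set().union(*sets)
    if not union:
        return 1.0  # assumption: lists that are all empty count as identical
    return len(set.intersection(*sets)) / len(union)

def most_similar(data, n=2):
    """Return (score, indexes) for the n lists with the highest score."""
    scored = (
        (jaccard(*(data[i] for i in idx)), idx)
        for idx in combinations(range(len(data)), n)
    )
    # On a tie, max() with a key keeps the first combination it met
    return max(scored, key=lambda item: item[0])

print(most_similar(data))       # (1.0, (1, 3))
print(most_similar(data, n=3))  # (0.8, (1, 2, 3))
```

A score of 1.0 means the lists contain exactly the same items and 0.0 means they share nothing. Checking every combination is cheap for 18 lists, but the number of combinations grows quickly as n increases.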
