Compute the similarity between two lists

Question

I'd like to compute the similarity between two lists of different lengths.

For example:

listA = ['apple', 'orange', 'apple', 'apple', 'banana', 'orange'] # (length = 6)
listB = ['apple', 'orange', 'grapefruit', 'apple'] # (length = 4)

As you can see, a single item can appear multiple times in a list, and the lists have different lengths.

I've already thought of comparing the frequencies of each item, but that does not account for the size of each list (a list that is simply another list repeated twice should be similar, but not perfectly similar).

Example 2:

listA = ['apple', 'apple', 'orange', 'orange']
listB = ['apple', 'orange']
similarity(listA, listB) # should NOT equal 1

So I basically want to account for both the size of the lists and the distribution of items in each list.

Any ideas?

Recommended answer

Use collections.Counter() (http://docs.python.org/2/library/collections.html#collections.Counter) perhaps; those are multisets, or bags, in datatype parlance:

from collections import Counter

counterA = Counter(listA)
counterB = Counter(listB)

Now you can compare these by entries or frequencies:

>>> counterA
Counter({'apple': 3, 'orange': 2, 'banana': 1})
>>> counterB
Counter({'apple': 2, 'orange': 1, 'grapefruit': 1})
>>> counterA - counterB
Counter({'orange': 1, 'apple': 1, 'banana': 1})
>>> counterB - counterA
Counter({'grapefruit': 1})

You can calculate their cosine similarity using:

import math

def counter_cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)

which gives:

>>> counter_cosine_similarity(counterA, counterB)
0.8728715609439696

The closer that value is to 1, the more similar the two lists are.
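
To make the number above easier to check, here is a minimal sketch that redoes the arithmetic by hand. It assumes the term order apple, orange, banana, grapefruit; the names vec_a and vec_b are illustrative and not part of the original answer.

```python
import math

# Counts per term, in the assumed order: apple, orange, banana, grapefruit.
vec_a = [3, 2, 1, 0]  # listA
vec_b = [2, 1, 0, 1]  # listB

# Dot product: only terms present in both lists contribute (3*2 + 2*1 = 8).
dot = sum(a * b for a, b in zip(vec_a, vec_b))

# Euclidean magnitudes: sqrt(9 + 4 + 1) and sqrt(4 + 1 + 1).
mag_a = math.sqrt(sum(a * a for a in vec_a))
mag_b = math.sqrt(sum(b * b for b in vec_b))

cosine = dot / (mag_a * mag_b)
print(dot, cosine)
```

The result is 8 / sqrt(84), which matches the value returned by counter_cosine_similarity up to floating-point rounding.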

The cosine similarity is one score you can calculate. If you care about the length of the list, you can calculate another; if you keep that score between 0.0 and 1.0 as well, you can multiply the two values for a final score between 0.0 and 1.0 (the counts are never negative, so the cosine similarity here cannot drop below 0.0).

For example, to take relative lengths into account you could use:

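def length_similarity(c1, c2):
    # Total number of items in each list (sum of all counts).
    # values() works on Python 2 and 3; itervalues() is Python 2 only.
    lenc1 = sum(c1.values())
    lenc2 = sum(c2.values())
    # Ratio of the shorter length to the longer one, in (0.0, 1.0].
    return min(lenc1, lenc2) / float(max(lenc1, lenc2))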

and then combine them into a function that takes the lists as inputs:

def similarity_score(l1, l2):
    c1, c2 = Counter(l1), Counter(l2)
    return length_similarity(c1, c2) * counter_cosine_similarity(c1, c2)  

For your two example lists, that results in:

>>> similarity_score(['apple', 'orange', 'apple', 'apple', 'banana', 'orange'], ['apple', 'orange', 'grapefruit', 'apple'])
0.5819143739626463
>>> similarity_score(['apple', 'apple', 'orange', 'orange'], ['apple', 'orange'])
0.4999999999999999
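
To see why the second pair scores about 0.5 rather than 1, it may help to look at the two factors separately. This is a sketch assuming Python 3 (true division); the helper names cosine and length_ratio are illustrative stand-ins for the functions defined earlier.

```python
import math
from collections import Counter

def cosine(c1, c2):
    # Cosine similarity between two Counters treated as count vectors.
    terms = set(c1).union(c2)
    dot = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    mag1 = math.sqrt(sum(c1.get(k, 0) ** 2 for k in terms))
    mag2 = math.sqrt(sum(c2.get(k, 0) ** 2 for k in terms))
    return dot / (mag1 * mag2)

def length_ratio(c1, c2):
    # Shorter total length divided by longer total length.
    n1, n2 = sum(c1.values()), sum(c2.values())
    return min(n1, n2) / max(n1, n2)

c1 = Counter(['apple', 'apple', 'orange', 'orange'])
c2 = Counter(['apple', 'orange'])

cos_part = cosine(c1, c2)        # same proportions, so very close to 1.0
len_part = length_ratio(c1, c2)  # 2 items vs 4 items
print(cos_part, len_part, cos_part * len_part)
```

The cosine factor alone cannot tell the doubled list from the original because both have the same item proportions; the length factor of 0.5 is what pulls the combined score down.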

You can mix in other metrics as needed.
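
As one illustration of another metric, here is a sketch of a multiset Jaccard index built on the & (minimum count) and | (maximum count) operators of Counter. This metric is not part of the original answer; the function name multiset_jaccard and the choice to return 1.0 for two empty lists are assumptions made for this example.

```python
from collections import Counter

def multiset_jaccard(l1, l2):
    # Hypothetical helper: size of the multiset intersection divided by
    # the size of the multiset union.
    c1, c2 = Counter(l1), Counter(l2)
    intersection = sum((c1 & c2).values())  # per-item minimum counts
    union = sum((c1 | c2).values())         # per-item maximum counts
    if union == 0:
        return 1.0  # assumption: two empty lists count as identical
    return intersection / union

list_a = ['apple', 'orange', 'apple', 'apple', 'banana', 'orange']
list_b = ['apple', 'orange', 'grapefruit', 'apple']

print(multiset_jaccard(list_a, list_b))
print(multiset_jaccard(['apple', 'apple', 'orange', 'orange'], ['apple', 'orange']))
```

For the first pair the intersection holds 3 items and the union 7; for the doubled list the ratio is 2 to 4, so this metric also avoids scoring a doubled list as a perfect match.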
