列表元素的相似性 [英] Similarity of list elements

查看:46
本文介绍了列表元素的相似性的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有十个列表,我想了解它们的相似性".这是我的输入:

data = [['RuntimeInMinutes', 'EpisodeNumber', 'Genres', 'ReleaseDate', 'SeasonNumber', 'Name', 'Platform', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', 'Synopsis', 'TVSeriesID', 'Products', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'Studio', '_StudioName', 'Type', 'Locales'],['RuntimeInMinutes', 'EpisodeNumber', 'Genres', 'ReleaseDate', 'SeasonNumber', 'Name', 'Platform', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', 'Synopsis', 'TVSeriesID', 'Products', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'Studio', '_StudioName', 'Type', 'Locales'],['RuntimeInMinutes', 'Genres', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', '_StudioName', 'Type', 'LanguageOfMetadata', 'ReleaseDate', 'Studio', '_NetworkName', 'ReleaseYear, '_ContentProviderName', 'TVSeriesID', 'Locales', 'EpisodeNumber', 'Name', 'Synopsis', 'Products', 'SeasonNumber', 'Platform'],['RuntimeInMinutes', 'EpisodeNumber', '流派', 'ReleaseDate', 'Name', 'Platform', 'PlatformID', 'BaseURL', 'LanguageOfMetadata', 'Languages', 'ArtworkURL', 'Synopsis', 'TVSeriesID', 'Products', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'Studio', '_StudioName', 'Type', 'Locales'],['RuntimeInMinutes', 'EpisodeNumber', 'Genres', 'ReleaseDate', 'Name', 'Platform', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', 'Synopsis', 'TVSeriesID', '产品', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'Studio', '_StudioName', 'Type', 'Locales'],['RuntimeInMinutes', 'Genres', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', '_StudioName', 'Type', 'LanguageOfMetadata', 'ReleaseDate', 'Studio', '_NetworkName', 'ReleaseYear, '_ContentProviderName', 'TVSeriesID', 'Locales', 'EpisodeNumber', 'Name', 'Synopsis', 'Products', 'SeasonNumber', 'Platform'],['RuntimeInMinutes', 'EpisodeNumber', 'Genres', 'ReleaseDate', 'Name', 'Platform', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', 'Synopsis', 'TVSeriesID', '产品', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'Studio', '_StudioName', 'Type', 'Locales'],['RuntimeInMinutes', 'ReleaseDate', 'Genres', 'Name', 'Platform', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', 'Synopsis', 'TVSeriesID', 'Products', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'Studio', '_StudioName', 'Type', 'Locales'],['RuntimeInMinutes', 'EpisodeNumber', 'Genres', 'ReleaseDate', 'Name', 'Platform', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', 'Synopsis', 'TVSeriesID', '产品', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'Studio', '_StudioName', 'Type', 'Locales'],['RuntimeInMinutes', 'EpisodeNumber', 'Genres', 'ReleaseDate', 'Name', 'Platform', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', 'Synopsis', 'TVSeriesID', '产品', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'Studio', '_StudioName', 'Type', 'Locales']]

我目前的方法是将这些值的集合的长度与总长度进行比较.所以在上面它会是:

<预><代码>>>>len(设置(数据))/len(数据)0.5

然而,这很粗糙,因为我想得到一个不是全有或全无"的相似性.换句话说,类似于概念上的相似性,其中上述内容可能有 98% 的相似性(对不起,如果我无法准确地解释我在这里想要什么——但我的意思是评估相似性不仅仅是列表本身,而是其元素的相似性.

解决方案

查看数据草图库及其MinHash数据结构.这是基于集合的 Jaccard 相似性,这只是交集(它们在通用)除以联合(所有可能的元素).

这是一个例子:

from datasketch import MinHashdata1 = ['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for','估计'、'the'、'相似性'、'之间'、'数据集']data2 = ['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for','估计', 'the', '相似性', 'between', 'documents']m1, m2 = MinHash(), MinHash()对于数据 1 中的 d:m1.update(d.encode('utf8'))对于数据 2 中的 d:m2.update(d.encode('utf8'))打印(数据1和数据2的估计Jaccard是",m1.jaccard(m2))

如果您的集合很大,则为您提供相似性的估计.否则,只需使用内置的集合操作:

s1 = set(data1)s2 = 设置(数据 2)actual_jaccard = float(len(s1.intersection(s2)))/float(len(s1.union(s2)))print("data1 和 data2 的实际 Jaccard 是", actual_jaccard)

如果要取两个以上集合的 Jaccard 相似度,只需计算成对比较并取所有值的均值(平均值):

from datasketch import *导入迭代工具minhash_data = list()对于数据中的元素:m = MinHash()对于元素中的 d:m.update(d.encode('utf-8'))minhash_data.append(m)jaccard_sims = 列表()对于 itertools.combinations(minhash_data, 2) 中的配对:jaccard_sims.append(pair[0].jaccard(pair[1]))平均值 = 总和(jaccard_sims)/浮点数(len(jaccard_sims))打印(平均杰卡德相似度:{}".格式(平均))

<块引用>

平均 Jaccard 相似度:0.9512152777777778

I have ten lists and I want to get their 'similarity'. Here is the input I have:

data = [
    ['RuntimeInMinutes', 'EpisodeNumber', 'Genres', 'ReleaseDate', 'SeasonNumber', 'Name', 'Platform', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', 'Synopsis', 'TVSeriesID', 'Products', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'Studio', '_StudioName', 'Type', 'Locales'], 
    ['RuntimeInMinutes', 'EpisodeNumber', 'Genres', 'ReleaseDate', 'SeasonNumber', 'Name', 'Platform', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', 'Synopsis', 'TVSeriesID', 'Products', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'Studio', '_StudioName', 'Type', 'Locales'], 
    ['RuntimeInMinutes', 'Genres', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', '_StudioName', 'Type', 'LanguageOfMetadata', 'ReleaseDate', 'Studio', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'TVSeriesID', 'Locales', 'EpisodeNumber', 'Name', 'Synopsis', 'Products', 'SeasonNumber', 'Platform'], 
    ['RuntimeInMinutes', 'EpisodeNumber', 'Genres', 'ReleaseDate', 'Name', 'Platform', 'PlatformID', 'BaseURL', 'LanguageOfMetadata', 'Languages', 'ArtworkURL', 'Synopsis', 'TVSeriesID', 'Products', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'Studio', '_StudioName', 'Type', 'Locales'], 
    ['RuntimeInMinutes', 'EpisodeNumber', 'Genres', 'ReleaseDate', 'Name', 'Platform', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', 'Synopsis', 'TVSeriesID', 'Products', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'Studio', '_StudioName', 'Type', 'Locales'], 
    ['RuntimeInMinutes', 'Genres', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', '_StudioName', 'Type', 'LanguageOfMetadata', 'ReleaseDate', 'Studio', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'TVSeriesID', 'Locales', 'EpisodeNumber', 'Name', 'Synopsis', 'Products', 'SeasonNumber', 'Platform'], 
    ['RuntimeInMinutes', 'EpisodeNumber', 'Genres', 'ReleaseDate', 'Name', 'Platform', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', 'Synopsis', 'TVSeriesID', 'Products', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'Studio', '_StudioName', 'Type', 'Locales'], 
    ['RuntimeInMinutes', 'ReleaseDate', 'Genres', 'Name', 'Platform', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', 'Synopsis', 'TVSeriesID', 'Products', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'Studio', '_StudioName', 'Type', 'Locales'], 
    ['RuntimeInMinutes', 'EpisodeNumber', 'Genres', 'ReleaseDate', 'Name', 'Platform', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', 'Synopsis', 'TVSeriesID', 'Products', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'Studio', '_StudioName', 'Type', 'Locales'], 
    ['RuntimeInMinutes', 'EpisodeNumber', 'Genres', 'ReleaseDate', 'Name', 'Platform', 'PlatformID', 'BaseURL', 'Languages', 'ArtworkURL', 'Synopsis', 'TVSeriesID', 'Products', '_NetworkName', 'ReleaseYear', '_ContentProviderName', 'Studio', '_StudioName', 'Type', 'Locales']
]

My current method is to compare the length of the set of those values with the total length. So in the above it would be:

>>> len(set(data))/len(data)
0.5

However this is quite crude, as I'd like to get a similarity that isn't "all-or-nothing". In other words, something like a conceptual similarity, where the above may have a 98% similarity (sorry if I'm having trouble explaining exactly what I want here -- but I mean to evaluate similarity as not just the list itself, but the similarity of its elements.

解决方案

Take a look at the datasketch library and its MinHash data structure. This is based on the Jaccard Similarity for sets, which is just the intersection (what they have in common) divided by the union (all possible elements).

Here is an example:

from datasketch import MinHash

data1 = ['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'datasets']
data2 = ['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'documents']

m1, m2 = MinHash(), MinHash()
for d in data1:
    m1.update(d.encode('utf8'))
for d in data2:
    m2.update(d.encode('utf8'))
print("Estimated Jaccard for data1 and data2 is", m1.jaccard(m2))

Gives you an estimation of the similarity, if your sets are large. Otherwise, just use the built-in set operations:

s1 = set(data1)
s2 = set(data2)
actual_jaccard = float(len(s1.intersection(s2)))/float(len(s1.union(s2)))
print("Actual Jaccard for data1 and data2 is", actual_jaccard)

If you want to take the Jaccard similarity of more than two sets, just compute the pairwise-comparison and take the mean (average) of all the values:

from datasketch import *
import itertools
minhash_data = list()
for element in data:
    m = MinHash()
    for d in element:
        m.update(d.encode('utf-8'))
    minhash_data.append(m)

jaccard_sims = list()
for pair in itertools.combinations(minhash_data, 2):
    jaccard_sims.append(pair[0].jaccard(pair[1]))

average = sum(jaccard_sims) / float(len(jaccard_sims))
print("Average Jaccard similarity: {}".format(average))

Average Jaccard similarity: 0.9512152777777778

这篇关于列表元素的相似性的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆