如何按包含元素的相似性对集合进行分组 [英] How to group sets by similarity in contained elements

查看:73
本文介绍了如何按包含元素的相似性对集合进行分组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我使用的是 Python 2.7.我有由相互连接的节点数组组成的路由.节点由字符串键标识,但为方便起见,我将使用数字:

I am using Python 2.7. I have routes which are composed of arrays of nodes that connect to each other. The nodes are identified by a string key, but for ease I will use numbers:

sample_route = [1,2,3,4,7] 
#obviously over-simplified; real things would be about 20-40 elements long

我将使用 zip 创建一个由点连接的元组对组成的 set,最终会像:

I will create a set made up of tuple pairs of point connections using zip, which will end up like:

set([(1,2),(2,3),(3,4),(4,7)])

我需要一种方法来过滤掉一些非常相似的路由(比如一个或两个添加的节点),并使用那些接近重复的最小路由.我现在的计划是:

I will need a way to filter out some routes that are very similar (like one or two added nodes) and use the minimal route out of those near-duplicates. My plan right now is to:

从第一条(可能是最佳的)路线开始.遍历其余的路线,并使用以下公式计算其与步骤 1 中的序列的相似度:

Start with the first (probably optimal) route. Iterate through the rest of the routes, and use the following formula to calculate its similarity to the sequence in step 1:

matching = len(value1.difference(value2)) + len(value2.difference(value1))
#value1, value2 = two compared sets

数字越小,越相似.但是根据路由与其他路由的相似性对路由进行分组的好方法是什么?它们的长度都不同.我从未参加过统计学课程.

The lower the number, the more similar. But what is a good way to group the routes based on their similarity to other routes? They will all be different lengths. I have never taken a statistics course.

示例:

sets = [
  set([(1,2),(2,3),(3,4),(4,5),(5,10)]),
  set([(1,2),(2,3),(3,4),(4,6),(6,5),(5,10)]),
  set([(1,7),(7,3),(3,8),(8,7),(7,6),(6,5),(5,10)]),
  set([(1,2),(2,3),(3,4),(4,6),(6,5),(5,10)]),
  set([(1,9),(9,4),(4,5),(5,10)]),
  set([(1,9),(9,4),(4,6),(6,5),(5,10)])
]

在这个例子中,分组可能类似于 [[1,2,4],[3],[5,6]],其中 1、2 和 4 非常相似, 5 和 6 是相似的,而 3 与其他任何一个都不相近.例如,1 到 2 的分数为 2,而 3 到 6 的分数为 8.这是我正在使用的数据类型(尽管这些简化了易于阅读).

In this example, the groupings may be something like [[1,2,4],[3],[5,6]], where 1, 2, and 4 are very similar, 5 and 6 are similar, and 3 is nowhere near any of the others. 1 to 2 would have a score of 2, and 3 to 6 would have a score of 8, as examples. This is the sort of data I am using (though these are easy to read simplifications).

这样做有时间上的好处.如果我能去掉多余的路线,我会削减相当多的时间.

There is a time benefit in this. If I can remove redundant routes, I will trim off a considerable amount of time.

推荐答案

我建议查看 networkx 包.它允许您创建有向图,例如您所描述的.为了衡量 2 条路线的相似度,我会推荐 Jaccard 相似度指数.这是显示您所说明的示例的代码.

I would recommend looking into the networkx package. It allows you to create directed graphs such as you are describing. For measuring the similarity of 2 routes, I would recommend the Jaccard similarity index. Here is code showing the example you illustrated.

首先,导入几个库:graphs、plotting 和 numeric python.然后通过添加编号为 1 到 8 的节点来构建有向图.构建节点到节点的连接以构建路径.networkx 包具有查找图中从一个节点到另一个节点的所有路径的内置功能:nx.all_simple_paths(g, start_node, end_node).

First, import a few libraries: graphs, plotting, and numeric python. Then build the directed graph by adding nodes numbered 1 to 8. Build the connections from node to node to build your path. The networkx package has built-in capability to find all paths in the graph from one node to another: nx.all_simple_paths(g, start_node, end_node).

拥有所有路径后,您可以计算路径之间 Jaccard 相似度的矩阵 J.您实际上希望如何通过路径的相似性对路径进行聚类取决于您.

Once you have all of the paths, you can compute the a matrix, J, of the Jaccard similarity between the paths. How you actually want to cluster the paths by their similarity is up to you.

import networkx as nx
import matplotlib.pyplot as plt
import numpy as np

g = nx.DiGraph()
g.add_nodes_from(range(1,8))
g.add_edges_from([(1,2),(2,3),(3,4),(4,7)]) #path 1,2,3,4,7
g.add_edges_from([(4,5),(5,7)])             #path 1,2,3,4,5,7
g.add_edges_from([(4,6),(6,7)])             #path 1,2,3,4,6,7
paths_iter = nx.all_simple_paths(g,1,7)
paths = [p for p in paths]

np.random.seed(100000)
nx.draw_spring(g, with_labels=True)
plt.show()

def jaccard(v1, v2):
    return (len(np.intersect1d(v1,v2))+0.0)/len(np.union1d(v1,v2))

J = np.zeros([len(paths),len(paths)])
for i in range(J.shape[0]):
    for j in range(i, J.shape[1]):
        J[i,j] = J[j,i] = jaccard(paths[i],paths[j])

print J

> [[ 1.          0.71428571  0.83333333]
>  [ 0.71428571  1.          0.83333333]
>  [ 0.83333333  0.83333333  1.        ]]

编辑由于您有一个度量来比较路径之间的相似性,因此请查看 k 均值聚类以将路径聚类在一起.

EDIT Since you have a metric to compare the similarity of the paths to each other, look into k-means clustering to cluster the paths together.

from scipy.cluster.vq import kmeans2

从现在开始,我没有足够的代码或数据来提供帮助.

I don't have enough of your code or your data to help anymore from this point.

这篇关于如何按包含元素的相似性对集合进行分组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆