如何将匹配对聚集到“连接的组件”中。在Python中 [英] How to aggregate matching pairs into "connected components" in Python

查看:97
本文介绍了如何将匹配对聚集到“连接的组件”中。在Python中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有许多公司的董事数据,但有时 XYZ董事约翰·史密斯和 ABC董事约翰·史密斯是同一个人,有时候却不是。另外, XYZ董事约翰·史密斯和美国广播公司董事约翰·史密斯可能是同一个人,也可能不是同一个人。通常,对附加信息进行检查(例如,比较 XYZ主任约翰·史密斯和 ABC主任约翰·史密斯的传记数据)就可以确定两个观察是否是同一个人。

I have data on directors across many firms, but sometimes "John Smith, director of XYZ" and "John Smith, director of ABC" are the same person, sometimes they're not. Also "John J. Smith, director of XYZ" and "John Smith, director of ABC" might be the same person, or might not be. Often examination of additional information (e.g., comparison of biographical data on "John Smith, director of XYZ" and "John Smith, director of ABC") makes it possible to resolve whether two observations are the same person or not.

本着这种精神,我正在收集可识别匹配对的数据。例如,假设我有以下匹配对: {(a,b),(b,c),(c,d),(d,e),(f,g)} 。我想使用关系与...是同一个人的传递性属性来生成 {{a,b,c,d,e},{f,g}} 。即 {a,b,c,d,e} 是一个人,而 {f,g} 是另一个人。 (该问题的较早版本称为 cliques,显然还有其他含义;这可以解释为什么 networkx find_cliques

In that spirit, am collecting data that will identify matching pairs. For example, suppose I have the following matching pairs: {(a, b), (b, c), (c, d), (d, e), (f, g)}. I want to use the transitivity property of the relation "is the same person as" to generate "connected components" of {{a, b, c, d, e}, {f, g}}. That is {a, b, c, d, e} is one person and {f, g} is another. (An earlier version of the question referred to "cliques", which are apparently something else; this would explain why find_cliques in networkx was giving the "wrong" results (for my purposes).

下面的Python代码可以完成这项工作,但是我想知道:有没有更好的方法(计算成本更低) )方法(例如,使用标准库或可用库)?

The following Python code does the job. But I wonder: is there a better (less computationally costly) approach (e.g., using standard or available libraries)?

这里有些地方似乎很相关(例如, Python中的小语),但是它们并不完整,因此我不确定它们所指的是什么库或如何设置我的数据以使用它们。 / p>

示例Python 2代码:



There are examples here and there that seem related (e.g., Cliques in python), but these are incomplete, so I am not sure what libraries they are referring to or how to set up my data to use them.

def get_cliques(pairs):
    from sets import Set

    set_list = [Set(pairs[0])]

    for pair in pairs[1:]:
        matched=False
        for set in set_list:
            if pair[0] in set or pair[1] in set:
                set.update(pair)
                matched=True
                break
        if not matched:
            set_list.append(Set(pair))

    return set_list

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]

print(get_cliques(pairs))

这会产生所需的输出: [Set(['a','c','b' ,'e','d']),Set(['g','f'])]

This produces the desired output: [Set(['a', 'c', 'b', 'e', 'd']), Set(['g', 'f'])].

这将生成 [set(['a','c','b','e','d']) ,set(['g','f'])] ):

def get_cliques(pairs):

    set_list = [set(pairs[0])]

    for pair in pairs[1:]:
        matched=False
        for a_set in set_list:
            if pair[0] in a_set or pair[1] in a_set:
                a_set.update(pair)
                matched=True
                break
        if not matched:
            set_list.append(set(pair))

    return set_list

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]

print(get_cliques(pairs))


推荐答案

使用networkX:

With networkX:

import networkx as nx
G1=nx.Graph()
G1.add_edges_from([("a","b"),("b","c"),("c","d"),("d","e"),("f","g")])
sorted(nx.connected_components(G1), key = len, reverse=True)

giving:

[['a', 'd', 'e', 'b', 'c'], ['f', 'g']]

您必须检查最快的现在使用算法...

You have to check the fastest algorithm now ...

OP:

这很好用!我现在在我的PostgreSQL数据库中。只需将对组织到一个两列的表中,然后使用 array_agg()传递给PL / Python函数 get_connected()

This works great! I have this in my PostgreSQL database now. Just organize pairs into a two-column table, then use array_agg() to pass to PL/Python function get_connected(). Thanks.

CREATE OR REPLACE FUNCTION get_connected(
    lhs text[],
    rhs text[])
  RETURNS SETOF text[] AS
$BODY$
    pairs = zip(lhs, rhs)

    import networkx as nx
    G=nx.Graph()
    G.add_edges_from(pairs)
    return sorted(nx.connected_components(G), key = len, reverse=True)

$BODY$ LANGUAGE plpythonu;

(注意:我编辑了答案,因为我认为显示此步骤可能对附录有帮助,但时间太长进行评论。)

(Note: I edited answer, as I thought showing this step might be helpful addendum, but too long for a comment.)

这篇关于如何将匹配对聚集到“连接的组件”中。在Python中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆