在 Python 元组列表中查找重复项 [英] Find duplicate items within a list of list of tuples Python

查看:29
本文介绍了在 Python 元组列表中查找重复项的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想从下面给定的列表中找到匹配的项目.我的列表可能超级大.

I want to find the matching item from the below given list.My List may be super large.

元组N1_10"中的第一项被复制并与另一个数组中的另一项匹配

The very first item in the tuple "N1_10" is duplicated and matched with another item in another array

ListA 中第一个数组中的元组('N1_10', 'N2_28')
ListA 中第二个数组中的元组 ('N1_10', 'N3_98')

tuple in 1st array in the ListA ('N1_10', 'N2_28')
tuple in 2nd array in the ListA ('N1_10', 'N3_98')

ListA  = [[('N1_10', 'N2_28'), ('N1_35', 'N2_44')],
          [('N1_22', 'N3_72'), ('N1_10', 'N3_98')],
          [('N2_33', 'N3_28'), ('N2_55', 'N3_62'), ('N2_61', 'N3_37')]]

我想要的输出是

output --> [('N1_10','N2_28','N3_98') , .... 和其他任何匹配的键将进入相同的元组]

output --> [('N1_10','N2_28','N3_98') , .... and the rest whatever match one of the key will get into same tuple]

如果你们认为,改变 ListA 的数据结构是更好的选择,请随时提出建议!感谢您的帮助!

If you guys think , changing the data structure of the ListA is better option , pls feel free to advise! Thanks for helping out!

简化版

列表 A = [[(a,x),(b,k),(c,l),(d,m)],[(e,d),(a,p),(g,s)],[...],[...]....]

List A = [[(a,x),(b,k),(c,l),(d,m)],[(e,d),(a,p),(g,s)],[...],[...]....]

wantedOutput --> [(a,x,p),(b,k),(c,l),(d,m,e),(g,s).....]

wantedOutput --> [(a,x,p),(b,k),(c,l),(d,m,e),(g,s).....]

推荐答案

更新:重新阅读您的问题后,您似乎正在尝试创建等价类,而不是为键收集值.如果

Update: After rereading your question, it appears that you're trying to create equivalence classes, rather than collecting values for keys. If

[[(1, 2), (3, 4), (2, 3)]]

应该变成

[(1, 2, 3, 4)]

,那么您将需要将输入解释为图形并应用连通分量算法.您可以将您的数据结构转换为 邻接列表 表示并以广度优先或深度遍历它-首先搜索或遍历您的列表并构建不相交集.无论哪种情况,您的代码都会突然涉及大量与图形相关的复杂性,并且很难根据输入的顺序提供任何输出顺序保证.这是一个基于广度优先搜索的算法:

, then you're going to need to interpret your input as a graph and apply a connected components algorithm. You could turn your data structure into an adjacency list representation and traverse it with a breadth-first or depth-first search, or iterate over your list and build disjoint sets. In either case, your code is going to suddenly involve a lot of graph-related complexity, and it'll be hard to provide any output ordering guarantees based on the order of the input. Here's an algorithm based on a breadth-first search:

import collections

# build an adjacency list representation of your input
graph = collections.defaultdict(set)
for l in ListA:
    for first, second in l:
        graph[first].add(second)
        graph[second].add(first)

# breadth-first search the graph to produce the output
output = []
marked = set() # a set of all nodes whose connected component is known
for node in graph:
    if node not in marked:
        # this node is not in any previously seen connected component
        # run a breadth-first search to determine its connected component
        frontier = set([node])
        connected_component = []
        while frontier:
            marked |= frontier
            connected_component.extend(frontier)

            # find all unmarked nodes directly connected to frontier nodes
            # they will form the new frontier
            new_frontier = set()
            for node in frontier:
                new_frontier |= graph[node] - marked
            frontier = new_frontier
        output.append(tuple(connected_component))

但是,不要在不理解的情况下复制它;了解它在做什么,或者编写自己的实现.您可能需要能够维护这一点.(我会使用伪代码,但 Python 实际上已经和伪代码一样简单了.)

Don't just copy this without understanding it, though; understand what it's doing, or write your own implementation. You'll probably need to be able to maintain this. (I would've used pseudocode, but Python is practically as simple as pseudocode already.)

如果我对您问题的原始解释是正确的,并且您的输入是您想要聚合的键值对的集合,那么这是我的原始答案:

In case my original interpretation of your question was correct, and your input is a collection of key-value pairs that you want to aggregate, here's my original answer:

原答案

import collections

clusterer = collections.defaultdict(list)

for l in ListA:
    for k, v in l:
        clusterer[k].append(v)

output = clusterer.values()

defaultdict(list) 是一个 dict,它会自动创建一个 list 作为任何不存在的键的值.循环遍历所有元组,收集与同一键匹配的所有值,然后从 defaultdict 创建一个 (key, value_list) 对的列表.

defaultdict(list) is a dict that automatically creates a list as the value for any key that wasn't already present. The loop goes over all the tuples, collecting all values that match up to the same key, then creates a list of (key, value_list) pairs from the defaultdict.

(这段代码的输出和你指定的形式不太一样,但我相信这个形式更有用.如果你想改变形式,那应该是一件简单的事情.)

(The output of this code is not quite in the form you specified, but I believe this form is more useful. If you want to change the form, that should be a simple matter.)

这篇关于在 Python 元组列表中查找重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆