How to find duplicate list values?


Question

I have an unusual task. Data:

[(1566767777.0, 'Aaron Paul', 'dorety1', 'sfp_names', 'HUMAN_NAME', 100, 100, 0, '8ff7', '08f3', 'Human Name', 'ENTITY', '19fd', 0, 0),
 (1566767863.0, 'Aaron Paul', "{'username': 'aaronpaul', 'last_name': 'Paul', 'friends_count': 509, 'is_group': False, 'is_active': True, 'trust_request': None, 'phone': None, 'profile_picture_url': 'http, 'is_blocked': False, 'id': '1690', 'identity': None, 'date_joined': '2015-05-22T18:58:12', 'about': ' ', 'display_name': 'Aaron Paul', 'first_name': 'Aaron', 'friend_status': None, 'email': None}", 'sfp_names', 'HUMAN_NAME', 100, 100, 0, '7049', 'a458', 'Human Name', 'ENTITY', '19fd', 0, 0),
 (1566, 'Aaron Paul', 'Possible full name: Aaron Paul', 'sfp_names', 'HUMAN_NAME', 100, 100, 0, '6685', '235f', 'Human Name', 'ENTITY', '19fd', 0, 0),
 (1566767503.0, 'Antoine Griezmann', 'dorety', 'sfp_names', 'HUMAN_NAME', 100, 100, 0, '16ab', '08f3', 'Human Name', 'ENTITY', '19fd', 0, 0),
 (1566767108.0, 'Boris Johnson', 'dorety', 'sfp_names', 'HUMAN_NAME', 100, 100, 0, '7931', '08f3', 'Human Name', 'ENTITY', '19fd', 0, 0)]

I need to get values from the tuples in which [1] is duplicated, with each distinct [3] reported only once. In the data above, [3] is always the same (sfp_names) and [1] (Aaron Paul) appears in several tuples, so from this list we should pick out only (1566767777.0, 'Aaron Paul', 'dorety1', 'sfp_names', 'HUMAN_NAME', 100, 100, 0, '8ff7', '08f3', 'Human Name', 'ENTITY', '19fd', 0, 0) and the two other tuples that have the name Aaron Paul. It makes no difference how many tuples the pair occurs in: from these three tuples we need a single value, [['Aaron Paul', 'sfp_names']]. But if the third tuple had the module name sfp_names_2, we would need two values, since the modules are different: [['Aaron Paul', 'sfp_names'], ['Aaron Paul', 'sfp_names_2']].

As for my own attempts, nothing came to mind for this part; I only know ways to find duplicates inside a list.

I realize my description is hard to follow, so here are some simple examples of how it should work.

Simple version

Data:

[(0, 'Boby', 'beekeeper'), (1, 'Boby', 'beekeeper'), (2, 'Boby', 'beekeeper'), (3, 'Boby', 'gardener')]

Result:

['Boby', 'beekeeper']

Data:

[(0, 'Boby', 'beekeeper'), (1, 'Boby', 'beekeeper'), (2, 'Boby', 'beekeeper'), (3, 'Boby', 'gardener'), (4, 'Boby', 'gardener'), (5, 'Jack', 'gardener')]

Result:

[['Boby', 'beekeeper'], ['Boby', 'gardener']]

Recommended answer

I'm not completely sure if I understand you correctly:

You would like to find the elements (tuples) of a list that share a collection of entries occurring multiple times in the list?

A compact implementation combines itertools.groupby with operator.itemgetter, which reduces to a one-liner expression. Note that groupby only groups consecutive items with equal keys, so the data must be sorted (or already ordered) by those keys:

from operator import itemgetter
from itertools import groupby

# how often must the pattern appear (redundancy)
# what indices determine the pattern (target_slots)
redundancy, target_slots = 2, (1, 2)

eg_data_2 =  [(0, 'Boby', 'beekeeper'), (1, 'Boby', 'beekeeper'), (2, 'Boby','beekeeper'), (3, 'Boby', 'gardener'), (4, 'Boby', 'gardener'), (5, 'Jack', 'gardener')]

# groupby only merges adjacent items, so sort by the key first
key = itemgetter(*target_slots)
targets = [k for k, v in groupby(sorted(eg_data_2, key=key), key) if sum(1 for _ in v) >= redundancy]

targets
Out[6]: [('Boby', 'beekeeper'), ('Boby', 'gardener')]
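
To see why the ordering of the input matters for groupby, here is a small sketch. The shuffled list and the repeated_keys helper are made up for illustration and are not part of the original answer; the rows are the same as in the example data, only reordered.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical reordering of the example data: equal keys are never adjacent.
shuffled = [(0, 'Boby', 'beekeeper'), (3, 'Boby', 'gardener'),
            (1, 'Boby', 'beekeeper'), (4, 'Boby', 'gardener'),
            (2, 'Boby', 'beekeeper'), (5, 'Jack', 'gardener')]

key = itemgetter(1, 2)

def repeated_keys(rows, redundancy=2):
    """Return the keys whose consecutive run has at least `redundancy` rows."""
    return [k for k, v in groupby(rows, key) if sum(1 for _ in v) >= redundancy]

# Without sorting, every run has length 1, so nothing is reported.
unsorted_result = repeated_keys(shuffled)
# After sorting by the same key, equal keys become adjacent.
sorted_result = repeated_keys(sorted(shuffled, key=key))

print(unsorted_result)  # []
print(sorted_result)    # [('Boby', 'beekeeper'), ('Boby', 'gardener')]
```

Sorting costs O(n log n); if that matters, a dictionary-based count avoids the sort entirely.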

For your original data (referred to as orig_data here) you would get:

target_slots = [1, 3]
key = itemgetter(*target_slots)
targets = [k for k, v in groupby(sorted(orig_data, key=key), key) if sum(1 for _ in v) >= redundancy]

In [9]: targets                                                           
Out[9]: [('Aaron Paul', 'sfp_names')]
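
The snippet above does not define orig_data. A self-contained sketch, assuming a shortened stand-in for the question's data (only the first four fields of each row are kept, and the long dictionary string in the second row is abbreviated), could look like this:

```python
from itertools import groupby
from operator import itemgetter

# Shortened stand-in for the question's data (an assumption for illustration):
# only fields [0]..[3] are kept, and the dictionary string is abbreviated.
orig_data = [
    (1566767777.0, 'Aaron Paul', 'dorety1', 'sfp_names'),
    (1566767863.0, 'Aaron Paul', "{'username': 'aaronpaul'}", 'sfp_names'),
    (1566, 'Aaron Paul', 'Possible full name: Aaron Paul', 'sfp_names'),
    (1566767503.0, 'Antoine Griezmann', 'dorety', 'sfp_names'),
    (1566767108.0, 'Boris Johnson', 'dorety', 'sfp_names'),
]

redundancy, target_slots = 2, (1, 3)
key = itemgetter(*target_slots)

# Sort by (name, module) so equal pairs are adjacent, then keep repeated pairs.
targets = [k for k, v in groupby(sorted(orig_data, key=key), key)
           if sum(1 for _ in v) >= redundancy]

print(targets)  # [('Aaron Paul', 'sfp_names')]
```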


As an alternative, you can work with itemgetter and a dictionary, without groupby. The idea is to use each collection of elements as a key, with the value being the list of indices of the elements in which this particular collection occurs. Then, if this list is at least as long as the threshold you chose (the redundancy parameter below), we report this particular collection:

from operator import itemgetter
from collections import defaultdict

# how many times must the collection of elements appear
redundancy = 2
# what are the indices of the collection
target_slots = [1, 2] 

# the example data:
eg_data_2 =  [(0, 'Boby', 'beekeeper'), (1, 'Boby', 'beekeeper'), (2, 'Boby','beekeeper'), (3, 'Boby', 'gardener'), (4, 'Boby', 'gardener'), (5, 'Jack', 'gardener')]


occurrences = defaultdict(list)  # this is just convenient, you can use a normal dict as well.
for i, entry in enumerate(eg_data_2):
    occurrences[itemgetter(*target_slots)(entry)].append(i)
targets = [k for k, v in occurrences.items() if len(v) >= redundancy]
targets
Out[18]: [('Boby', 'beekeeper'), ('Boby', 'gardener')]
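
If only the counts are needed and not the indices, collections.Counter can stand in for the defaultdict. This is a variation on the approach above rather than part of the original answer:

```python
from collections import Counter
from operator import itemgetter

redundancy, target_slots = 2, (1, 2)

eg_data_2 = [(0, 'Boby', 'beekeeper'), (1, 'Boby', 'beekeeper'),
             (2, 'Boby', 'beekeeper'), (3, 'Boby', 'gardener'),
             (4, 'Boby', 'gardener'), (5, 'Jack', 'gardener')]

# Count how often each (name, job) combination occurs; input order is irrelevant.
counts = Counter(map(itemgetter(*target_slots), eg_data_2))

# Keep the combinations that reach the threshold (first-seen order is preserved).
targets = [k for k, n in counts.items() if n >= redundancy]

print(targets)  # [('Boby', 'beekeeper'), ('Boby', 'gardener')]
```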


If you want the matching elements (the full tuples) back rather than the repeated entries, you need to adapt the targets statement slightly, because sum(1 for _ in v) already consumes the group iterator.

This is how it looks:

from operator import itemgetter
from itertools import groupby

redundancy, target_slots = 2, (1, 2)

eg_data_2 =  [(0, 'Boby', 'beekeeper'), (1, 'Boby', 'beekeeper'), (2, 'Boby','beekeeper'), (3, 'Boby', 'gardener'), (4, 'Boby', 'gardener'), (5, 'Jack', 'gardener')]

# sort first so equal keys are adjacent, then materialize each group as a list
key = itemgetter(*target_slots)
_targets = [(k, list(v)) for k, v in groupby(sorted(eg_data_2, key=key), key)]
targets = [tg[1] for tg in _targets if len(tg[1]) >= redundancy]

Which gives:

In [6]: targets
Out[6]: 
[[(0, 'Boby', 'beekeeper'),
  (1, 'Boby', 'beekeeper'),
  (2, 'Boby', 'beekeeper')],
 [(3, 'Boby', 'gardener'), (4, 'Boby', 'gardener')]]
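
The same result can be obtained without groupby by collecting the rows in a dictionary, which does not depend on the input order. The helper name repeated_groups is hypothetical, introduced only for this sketch:

```python
from collections import defaultdict
from operator import itemgetter

def repeated_groups(rows, target_slots, redundancy=2):
    """Map each repeated key to the list of rows that carry it (hypothetical helper)."""
    key = itemgetter(*target_slots)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    # Keep only the keys that occur at least `redundancy` times.
    return {k: v for k, v in groups.items() if len(v) >= redundancy}

eg_data_2 = [(0, 'Boby', 'beekeeper'), (1, 'Boby', 'beekeeper'),
             (2, 'Boby', 'beekeeper'), (3, 'Boby', 'gardener'),
             (4, 'Boby', 'gardener'), (5, 'Jack', 'gardener')]

groups = repeated_groups(eg_data_2, (1, 2))
print(list(groups))         # the repeated keys
print(list(groups.values()))  # the rows behind each repeated key
```

Returning a dictionary keeps both the repeated keys and the matching rows available, so the caller can pick whichever form the task needs.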
