Filter a list of dictionaries to remove duplicates within a key, based on another key

Question

I have a list of dictionaries in Python 3.5.2 that I am attempting to "deduplicate". All of the dictionaries are unique, but there is a specific key I would like to deduplicate on, keeping the dictionary with the most non-null values.

For example, I have the following list of dictionaries:

d1 = {"id":"a", "foo":"bar", "baz":"bat"}
d2 = {"id":"b", "foo":"bar", "baz":None}
d3 = {"id":"a", "foo":"bar", "baz":None}
d4 = {"id":"b", "foo":"bar", "baz":"bat"}
l = [d1, d2, d3, d4]

I would like to filter l to just dictionaries with unique id keys, keeping the dictionary that has the fewest nulls. In this case the function should keep d1 and d4.

What I attempted was to create a new key,val pair for "value count" like so:

for d in l:
    d['val_count'] = len(set([v for v in d.values() if v]))

What I am stuck on now is how to filter my list of dicts down to unique ids, keeping for each id the dict whose val_count is greatest.

I am open to other approaches, but I am unable to use pandas for this project due to resource constraints.

Expected output:

l = [{"id":"a", "foo":"bar", "baz":"bat"},
 {"id":"b", "foo":"bar", "baz":"bat"}]

Recommended answer

I would use groupby and just pick the first one from each group:

1) First sort your list by id (to create the groups) and, within each id, by descending count of non-null values, so the dict with the fewest nulls comes first (your stated goal):

>>> l2=sorted(l, key=lambda d: (d['id'], -sum(1 for v in d.values() if v)))

2) Then group the sorted list by id. groupby yields one iterator per id (named d below); take the first element of each:

>>> from itertools import groupby
>>> [next(d) for _,d in groupby(l2, key=lambda _d: _d['id'])]
[{'id': 'a', 'foo': 'bar', 'baz': 'bat'}, {'id': 'b', 'foo': 'bar', 'baz': 'bat'}]
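
The two steps above can be wrapped into one helper. This is only a minimal sketch: the function name dedupe_by_key and its key parameter are hypothetical, and it counts only None as null, which is an assumption (the expression above treats every falsy value, such as 0 or an empty string, as null).

```python
from itertools import groupby


def dedupe_by_key(dicts, key="id"):
    """For each distinct value of `key`, keep the dict with the most non-None values."""

    def non_null_count(d):
        # Assumption: only None counts as "null".
        return sum(1 for v in d.values() if v is not None)

    # Sort by the grouping key so equal ids are adjacent (groupby requires this),
    # then by descending non-null count so the best candidate leads its group.
    ordered = sorted(dicts, key=lambda d: (d[key], -non_null_count(d)))

    # next(group) takes the first, i.e. best, dict of each group.
    return [next(group) for _, group in groupby(ordered, key=lambda d: d[key])]


d1 = {"id": "a", "foo": "bar", "baz": "bat"}
d2 = {"id": "b", "foo": "bar", "baz": None}
d3 = {"id": "a", "foo": "bar", "baz": None}
d4 = {"id": "b", "foo": "bar", "baz": "bat"}

result = dedupe_by_key([d1, d2, d3, d4])
print(result)
```

With the sample data from the question this keeps d1 and d4. Note that the result comes back ordered by id, not in the original list order.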

If you want a 'tie breaker' that selects the first dict in list order when two dicts with the same id have the same null count, you can decorate the list with enumerate and use the original index as the last element of the sort key:

>>> l2=sorted(enumerate(l), key=lambda t: (t[1]['id'], -sum(1 for v in t[1].values() if v), t[0]))
>>> [next(d)[1] for _,d in groupby(l2, key=lambda t: t[1]['id'])]

I doubt that additional step is actually necessary, though, since Python's sort (and sorted) is stable: dicts with the same id and the same null count keep their original list order. So use the first version unless you are sure you need the second.
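
A quick way to check the stability claim is to feed in two dicts with the same id and the same number of nulls, once in each order. The data and the helper names below (sort_key, pick_first_per_id) are made up for illustration, and None is again assumed to be the only null value.

```python
from itertools import groupby


def sort_key(d):
    # Same idea as the first version: id, then descending non-null count.
    return (d["id"], -sum(1 for v in d.values() if v is not None))


def pick_first_per_id(dicts):
    ordered = sorted(dicts, key=sort_key)
    return [next(group) for _, group in groupby(ordered, key=lambda d: d["id"])]


# Two dicts that tie: same id, same number of non-null values.
first = {"id": "a", "foo": "first", "baz": None}
second = {"id": "a", "foo": "second", "baz": None}

winner_forward = pick_first_per_id([first, second])[0]
winner_reversed = pick_first_per_id([second, first])[0]
print(winner_forward["foo"], winner_reversed["foo"])
```

Because sorted is stable, whichever tied dict appears earlier in the input is the one that survives, without any enumerate step.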
