Python - Group duplicates in a list of lists by index

Problem description

I've seen a lot of questions about removing duplicates from a list and counting them. But I'm trying to find the best way to group them - for a list of lists.

Given this example I want to group by the third field:

[[1, "text", "name1", "text"],
 [2, "text", "name2", "text"],
 [3, "text", "name2", "text"],
 [4, "text", "name1", "text"]]

I want to get this:

[[[1, "text", "name1", "text"],
  [4, "text", "name1", "text"]],
 [[2, "text", "name2", "text"],
  [3, "text", "name2", "text"]]]

I can think of the naive way by looping through and just keeping track of what is found (O(n^2)). But I would assume there's a better way.
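
For reference, here is a minimal sketch of that naive quadratic approach, with the sample data bound to an illustrative name l (the same name the answer's snippets use):

l = [[1, "text", "name1", "text"],
     [2, "text", "name2", "text"],
     [3, "text", "name2", "text"],
     [4, "text", "name1", "text"]]

groups = []
for row in l:
    for group in groups:            # linear scan over the groups built so far
        if group[0][2] == row[2]:   # compare the third field
            group.append(row)
            break
    else:                           # no matching group yet, start a new one
        groups.append([row])

print(groups)
# [[[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text']],
#  [[2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']]]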

Recommended answer

You could sort and use groupby, but that is O(n log n):

from operator import itemgetter
from itertools import groupby

print([list(v) for _, v in groupby(sorted(l, key=itemgetter(2)), itemgetter(2))])
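
Assuming the question's sample list is stored in l, this prints exactly the desired grouping:

[[[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text']],
 [[2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']]]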

Or, for an O(n) solution, use an OrderedDict keyed by the third element, appending each sublist to the list stored under that key; setdefault handles the repeated keys:

from collections import OrderedDict

od = OrderedDict()
for sub in l:
    od.setdefault(sub[2], []).append(sub)

from pprint import pprint as pp
pp(od.values())
[[[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text']],
[[2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']]]
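
(A small portability note, not from the original answer: on Python 3, od.values() returns a view object rather than a list, so convert it explicitly to get the nested list shown above.)

result = list(od.values())   # needed on Python 3; od.values() is already a plain list on Python 2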

If order does not matter, you can use a defaultdict in place of the OrderedDict.
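
A minimal sketch of that defaultdict variant, again grouping on the third field (note that before Python 3.7 a plain dict or defaultdict does not preserve the order in which the groups were first seen):

from collections import defaultdict

groups = defaultdict(list)
for sub in l:
    groups[sub[2]].append(sub)   # a missing key automatically gets a fresh empty list

print(list(groups.values()))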

If order does not matter, a defaultdict is by far the most efficient:

In [6]: from random import choice

In [7]: from itertools import groupby

In [8]: from collections import OrderedDict, defaultdict                               

In [9]: l = [[1, "text", "name{}".format(choice(list(range(2000)))), "text"] for _ in xrange(40000)]

In [13]: from operator import  itemgetter

In [14]: timeit [list(v) for _,v in groupby( sorted(l,key=itemgetter(2)),itemgetter(2))]
10 loops, best of 3: 42.5 ms per loop

In [15]: %%timeit                                                                       
od = defaultdict(list)
for sub in l:
    od[sub[2]].append(sub)
   ....: 
100 loops, best of 3: 9.42 ms per loop

In [16]: %%timeit                                                                       
od = OrderedDict()
for sub in l:
     od.setdefault(sub[2],[]).append(sub)
   ....: 
10 loops, best of 3: 25.5 ms per loop

In [17]: lists = l

In [18]: %%timeit
   ....: groupers = set(l[2] for l in lists)
   ....: [filter(lambda x: x[2] == y, lists) for y in groupers]
   ....: 

1 loops, best of 3: 8.48 s per loop

In [19]: timeit l = [filter(lambda x: x[2] == y, lists) for y in   set(l[2] for l in lists)]
1 loops, best of 3: 8.29 s per loop

So if order does not matter, the defaultdict wins; groupby still performs pretty well, since sorting is still cheap compared to a quadratic approach. And as you can see, the quadratic complexity of the filter approach performs badly as the data grows.
