Removing duplicates from a list of lists

Problem description

I have a list of lists in Python:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

I want to remove the duplicate elements from it. If it were a flat list rather than a list of lists, I could just use set. Unfortunately, lists are not hashable, so I can't make a set of lists, only of tuples. I could convert every inner list to a tuple, use set, and then convert back to lists, but that doesn't seem fast.
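
For reference, that tuple round-trip is just a couple of lines (a minimal sketch; order is not preserved and the variable names are only illustrative):

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

# turn each inner list into a hashable tuple, deduplicate via set,
# then turn the survivors back into lists (order is lost)
deduped = [list(t) for t in set(tuple(inner) for inner in k)]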

How can this be done in the most efficient way?

The result for the list above should be:

k = [[5, 6, 2], [1, 2], [3], [4]]

I don't care about preserving order.

Note: this question is similar, but not quite what I need. I searched SO but didn't find an exact duplicate.

Benchmarking:

import itertools, time


class Timer(object):
    def __init__(self, name=None):
        self.name = name

    def __enter__(self):
        self.tstart = time.time()

    def __exit__(self, type, value, traceback):
        if self.name:
            print '[%s]' % self.name,
        print 'Elapsed: %s' % (time.time() - self.tstart)


k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [5, 2], [6], [8], [9]] * 5
N = 100000

print len(k)

with Timer('set'):
    for i in xrange(N):
        kt = [tuple(i) for i in k]
        skt = set(kt)
        kk = [list(i) for i in skt]


with Timer('sort'):
    for i in xrange(N):
        ks = sorted(k)
        dedup = [ks[i] for i in xrange(len(ks)) if i == 0 or ks[i] != ks[i-1]]


with Timer('groupby'):
    for i in xrange(N):
        k = sorted(k)
        dedup = list(k for k, _ in itertools.groupby(k))

with Timer('loop in'):
    for i in xrange(N):
        new_k = []
        for elem in k:
            if elem not in new_k:
                new_k.append(elem)

循环"(二次方法)是短列表中最快的.对于长列表,它比 groupby 方法之外的所有人都快.这有意义吗?

"loop in" (quadratic method) fastest of all for short lists. For long lists it's faster then everyone except groupby method. Does this make sense?

For the short list (the one in the code), 100000 iterations:

[set] Elapsed: 1.3900001049
[sort] Elapsed: 0.891000032425
[groupby] Elapsed: 0.780999898911
[loop in] Elapsed: 0.578000068665

For the longer list (the one in the code, duplicated 5 times):

[set] Elapsed: 3.68700003624
[sort] Elapsed: 3.43799996376
[groupby] Elapsed: 1.03099989891
[loop in] Elapsed: 1.85900020599

Answer

>>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
>>> import itertools
>>> k.sort()
>>> list(k for k,_ in itertools.groupby(k))
[[1, 2], [3], [4], [5, 6, 2]]

itertools often offers the fastest and most powerful solutions to this kind of problem, and is well worth getting intimately familiar with!-)

Edit: as I mention in a comment, normal optimization efforts focus on large inputs (the big-O approach) because it's so much easier and offers good returns on effort. But sometimes (essentially for "tragically crucial bottlenecks" in deep inner loops of code that's pushing the boundaries of performance limits) one may need to go into much more detail: providing probability distributions, deciding which performance measures to optimize (maybe the upper bound or the 90th percentile matters more than the average or median, depending on one's apps), performing possibly-heuristic checks at the start to pick different algorithms depending on input data characteristics, and so forth.

Careful measurements of "point" performance (code A vs. code B for a specific input) are a part of this extremely costly process, and the standard library module timeit helps here. However, it's easier to use at a shell prompt. For example, here's a short module that showcases the general approach to this problem; save it as nodup.py:

import itertools

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

def doset(k, map=map, list=list, set=set, tuple=tuple):
  return map(list, set(map(tuple, k)))

def dosort(k, sorted=sorted, xrange=xrange, len=len):
  ks = sorted(k)
  return [ks[i] for i in xrange(len(ks)) if i == 0 or ks[i] != ks[i-1]]

def dogroupby(k, sorted=sorted, groupby=itertools.groupby, list=list):
  ks = sorted(k)
  return [i for i, _ in groupby(ks)]

def donewk(k):
  newk = []
  for i in k:
    if i not in newk:
      newk.append(i)
  return newk

# sanity check that all functions compute the same result and don't alter k
if __name__ == '__main__':
  savek = list(k)
  for f in doset, dosort, dogroupby, donewk:
    resk = f(k)
    assert k == savek
    print '%10s %s' % (f.__name__, sorted(resk))

Note the sanity check (performed when you just run python nodup.py) and the basic hoisting technique (making constant global names local to each function, for speed) to put things on an equal footing.

Now we can run checks on the tiny example list:

$ python -mtimeit -s'import nodup' 'nodup.doset(nodup.k)'
100000 loops, best of 3: 11.7 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dosort(nodup.k)'
100000 loops, best of 3: 9.68 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dogroupby(nodup.k)'
100000 loops, best of 3: 8.74 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.donewk(nodup.k)'
100000 loops, best of 3: 4.44 usec per loop

confirming that the quadratic approach has small-enough constants to make it attractive for tiny lists with few duplicated values. With a short list and no duplicates:

$ python -mtimeit -s'import nodup' 'nodup.donewk([[i] for i in range(12)])'
10000 loops, best of 3: 25.4 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dogroupby([[i] for i in range(12)])'
10000 loops, best of 3: 23.7 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.doset([[i] for i in range(12)])'
10000 loops, best of 3: 31.3 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dosort([[i] for i in range(12)])'
10000 loops, best of 3: 25 usec per loop

the quadratic approach isn't bad, but the sort and groupby ones are better. Etc, etc.

If (as the obsession with performance suggests) this operation is in a core inner loop of your boundary-pushing application, it's worth trying the same set of tests on other representative input samples, possibly detecting some simple measure that could heuristically let you pick one approach or the other (but the measure must be fast, of course).
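
For example, a hypothetical dispatcher along these lines (the length threshold here is purely an assumption and would have to be calibrated with timeit on your own representative data):

import itertools

def dedup(k, smallish=20):
    # guessed cutoff: below it, the quadratic "loop in" approach tends to
    # win on constant factors; above it, sort + groupby takes over
    if len(k) <= smallish:
        result = []
        for item in k:
            if item not in result:
                result.append(item)
        return result
    ks = sorted(k)
    return [item for item, _ in itertools.groupby(ks)]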

It's also well worth considering keeping a different representation for k -- why does it have to be a list of lists rather than a set of tuples in the first place? If the duplicate-removal task is frequent, and profiling shows it to be the program's performance bottleneck, then keeping a set of tuples all the time and building a list of lists from it only if and where needed might be faster overall.
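
A minimal sketch of that alternative representation (the names are illustrative, not from the original answer):

# keep the canonical data as a set of tuples, so duplicates can never
# accumulate in the first place
rows = set()
rows.add((1, 2))
rows.add((5, 6, 2))
rows.add((1, 2))   # already present, silently ignored

# materialize a list of lists only if and where some API requires it
as_list_of_lists = [list(t) for t in rows]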
