如何从列表中删除相似但不重复的项目? [英] How can I remove similar but not duplicate items from a list?

查看:83
本文介绍了如何从列表中删除相似但不重复的项目?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个列表:

values = [[6.23234121,6.23246575],[1.352672,1.352689],[6.3245,123.35323,2.3]]

如何浏览此列表并将 0.01 之内的所有项目删除到同一列表中的其他元素?

What is a way I can go through this list and remove all items that are within say 0.01 to other elements in the same list.

我知道如何使用del对一组特定的列表执行此操作,但我希望它具有一般性,如果值中包含n个列表,并且每个列表具有n个元素.

I know how to do it for a specific set of lists using del, but I want it to be general for if values has n lists in it and each list has n elements.

我想要发生的是在此列表上执行一些操作

What I want to happen is perform some operation on this list

values = [[6.23234121,6.23246575],[1.352672,1.352689],[6.3245,123.35323,2.3]]

并获得此输出

new_values = [[6.23234121],[1.352672],[6.3245,123.35323,2.3]]

推荐答案

我将编写一个函数来为单个列表执行此操作,例如

I'm going to write a function to do this for a single list, eg

>>> compact([6.23234121,6.23246575], tol=.01)
[6.23234121]

然后您就可以通过[compact(l) for l in lst]将其用于嵌套结构.

You can then get it to work on your nested structure through just [compact(l) for l in lst].

这些方法中的每一个都会将第一个没有任何关系的元素保留在列表中;对于@DSM的[0, 0.005, 0.01, 0.015, 0.02]示例,它们都将返回[0, 0.0.15](或者,如果将>切换为>=,则返回[0, 0.01, 0.02]).如果您想要其他与众不同的东西,则必须更精确地定义它.

Each of these methods will keep the first element that doesn't have anything closer to it in the list; for @DSM's example of [0, 0.005, 0.01, 0.015, 0.02] they'd all return [0, 0.0.15] (or, if you switch > to >=, [0, 0.01, 0.02]). If you want something different, you'll have to define exactly what it is more carefully.

首先,一种简单的方法,类似于David的回答.这是O(n ^ 2):

First, the easy approach, similar to David's answer. This is O(n^2):

def compact(lst, tol):
    new = []
    for el in lst:
        if all(abs(el - x) > tol for x in new):
            new.append(el)
    return compact


在三元素列表中,这非常好.但是,如果您想在三百万个元素的列表上执行此操作,那将不会减少它.让我们尝试一些不同的东西:


On three-element lists, that's perfectly nice. If you want to do it on three million-element lists, though, that's not going to cut it. Let's try something different:

import collections
import math

def compact(lst, tol):
    round_digits = -math.log10(tol) - 1
    seen = collections.defaultdict(set)
    new = []
    for el in lst:
        rounded = round(seen, round_digits)
        if all(abs(el - x) > tol for x in seen[rounded]):
            seen[rounded].add(el)
            new.append(el)
    return new

如果您的tol0.01,则round_digits是1.因此6.23234121seen中被索引为6.2.然后,我们看到6.23246575时,将其四舍五入为6.2并在索引中向上查找,该索引应包含所有可能在所查找数字的tol之内的数字.然后,我们仍然必须检查与这些数字的距离,但只能检查索引箱中的极少数数字,而不是整个列表.

If your tol is 0.01, then round_digits is 1. So 6.23234121 is indexed in seen as just 6.2. When we then see 6.23246575, we round it to 6.2 and look that up in the index, which should contain all numbers that could possibly be within tol of the number we're looking up. Then we still have to check distances to those numbers, but only on the very few numbers that are in that index bin, instead of the entire list.

这种方法是O(n k),其中k是将落入一个这样的bin中的元素的平均数量.仅当k<< n(通常会这样,但这取决于您使用的相对于tol的数字的分布).请注意,它使用的内存也可能是其他方法的两倍以上,这对于很大的列表可能是个问题.

This approach is O(n k), where k is the average number of elements that'll fall within one such bin. It'll only be helpful if k << n (as it typically would be, but that depends on the distribution of the numbers you're using relative to tol). Note that it also uses probably more than twice as much memory as the other approach, which could be an issue for very large lists.

另一种选择是首先对列表进行排序;那么您只需查看上一个和下一个元素即可检查是否存在冲突.

Another option would be to sort the list first; then you only have to look at the previous and following elements to check for a conflict.

这篇关于如何从列表中删除相似但不重复的项目?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆