How to remove case-insensitive duplicates from a list, while maintaining the original list order?


Problem description

I have a list of strings such as:

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]

I want this outcome (and this is the only acceptable outcome):

myList = ["paper", "Plastic", "aluminum", "tin", "glass", "Polypropylene Plastic"]

Note that if an item ("Polypropylene Plastic") happens to contain another item ("Plastic"), I would still like to retain both items. So the casing may differ, but an item must match another item letter for letter in order to be removed.

The original list order must be retained. All duplicates after the first instance of that item should be removed. The original case of that first instance should be preserved, as well as the original cases of all non-duplicate items.

I've searched and only found questions that address one need or the other, not both.

Solution

It's difficult to code this with a list comprehension (at least without sacrificing clarity), because filtering out duplicates requires remembering which items have already been seen.

It's also not possible to use a set comprehension because it destroys the original order.
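To make the problem with a set comprehension concrete, here is a small sketch (the name lowered is only illustrative and not part of the original answer). It finds the right distinct keys, but the result is an unordered set of lowercased strings, so both the original order and the original casing are gone:

```python
myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]

# A set comprehension does remove the case-insensitive duplicates...
lowered = {s.lower() for s in myList}

# ...but a set has no defined order, and only the lowercased keys survive.
print(len(lowered))          # 6 distinct items
print("Plastic" in lowered)  # False: the original casing was lost
```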

The classic way is a loop with an auxiliary set that stores the lowercase version of each string encountered so far. Append a string to the result list only if its lowercased version isn't already in the set:

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]
result = []
marker = set()   # lowercased strings seen so far

for item in myList:
    key = item.lower()
    if key not in marker:     # first occurrence, ignoring case
        marker.add(key)
        result.append(item)   # keep original casing and order

print(result)

result:

['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']

Using .casefold() instead of .lower() also handles subtle "casing" differences in some languages (like the German double "s" in Strasse/Straße).
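As a minimal sketch of the casefold() variant, the loop can be wrapped in a reusable function. The function name unique_casefold and the sample list words are illustrative assumptions, not part of the original answer:

```python
def unique_casefold(items):
    """Return items without case-insensitive duplicates, keeping first occurrences in order."""
    seen = set()      # casefolded strings encountered so far
    result = []
    for item in items:
        key = item.casefold()
        if key not in seen:
            seen.add(key)
            result.append(item)   # original casing of the first occurrence
    return result


words = ["Straße", "STRASSE", "strasse", "Tin", "tin"]
print(unique_casefold(words))  # ['Straße', 'Tin']

# .lower() keeps the "ß", so these two spellings would not be treated as equal:
print("Straße".lower() == "STRASSE".lower())        # False
print("Straße".casefold() == "STRASSE".casefold())  # True
```

Returning a new list instead of modifying the input in place keeps the function free of side effects.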

Edit: it is possible to do that with a list comprehension, but it's really hacky:

marker = set()
result = [not marker.add(x.casefold()) and x for x in myList if x.casefold() not in marker]

It applies not to the None returned by set.add, which both calls the function (a side effect in a list comprehension, rarely a good thing...) and yields True, so the and expression returns x no matter what. The main disadvantages are:

  • readability
  • the fact that casefold() is called twice, once for testing, once for storing in the marker set
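If the double casefold() call is a concern, one possible alternative (not from the original answer) is a dict keyed by the casefolded string. This sketch assumes Python 3.7 or later, where dicts are guaranteed to preserve insertion order; the name first_seen is illustrative:

```python
myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]

first_seen = {}
for x in myList:
    # setdefault stores x only if its casefolded key is not present yet,
    # so later duplicates never overwrite the first occurrence.
    first_seen.setdefault(x.casefold(), x)

result = list(first_seen.values())
print(result)  # ['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']
```

Here casefold() is called once per item, and the order of the values follows the order in which the keys were first inserted.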
