Drop duplicate list elements in column of lists
Problem description
This is my dataframe:
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 3, 4, 5, 3, 3],
                   'B': [0, 2, 3, 4, 5, 6, 7],
                   'C': [[1,4,4,4], [1,4,4,4], [3,4,4,5], [3,4,4,5], [4,4,2,1], [1,2,3,4], [7,8,9,1]]})
I want to drop the duplicate values inside each list in column C, row by row, without dropping duplicate rows.
This is what I hope to get:
pd.DataFrame({'A':[1, 3, 3, 4, 5, 3, 3],
'B':[0, 2, 3, 4, 5, 6, 7],
'C':[[1,4], [1,4], [3,4,5], [3,4,5], [4,2,1], [1,2,3,4,], [7,8,9,1]]})
If you're using Python 3.7 or later, you could map with dict.fromkeys and obtain a list from the dictionary keys (the version is relevant because dicts are guaranteed to maintain insertion order from 3.7 onward):
df['C'] = df.C.map(lambda x: list(dict.fromkeys(x).keys()))
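To see what the lambda does to a single row, here is a minimal sketch on plain Python lists. The helper name dedupe_keep_order is made up for illustration, and the rows are three of the lists from column C in the question:

```python
# Rows taken from column C of the question's DataFrame.
rows = [[1, 4, 4, 4], [3, 4, 4, 5], [4, 4, 2, 1]]

def dedupe_keep_order(values):
    """Drop repeated elements, keeping the first occurrence of each in order."""
    # dict.fromkeys builds a dict whose keys are the unique elements;
    # iterating a dict yields its keys, so calling .keys() is optional.
    return list(dict.fromkeys(values))

deduped = [dedupe_keep_order(row) for row in rows]
print(deduped)  # [[1, 4], [3, 4, 5], [4, 2, 1]]
```

With pandas, the same helper could be passed straight to map, e.g. df['C'].map(dedupe_keep_order). Note that the list elements must be hashable, since they become dictionary keys.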
For older Python versions, you can use collections.OrderedDict:
from collections import OrderedDict
df['C'] = df.C.map(lambda x: list(OrderedDict.fromkeys(x).keys()))
print(df)
A B C
0 1 0 [1, 4]
1 3 2 [1, 4]
2 3 3 [3, 4, 5]
3 4 4 [3, 4, 5]
4 5 5 [4, 2, 1]
5 3 6 [1, 2, 3, 4]
6 3 7 [7, 8, 9, 1]
As mentioned by cs95 in the comments, if we don't need to preserve order, we could go with a set for a more concise approach:
df['C'] = df.C.map(lambda x: [*{*x}])
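The trade-off with the set version is that the output order is not tied to the input order. A small sketch on one row from the question, where the sorted variant is my own addition and assumes the elements are comparable:

```python
row = [4, 4, 2, 1]

# Unpacking a set into a list removes duplicates, but the resulting
# order is an implementation detail and should not be relied on.
unordered = [*{*row}]

# If a deterministic order is wanted, sorting is one option.
# This gives sorted order, not the original order of first occurrence.
as_sorted = sorted(set(row))
print(as_sorted)  # [1, 2, 4]
```

If the original order matters, as in the expected output [4, 2, 1] from the question, the dict.fromkeys approach is the safer choice.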
Since several approaches have been proposed and it is hard to tell how they will perform on large dataframes, it is probably worth benchmarking them:
import numpy as np
import perfplot

# Enlarge the frame to 350,000 rows so that every n in n_range has enough data.
df = pd.concat([df]*50000, axis=0).reset_index(drop=True)

perfplot.show(
    setup=lambda n: df.iloc[:int(n)],
    kernels=[
        lambda df: df.C.map(lambda x: list(dict.fromkeys(x).keys())),
        lambda df: df['C'].map(lambda x: pd.factorize(x)[1]),
        lambda df: [np.unique(item) for item in df['C'].values],
        lambda df: df['C'].explode().groupby(level=0).unique(),
        lambda df: df.C.map(lambda x: [*{*x}]),
    ],
    labels=['dict.fromkeys', 'factorize', 'np.unique', 'explode', 'set'],
    n_range=[2**k for k in range(0, 18)],
    xlabel='N',
    equality_check=None  # kernels return different types and orderings, so skip the check
)
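If perfplot is not installed, a rough comparison can be sketched with the standard-library timeit module. This is only an approximation: it times the per-row functions over a plain list of lists rather than Series.map, it covers just two of the approaches, and the row count and repeat settings are arbitrary choices of mine:

```python
import timeit

# Synthetic stand-in for column C: the question's seven lists, repeated.
base = [[1, 4, 4, 4], [1, 4, 4, 4], [3, 4, 4, 5], [3, 4, 4, 5],
        [4, 4, 2, 1], [1, 2, 3, 4], [7, 8, 9, 1]]
rows = base * 1000  # 7,000 rows; size chosen arbitrarily

candidates = {
    'dict.fromkeys': lambda: [list(dict.fromkeys(r)) for r in rows],
    'set': lambda: [[*{*r}] for r in rows],
}

timings = {}
for name, func in candidates.items():
    # Take the best of 3 repeats of 5 runs each to reduce noise.
    timings[name] = min(timeit.repeat(func, number=5, repeat=3))
    print(f'{name}: {timings[name]:.4f} s for 5 runs')
```

The absolute numbers depend on the machine and Python version, so they are only useful for comparing the candidates against each other on the same run.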