Drop unordered duplicates across separate columns
Question
I am trying to return a df where duplicate rows have been removed. I have tried df.drop_duplicates(), but the values in the subset columns aren't ordered: two rows can hold the same pair of values in a different order, and they are not treated as duplicates.
For instance, using the df below, if I try to remove duplicates across Item_X and Item_Y, I get the same df back, whereas the intended output drops the second row.
import pandas as pd
d = ({
'Item_X' : ['Foo','Bar','Bot','Bot','Bar','Foo'],
'Item_Y' : ['Bar','Foo','Foo','Bot','Bar','Foo'],
'Value' : [1,2,3,4,5,6],
})
df = pd.DataFrame(data = d)
df.drop_duplicates(subset=['Item_X','Item_Y'])
Intended output:
Item_X Item_Y Value
0 Foo Bar 1
2 Bot Foo 3
3 Bot Bot 4
4 Bar Bar 5
5 Foo Foo 6
Actual output (incorrect):
Item_X Item_Y Value
0 Foo Bar 1
1 Bar Foo 2
2 Bot Foo 3
3 Bot Bot 4
4 Bar Bar 5
5 Foo Foo 6
What would be the most efficient way to tackle this problem?
Recommended answer
You'll need to sort the values within each row (along the horizontal axis, axis=1), then build a boolean mask to subset the original frame. Here's how you can use np.sort and DataFrame.duplicated to do that:
import numpy as np
df[~pd.DataFrame(np.sort(df[['Item_X', 'Item_Y']], axis=1)).duplicated()]
Item_X Item_Y Value
0 Foo Bar 1
2 Bot Foo 3
3 Bot Bot 4
4 Bar Bar 5
5 Foo Foo 6
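The one-liner above can be unpacked into separate steps, which may make it easier to follow. This is a minimal sketch, not the only way to do it: the names cols, sorted_pairs, is_dup and result are illustrative and do not come from the original answer. It also passes index=df.index when building the helper frame. That is an addition here, on the assumption that the mask should still line up with df when df does not have the default 0..n-1 index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Item_X': ['Foo', 'Bar', 'Bot', 'Bot', 'Bar', 'Foo'],
    'Item_Y': ['Bar', 'Foo', 'Foo', 'Bot', 'Bar', 'Foo'],
    'Value': [1, 2, 3, 4, 5, 6],
})

cols = ['Item_X', 'Item_Y']  # columns to compare regardless of order

# Step 1: sort the values within each row, so that ('Foo', 'Bar') and
# ('Bar', 'Foo') both become ('Bar', 'Foo').
sorted_pairs = np.sort(df[cols].to_numpy(), axis=1)

# Step 2: flag every row whose sorted pair has already appeared earlier.
# Reusing df.index keeps the mask aligned with the original frame.
is_dup = pd.DataFrame(sorted_pairs, index=df.index).duplicated()

# Step 3: keep only the first occurrence of each unordered pair.
result = df[~is_dup]
print(result)
```

Only the mask is built from the sorted values. The rows in result keep their original Item_X and Item_Y order, and the first occurrence of each pair is the one that survives.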