Python - Delete duplicates in a dataframe based on combinations of two columns?

Question

I have a dataframe with 3 columns in Python:

Name1 Name2 Value
Juan  Ale   1
Ale   Juan  1

and would like to eliminate the duplicates based on combinations of the Name1 and Name2 columns.

In my example the two rows contain the same pair of names, only in a different order. I would like to delete the second row and keep the first, so the end result should be:

Name1 Name2 Value
Juan  Ale   1

Any ideas would be really appreciated!

Recommended answer

You can convert each (Name1, Name2) pair to a frozenset and use pd.DataFrame.duplicated.

# Build an order-insensitive key per row, then keep only the first row for each key
res = df[~df[['Name1', 'Name2']].apply(frozenset, axis=1).duplicated()]

print(res)

  Name1 Name2  Value
0  Juan   Ale      1

frozenset is required rather than set because duplicated relies on hashing to detect duplicates, and a set is mutable and therefore unhashable.
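
To illustrate the hashing point, here is a minimal sketch. The sample dataframe is rebuilt from the question, and the variable names (set_is_hashable, same_pair, keys, mask) are made up for this illustration:

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({"Name1": ["Juan", "Ale"],
                   "Name2": ["Ale", "Juan"],
                   "Value": [1, 1]})

# A plain set is mutable, so Python refuses to hash it
try:
    hash({"Juan", "Ale"})
    set_is_hashable = True
except TypeError:
    set_is_hashable = False

# A frozenset is hashable and ignores element order
same_pair = frozenset(["Juan", "Ale"]) == frozenset(["Ale", "Juan"])

# One frozenset key per row; duplicated() flags later repeats of a key
keys = df[["Name1", "Name2"]].apply(frozenset, axis=1)
mask = keys.duplicated()
res = df[~mask]
print(res)
```

Because both rows map to the same frozenset key, only the second one is flagged as a duplicate and dropped.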

This approach scales better with the number of columns than with the number of rows, since apply with axis=1 calls frozenset once per row. For a large number of rows, use @Wen's sort-based algorithm.
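
@Wen's answer is not reproduced on this page, so the following is only a sketch of one possible sort-based approach, not necessarily that answer's exact code. It assumes the two name columns hold mutually comparable values (strings here), and the names sorted_pairs, mask and res_sorted are hypothetical:

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({"Name1": ["Juan", "Ale"],
                   "Name2": ["Ale", "Juan"],
                   "Value": [1, 1]})

# Sort the two names within each row so that (Juan, Ale) and (Ale, Juan)
# both become the same ordered pair (Ale, Juan)
sorted_pairs = np.sort(df[["Name1", "Name2"]].to_numpy(), axis=1)

# Flag rows whose sorted pair has already appeared; keep the original index
# so the mask aligns with df
mask = pd.DataFrame(sorted_pairs, index=df.index).duplicated()

res_sorted = df[~mask]
print(res_sorted)
```

The sort runs over the whole array in one NumPy call rather than invoking a Python function per row, which is why this style tends to suit dataframes with many rows.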
