(pandas) Drop duplicates based on subset where order doesn't matter


Problem Description


What is the proper way to go from this df:

>>> df=pd.DataFrame({'a':['jeff','bob','jill'], 'b':['bob','jeff','mike']})
>>> df
      a     b
0  jeff   bob
1   bob  jeff
2  jill  mike

To this:

>>> df2
      a     b
0  jeff   bob
2  jill  mike

where you're dropping a duplicate row based on the items in 'a' and 'b', without regard to their specific column.

I can hack together a solution using a lambda expression to create a mask and then drop duplicates based on the mask column, but I'm thinking there has to be a simpler way than this:

>>> df['c'] = df[['a', 'b']].apply(lambda x: ''.join(sorted((x[0], x[1]), \
 key=lambda x: x[0]) + sorted((x[0], x[1]), key=lambda x: x[1] )), axis=1)
>>> df.drop_duplicates(subset='c', keep='first', inplace=True)
>>> df = df.iloc[:,:-1]
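Not part of the original question, but one simpler variant of the same masking idea is to build an order-insensitive, hashable key with `frozenset` instead of concatenating sorted strings (this is my own sketch, not the asker's code):

```python
import pandas as pd

df = pd.DataFrame({'a': ['jeff', 'bob', 'jill'], 'b': ['bob', 'jeff', 'mike']})

# frozenset({'jeff', 'bob'}) == frozenset({'bob', 'jeff'}), and frozensets are
# hashable, so Series.duplicated() can use them directly as a dedup key.
key = df[['a', 'b']].apply(frozenset, axis=1)
df2 = df[~key.duplicated()]
```

One caveat: a frozenset collapses repeated values within a row, so with three or more columns, rows like `('bob', 'bob', 'jeff')` and `('bob', 'jeff', 'jeff')` would wrongly compare equal; for two distinct-valued columns as here it is safe.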

Solution

I think you can sort each row independently and then use duplicated to see which ones to drop.

dupes = df.apply(lambda x: x.sort_values().values, axis=1).duplicated()
df[~dupes]

A faster way to get dupes. Thanks to @DSM.

dupes = df.T.apply(sorted).T.duplicated()
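A note on portability: the exact return type of `DataFrame.apply` when the function returns a list or array has varied across pandas versions, and `duplicated()` raises on unhashable elements, so the two `apply`-based snippets above may not run unchanged on recent releases. A version-robust sketch of the same row-sorting idea (my own variant, using `np.sort` rather than the answer's code) is:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['jeff', 'bob', 'jill'], 'b': ['bob', 'jeff', 'mike']})

# Sort each row's values so ('jeff', 'bob') and ('bob', 'jeff') compare equal,
# then mark rows whose sorted contents have already been seen.
sorted_rows = pd.DataFrame(np.sort(df.values, axis=1), index=df.index)
dupes = sorted_rows.duplicated()

df2 = df[~dupes]
```

`np.sort(..., axis=1)` sorts within each row and always yields an array of the original shape, so the result wraps cleanly back into a DataFrame for `duplicated()`.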
