Pandas: remove duplicates that exist in any order

Question

My question is similar to "Pandas: remove reverse duplicates from dataframe", but I have an additional requirement: I need to maintain row value pairs.

For example:

I have data where column A corresponds to column C and column B corresponds to column D.

import pandas as pd

# Initial data frame
data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50], 
                     'B': [50, 22, 35, 5, 10, 11, 21, 0],
                     'C': ["a", "b", "r", "x", "c", "w", "z", "y"],
                     'D': ["y", "c", "w", "z", "b", "r", "x", "a"]})
data

#    A   B  C  D
#0   0  50  a  y
#1  10  22  b  c
#2  11  35  r  w
#3  21   5  x  z
#4  22  10  c  b
#5  35  11  w  r
#6   5  21  z  x
#7  50   0  y  a

I would like to remove duplicates that exist in columns A and B in either order, but I need to preserve their corresponding letter values in columns C and D.

I have a solution here, but is there a more elegant way of doing this?

# Desired data frame
new_data = pd.DataFrame()

# Concat numbers and corresponding letters
new_data['AC'] = data['A'].astype(str) + ',' + data['C']
new_data['BD'] = data['B'].astype(str) + ',' + data['D']

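# Drop duplicates despite order: sort the two strings in each row so that
# (x, y) and (y, x) become identical rows
new_data = new_data.apply(lambda r: pd.Series(sorted(r), index=r.index),
                          axis=1).drop_duplicates()

# Recreate dataframe with positional column labels
# (pd.DataFrame.from_items was removed in pandas 1.0)
new_data = pd.DataFrame(new_data.values, index=new_data.index)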
new_data = pd.concat([new_data.iloc[:,0].str.split(',', expand=True),
                      new_data.iloc[:,1].str.split(',', expand=True)], axis=1)
new_data.columns=['A', 'B', 'C', 'D']
new_data

#    A  B   C  D
#0   0  a  50  y
#1  10  b  22  c
#2  11  r  35  w
#3  21  x   5  z

EDIT: technically, the output should look like this:

new_data.columns=['A', 'C', 'B', 'D']
new_data

#    A  C   B  D
#0   0  a  50  y
#1  10  b  22  c
#2  11  r  35  w
#3  21  x   5  z

Recommended answer

I think that you can do this with stack, drop_duplicates and unstack:

data.set_index(['A','B']).stack().drop_duplicates().unstack().reset_index()

    A   B  C  D
0   0  50  a  y
1  10  22  b  c
2  11  35  r  w
3  21   5  x  z
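
Note that the chain above deduplicates on the letter values in C and D, so it relies on each letter belonging to only one pair, as it does in the sample data. If that cannot be guaranteed, one possible alternative is to build an order-independent key from A and B directly. The following is only a sketch under that assumption, not part of the original answer; the name `key` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0],
                     'C': ["a", "b", "r", "x", "c", "w", "z", "y"],
                     'D': ["y", "c", "w", "z", "b", "r", "x", "a"]})

# Sort each (A, B) pair so that (22, 10) and (10, 22) produce the same key.
# `key` is a hypothetical helper name used only in this sketch.
key = pd.DataFrame(np.sort(data[['A', 'B']].to_numpy(), axis=1),
                   index=data.index)

# Keep the first row seen for each unordered pair; C and D stay with their row.
new_data = data[~key.duplicated()].reset_index(drop=True)
print(new_data)
```

On the sample data this keeps the first four rows, the same rows as the stack-based chain, and it leaves A and B as integers because the original rows are selected rather than rebuilt.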
