Pandas: find duplicates in cross values


Problem description

I have a dataframe and want to eliminate duplicate rows that hold the same values, but in different columns:

import pandas as pd

# Rows '1' and '2' hold the same values, with the pairs (a, b) and (c, d) swapped.
df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})

df
Out[8]: 
   a  b  c  d
1  x  y  e  f
2  e  f  x  y
3  w  v  s  t

Rows [1] and [2] both contain the values {x, y, e, f}, but arranged crosswise: if you swapped columns c, d with a, b in row [2], it would be an exact duplicate of row [1]. I want to drop such rows and keep only one of them, to get the final output:

df_new
Out[20]: 
   a  b  c  d
1  x  y  e  f
3  w  v  s  t

How can I achieve that efficiently?

Recommended answer

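I think you need to filter with boolean indexing, using a mask built from numpy.sort and duplicated; use ~ to invert it. Sorting the values within each row makes rows that hold the same values in a different column order identical, so duplicated can flag them: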

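import numpy as np

# Assign to df_new so the original df stays available for the step-by-step details.
df_new = df[~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()]
print (df_new)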
   a  b  c  d
1  x  y  e  f
3  w  v  s  t
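
The one-liner above can also be wrapped in a small reusable function. This is only a sketch: the name `drop_unordered_duplicates` is made up for illustration, and the `keep` argument is simply passed through to pandas' `duplicated`, so `keep='last'` would retain row 2 instead of row 1.

```python
import numpy as np
import pandas as pd


def drop_unordered_duplicates(df, keep="first"):
    """Drop rows whose values match another row in any column order.

    Hypothetical helper name; `keep` is forwarded to DataFrame.duplicated.
    """
    # Sort the values inside each row so column order no longer matters.
    sorted_rows = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index)
    # duplicated() marks the repeats; ~ inverts the mask to keep the rest.
    return df[~sorted_rows.duplicated(keep=keep)]


df = pd.DataFrame(
    {
        "a": ["x", "e", "w"],
        "b": ["y", "f", "v"],
        "c": ["e", "x", "s"],
        "d": ["f", "y", "t"],
    },
    index=["1", "2", "3"],
)

df_new = drop_unordered_duplicates(df)
df_last = drop_unordered_duplicates(df, keep="last")
print(df_new)
```

Building the mask on a separate sorted frame leaves the original column order of the surviving rows untouched.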

Details:

print (np.sort(df, axis=1))
[['e' 'f' 'x' 'y']
 ['e' 'f' 'x' 'y']
 ['s' 't' 'v' 'w']]

print (pd.DataFrame(np.sort(df, axis=1), index=df.index))
   0  1  2  3
1  e  f  x  y
2  e  f  x  y
3  s  t  v  w

print (pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1    False
2     True
3    False
dtype: bool

print (~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1     True
2    False
3     True
dtype: bool
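
One thing to keep in mind: sorting the whole row treats any permutation of the four values as a duplicate, not only the swap of (a, b) with (c, d) described in the question. If only the pair swap should count, a stricter variant could look like the sketch below. The function name `drop_swapped_pair_duplicates`, its `left` and `right` parameters, and the extra row '4' are assumptions added for illustration.

```python
import pandas as pd


def drop_swapped_pair_duplicates(df, left=("a", "b"), right=("c", "d")):
    """Keep the first row of each group that is equal up to swapping the pairs.

    Hypothetical helper: `left` and `right` name the two column pairs.
    """
    left_pairs = zip(*(df[col] for col in left))
    right_pairs = zip(*(df[col] for col in right))

    seen = set()
    keep = []
    for left_pair, right_pair in zip(left_pairs, right_pairs):
        # Order the two pairs so (left, right) and (right, left) share one key.
        key = tuple(sorted((left_pair, right_pair)))
        keep.append(key not in seen)
        seen.add(key)

    return df[pd.Series(keep, index=df.index)]


df = pd.DataFrame(
    {
        "a": ["x", "e", "w", "y"],
        "b": ["y", "f", "v", "x"],
        "c": ["e", "x", "s", "e"],
        "d": ["f", "y", "t", "f"],
    },
    index=["1", "2", "3", "4"],
)

result = drop_swapped_pair_duplicates(df)
print(result)
```

Here row '4' holds the same four values as row '1', but with x and y exchanged inside the first pair, so this variant keeps it, whereas the full-row sort would drop it. The Python-level loop is slower than the vectorized np.sort approach on large frames.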
