Comparing one record with all others to remove duplicates - Python or R


Problem description

I have a data set that contains all the World Cup matches, with the columns Date, Team A, Team B and some others. The data set has duplicates in it: for an India vs Australia match, for example, there are two records with the teams swapped, as below.

DATE          Team A      Team B
24-May-1983   India       Australia
24-May-1983   Australia   India

I can remove the duplicate records with two nested for loops in Python, but that is inefficient: it needs N * M comparisons and a lot of if conditions inside the loops. Is there an efficient way to do this in Python or R?

Thanks in advance.

Recommended answer

Something like this is probably OK. You just need to put the two teams in alphabetical order, so that it no longer matters which one was recorded as Team A and which as Team B. Two mirrored records then share the same date and the same sorted pair, and pandas can detect them as ordinary duplicates:

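import pandas as pd

# The two sample records from the question.
df = pd.DataFrame({
    'DATE': ['24-May-1983', '24-May-1983'],
    'Team A': ['India', 'Australia'],
    'Team B': ['Australia', 'India'],
})

# Build an order-independent key: sort the two team names in each row.
df['team_tuple'] = df.apply(
    lambda row: tuple(sorted((row['Team A'], row['Team B']))),
    axis='columns'
)
print(df)
#           DATE     Team A     Team B          team_tuple
# 0  24-May-1983      India  Australia  (Australia, India)
# 1  24-May-1983  Australia      India  (Australia, India)

# Flag every row whose (DATE, team_tuple) has already appeared, then drop it.
duplicates = df.duplicated(subset=['DATE', 'team_tuple'])
cleaned_df = df.loc[~duplicates, :]
print(cleaned_df)
#           DATE Team A     Team B          team_tuple
# 0  24-May-1983  India  Australia  (Australia, India)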
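
If pandas is not available, the same idea can be sketched in plain Python: build the key (date, sorted team pair) for each record and remember the keys already seen in a set, so the data is scanned once instead of comparing every record with every other. This is only an illustration under some assumptions: the function name drop_mirrored_matches is made up here, records are assumed to be (date, team_a, team_b) tuples, and the third sample row is invented to show a non-duplicate.

```python
def drop_mirrored_matches(records):
    """Keep the first record for each (date, unordered team pair)."""
    seen = set()   # keys encountered so far
    kept = []      # records that survive, in their original order
    for date, team_a, team_b in records:
        # Sorting the two names makes the key independent of A/B order.
        key = (date, tuple(sorted((team_a, team_b))))
        if key not in seen:
            seen.add(key)
            kept.append((date, team_a, team_b))
    return kept


matches = [
    ("24-May-1983", "India", "Australia"),
    ("24-May-1983", "Australia", "India"),
    ("25-May-1983", "Team X", "Team Y"),  # made-up row for illustration
]
cleaned = drop_mirrored_matches(matches)
print(cleaned)
```

Set membership checks take constant time on average, so the whole pass is roughly linear in the number of records. As in the pandas version, the first occurrence of each match is the one that is kept.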
