Comparing one record with all others to remove duplicates - Python or R
Problem description
I have a data set containing all the World Cup matches, with the columns DATE, Team A, Team B and a few others. The data set contains duplicates: for example, the India vs. Australia match appears as two records, shown below.
DATE         Team A     Team B
24-May-1983  India      Australia
24-May-1983  Australia  India
I can remove the duplicate records with two nested for loops in Python, but that is inefficient: it needs N * M comparisons and many if conditions. Is there an efficient way to do this in Python or R?
Thanks in advance.
Recommended answer
Something like this should work. You just need to put the two teams in alphabetical order, so that it doesn't matter which one was recorded as Team A and which as Team B:
df['team_tuple'] = df.apply(
    lambda row: tuple(
        sorted((row['Team A'], row['Team B']))
    ),
    axis='columns'
)
df
Out[17]:
          DATE     Team A     Team B          team_tuple
0  24-May-1983      India  Australia  (Australia, India)
1  24-May-1983  Australia      India  (Australia, India)
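# Flag every row whose (DATE, team_tuple) pair already appeared in an earlier row
duplicates = df.loc[:, ['DATE', 'team_tuple']].duplicated()
# Keep only the rows that were not flagged
cleaned_df = df.loc[~duplicates, :]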
In [16]: cleaned_df
Out[16]:
          DATE  Team A     Team B          team_tuple
0  24-May-1983   India  Australia  (Australia, India)
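If pandas is not available, the same idea (an order-independent key per match) can be sketched in plain Python with a set, which needs only a single pass over the rows instead of N * M comparisons. This is a minimal sketch rather than part of the original answer: the function name `drop_mirrored_duplicates`, the tuple layout of the rows, and the third sample row are assumptions made for illustration.

```python
# Hypothetical sample rows as (date, team_a, team_b).
# The first two come from the question; the third is made up to show
# that a rematch on a different date is NOT treated as a duplicate.
matches = [
    ("24-May-1983", "India", "Australia"),
    ("24-May-1983", "Australia", "India"),
    ("25-May-1983", "India", "Australia"),
]


def drop_mirrored_duplicates(rows):
    """Keep the first row seen for each (date, unordered team pair)."""
    seen = set()
    kept = []
    for date, team_a, team_b in rows:
        # Sorting the two names makes the key independent of which team
        # was recorded as Team A and which as Team B.
        key = (date, tuple(sorted((team_a, team_b))))
        if key not in seen:
            seen.add(key)
            kept.append((date, team_a, team_b))
    return kept


cleaned = drop_mirrored_duplicates(matches)
```

Set membership checks take constant time on average, so the whole pass is roughly linear in the number of rows. As in the pandas version, the first occurrence of each match is the one that is kept.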