Removing duplicate values row-wise in R

Question

I am working with a dataset in R, and I have a problem that I can't seem to figure out. My data currently looks like this:

Team     Person1   Person2   Person3   Person4   Person5   Person6   Person7
6594794  37505959  37469784  NA        NA        NA        NA        NA
6595053  30113392  33080042  21537147  32293683  NA        NA        NA
6595201  697417    22860111  NA        NA        NA        NA        NA
6595380  24432987  32370372  11521625  362790    24432987  22312802  32432267
6595382  12317669  25645492  NA        NA        NA        NA        NA
6595444  8114419   236357    32545314  22247108  NA        NA        NA
6595459  2135269   32332907  32332907  32436550  NA        NA        NA
6595468  33590928  10905322  32319555  10439608  NA        NA        NA
6595485  33080810  33162061  NA        NA        NA        NA        NA
6595496  36901773  34931641  NA        NA        NA        NA        NA
6595523  512193    8747403   NA        NA        NA        NA        NA
6595524  32393404  113514    NA        NA        NA        NA        NA
6595526  37855554  37855512  NA        NA        NA        NA        NA
6595536  18603977  1882599   332261    10969771  712339    2206680   768785

The columns span all the way to "Person24".

What I've realized is that some teams have the same person listed more than once. So, I need to figure out a way to identify teams where at least one person's ID number is listed more than once, and either create a complete list of all these team IDs, or simply remove these teams from the dataset.

For example, team #6595380 (4th row) has a repeat member: person #24432987 appears in both the Person1 column and the Person5 column. Another example is team #6595459 (7th row), where person #32332907 appears in both the Person2 column and the Person3 column. So I am looking for a way either to flag teams with occurrences like this, or simply to remove them from the dataset.

Answer

You could use apply to check each row for repeated IDs, ignoring NA values:

dat$dups <- apply(dat[-1], 1, function(i) any(duplicated(i[!is.na(i)])))

Or, as Simon O'Hanlon pointed out in the comments:

dat$dups <- apply(dat[-1], 1, function(i) any(duplicated(i, incomparables = NA)))
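
As a minimal sketch of how the two variants behave, the following builds a small data frame from four rows of the sample data and checks that both flag the same rows. The name dat follows the answer; keeping only Person1 through Person5 and the names dups_a and dups_b are assumptions made for brevity.

```r
# Illustrative subset of the question's data (only Person1..Person5 kept)
dat <- data.frame(
  Team    = c(6594794, 6595380, 6595459, 6595468),
  Person1 = c(37505959, 24432987, 2135269, 33590928),
  Person2 = c(37469784, 32370372, 32332907, 10905322),
  Person3 = c(NA, 11521625, 32332907, 32319555),
  Person4 = c(NA, 362790, 32436550, 10439608),
  Person5 = c(NA, 24432987, NA, NA)
)

# Variant 1: drop NAs before looking for repeats within each row
dups_a <- apply(dat[-1], 1, function(i) any(duplicated(i[!is.na(i)])))

# Variant 2: keep NAs, but tell duplicated() never to count them as matches
dups_b <- apply(dat[-1], 1, function(i) any(duplicated(i, incomparables = NA)))

# Rows 2 and 3 (teams 6595380 and 6595459) are the ones flagged
print(unname(dups_a))

# Both variants should agree on every row
print(identical(dups_a, dups_b))
```

Either way the NA padding has to be handled explicitly: a plain duplicated(i) would treat the second NA in a row as a repeat of the first and flag almost every team.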

You could then use this to either find the team numbers that have duplicates or to exclude them:

# Return teams that have duplicate person ids
dat$Team[ dat$dups ]
# Exclude rows with duplicates
dat[ ! dat$dups , ]
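
The flag-then-filter steps can be sketched end to end on a few rows of the sample data. This is only an illustration under stated assumptions: the data is cut down to three Person columns, and the helper names person_cols, dup_teams, and clean are hypothetical. Selecting the person columns by name rather than with dat[-1] is a design choice here, so that re-running the check after dups has been added does not scan the flag column as well.

```r
# Illustrative subset of the question's data (only Person1..Person3 kept)
dat <- data.frame(
  Team    = c(6595201, 6595459, 6595485),
  Person1 = c(697417, 2135269, 33080810),
  Person2 = c(22860111, 32332907, 33162061),
  Person3 = c(NA, 32332907, NA)
)

# Pick the person columns by name so a later-added flag column is never scanned
person_cols <- grep("^Person", names(dat), value = TRUE)

# TRUE for rows where a non-NA person ID occurs more than once
dat$dups <- apply(dat[person_cols], 1, function(i) any(duplicated(i[!is.na(i)])))

# Team IDs with at least one repeated person ID
dup_teams <- dat$Team[dat$dups]

# Data without those teams, with the helper column dropped again
clean <- dat[!dat$dups, names(dat) != "dups"]

print(dup_teams)
print(clean)
```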
