Conditionally removing duplicates

Question

I have a dataset in which I need to conditionally remove duplicated rows based on values in another column.

Specifically, I need to delete any row where size = 0, but only if its SampleID is duplicated.

SampleID <- c("a", "a", "b", "b", "b", "c", "d", "d", "e")
size <- c(0, 1, 1, 2, 3, 0, 0, 1, 0)
data <- data.frame(SampleID, size)

I want to delete these rows:

SampleID    size
a           0
d           0

and keep:

SampleID   size
a          1
b          1
b          2
b          3
c          0
d          1
e          0

Note: the actual dataset is very large, so I am not looking for a way to remove known rows by row number.

Recommended answer

Using the data.table framework, first convert your data frame to a data.table:

library(data.table)  # fails loudly if the package is missing
setDT(data)          # converts the data.frame to a data.table by reference

Build the list of IDs whose zero-size rows can be deleted, i.e. the IDs that have at least one non-zero row:

dropable_ids = unique(data[size != 0, SampleID])

Finally, keep the rows whose ID is not in the droppable list or whose size is non-zero:

data = data[!(SampleID %in% dropable_ids & size == 0), ]
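Put together, the steps above can be run end to end on the sample data from the question. This is a minimal sketch that assumes the data.table package is installed; the name `result` is introduced here for illustration only and is not part of the original answer.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
data <- data.table(
  SampleID = c("a", "a", "b", "b", "b", "c", "d", "d", "e"),
  size     = c(0, 1, 1, 2, 3, 0, 0, 1, 0)
)

# IDs that have at least one non-zero row; only their zero rows may be dropped
dropable_ids <- unique(data[size != 0, SampleID])

# Drop a row only when its ID is droppable AND its size is zero
result <- data[!(SampleID %in% dropable_ids & size == 0)]
print(result)
```

With this logic, an ID whose rows are all zero (such as "c" or "e" here) keeps every one of its rows, even if the ID appears more than once.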

Please note that !(a & b) is equivalent to !a | !b, but the data.table framework doesn't handle | as well.
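By De Morgan's law, the negated conjunction `!(a & b)` is the same filter as `!a | !b`. The quick check below uses plain base R logical vectors on the question's sample data; the names `a`, `b`, `keep_negated` and `keep_or` are illustrative only.

```r
SampleID <- c("a", "a", "b", "b", "b", "c", "d", "d", "e")
size     <- c(0, 1, 1, 2, 3, 0, 0, 1, 0)

# a: the row's ID has at least one non-zero row; b: the row's size is zero
a <- SampleID %in% unique(SampleID[size != 0])
b <- size == 0

keep_negated <- !(a & b)   # form used in the answer
keep_or      <- !a | !b    # De Morgan equivalent

# Both masks select exactly the same rows
print(identical(keep_negated, keep_or))
print(data.frame(SampleID, size)[keep_negated, ])
```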

Hope this helps.
