R data.table duplicate rows with a pair of columns
Problem description
data.table is very useful, but I could not find an elegant way to solve the following problem. There are some similar answers out there, but none solved my problem. Let's say the table below is a data.table object, and I want to filter out duplicate rows based on the gene pair (Gene1 and Gene2), treating a pair as the same regardless of order (EGFR/CD4 is the same pair as CD4/EGFR).
   Gene1 Gene2          Ens.ID.1         Ens.ID.2      CORR
1: FOXA1   MYC ENSG000000129.13. ENSG000000129.11 0.9953311
2:  EGFR   CD4     ENSG000000129 ENSG000000129.12 0.9947215
3:   CD4  EGFR  ENSG000000129.12 ENSG000000129.11 0.9940735
4:  EGFR   CD4     ENSG000000129 ENSG000000129.12 0.9947215
If there are such duplicates with respect to Gene1 and Gene2, I want to keep only one row per pair and get this:
   Gene1 Gene2          Ens.ID.1         Ens.ID.2      CORR
1: FOXA1   MYC ENSG000000129.13. ENSG000000129.11 0.9953311
2:  EGFR   CD4     ENSG000000129 ENSG000000129.12 0.9947215
Standard code is very slow over millions of rows. Is there an elegant and fast way of doing this in data.table?
Recommended answer
The linked answer (https://stackoverflow.com/a/25151395/496803) is nearly a duplicate, and so is https://stackoverflow.com/a/25298863/496803, but here it goes again, with a slight twist. pmin and pmax put each pair into a fixed (sorted) order, so duplicated sees EGFR/CD4 and CD4/EGFR as the same key:
dt[!duplicated(data.table(pmin(Gene1, Gene2), pmax(Gene1, Gene2)))]
#    Gene1 Gene2          Ens.ID.1         Ens.ID.2      CORR
# 1: FOXA1   MYC ENSG000000129.13. ENSG000000129.11 0.9953311
# 2:  EGFR   CD4     ENSG000000129 ENSG000000129.12 0.9947215
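For anyone who wants to try this end to end, here is a minimal reproducible sketch. The table is rebuilt by hand from the sample rows in the question; the names key_pairs, lo, hi and result are illustrative and not part of the original answer, and the gene columns are assumed to be character vectors (pmin and pmax do not work on unordered factors).

```r
library(data.table)

# Sample data copied from the question
dt <- data.table(
  Gene1    = c("FOXA1", "EGFR", "CD4", "EGFR"),
  Gene2    = c("MYC", "CD4", "EGFR", "CD4"),
  Ens.ID.1 = c("ENSG000000129.13.", "ENSG000000129", "ENSG000000129.12", "ENSG000000129"),
  Ens.ID.2 = c("ENSG000000129.11", "ENSG000000129.12", "ENSG000000129.11", "ENSG000000129.12"),
  CORR     = c(0.9953311, 0.9947215, 0.9940735, 0.9947215)
)

# Order-independent key: the alphabetically smaller and larger gene of each pair
# (lo and hi are illustrative column names)
key_pairs <- dt[, .(lo = pmin(Gene1, Gene2), hi = pmax(Gene1, Gene2))]

# Keep the first row seen for each unordered pair
result <- dt[!duplicated(key_pairs)]
print(result)
```

Because duplicated keeps the first occurrence, the surviving row for each pair is whichever one appears first in dt; sort dt beforehand if a different row (for example the one with the highest CORR) should be kept.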
If you have more than two keys to de-duplicate by, you are probably best off converting to a long format, sorting, converting back to a wide format, and then de-duplicating. Like so:
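dupvars <- c("Gene1", "Gene2")
# Melt the key columns to long form, rank the values within each row (grp),
# cast back to wide so each row's keys appear in sorted order, then de-duplicate.
sel <- !duplicated(
  dcast(
    melt(dt[, c(.SD, id = .(.I)), .SDcols = dupvars], id.vars = "id")[
      order(id, value), grp := seq_len(.N), by = id],
    id ~ grp,
    value.var = "value"
  )[, -1])
dt[sel, ]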
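The same long/sort/wide idea can be unrolled step by step, which may be easier to follow with more than two keys. The sketch below uses a made-up table dt3 with a hypothetical third key column Gene3 and a score column; none of these come from the question. It also swaps the in-place order(...) subset for setorder, which is a small variation on the code above rather than the original answer's exact form.

```r
library(data.table)

# Hypothetical toy data with three key columns (not from the question)
dt3 <- data.table(
  Gene1 = c("A", "B", "C", "A"),
  Gene2 = c("B", "C", "A", "D"),
  Gene3 = c("C", "A", "B", "E"),
  score = c(1, 2, 3, 4)
)
dupvars <- c("Gene1", "Gene2", "Gene3")

# Long form: one row per (original row id, key value)
long <- melt(dt3[, c(.SD, id = .(.I)), .SDcols = dupvars], id.vars = "id")

# Sort the key values within each original row, then number them 1..k
setorder(long, id, value)
long[, grp := seq_len(.N), by = id]

# Wide form again: columns 1..k now hold each row's keys in sorted order
wide <- dcast(long, id ~ grp, value.var = "value")

# Drop the id column and flag rows whose sorted key set was already seen
sel <- !duplicated(wide[, -1])
result3 <- dt3[sel]
print(result3)
```

Rows 1 to 3 of dt3 contain the same three genes in different orders, so only the first of them is kept along with row 4.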