R data.table duplicate rows with a pair of columns
Question
data.table is very useful, but I could not find an elegant way to solve the following problem. There are some close answers out there, but none solved my problem. Let's say the table below is a data.table object, and I want to filter out duplicate rows based on the gene pair (Gene1 and Gene2), regardless of the order in which the two genes appear.
Gene1 Gene2 Ens.ID.1 Ens.ID.2 CORR
1: FOXA1 MYC ENSG000000129.13. ENSG000000129.11 0.9953311
2: EGFR CD4 ENSG000000129 ENSG000000129.12 0.9947215
3: CD4 EGFR ENSG000000129.12 ENSG000000129.11 0.9940735
4: EGFR CD4 ENSG000000129 ENSG000000129.12 0.9947215
If there are such duplicates with respect to Gene1 and Gene2, then I want to get this:
Gene1 Gene2 Ens.ID.1 Ens.ID.2 CORR
1: FOXA1 MYC ENSG000000129.13. ENSG000000129.11 0.9953311
2: EGFR CD4 ENSG000000129 ENSG000000129.12 0.9947215
Standard code is very slow over millions of rows. Is there an elegant and fast way of doing this in data.table?
Recommended answer
The linked answer (https://stackoverflow.com/a/25151395/496803) is nearly a duplicate, and so is https://stackoverflow.com/a/25298863/496803, but here goes again, with a slight twist:
dt[!duplicated(data.table(pmin(Gene1,Gene2),pmax(Gene1,Gene2)))]
# Gene1 Gene2 Ens.ID.1 Ens.ID.2 CORR
#1: FOXA1 MYC ENSG000000129.13. ENSG000000129.11 0.9953311
#2: EGFR CD4 ENSG000000129 ENSG000000129.12 0.9947215
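For anyone who wants to try this, here is a minimal reproducible sketch. The table is rebuilt by hand from the rows in the question, and the names `pair_key`, `lo`, `hi` and `result` are only illustrative. It assumes Gene1 and Gene2 are character columns; factor columns would likely need `as.character()` first, since `pmin`/`pmax` are not meaningful for unordered factors.

```r
library(data.table)

# Sample data copied from the question
dt <- data.table(
  Gene1    = c("FOXA1", "EGFR", "CD4", "EGFR"),
  Gene2    = c("MYC", "CD4", "EGFR", "CD4"),
  Ens.ID.1 = c("ENSG000000129.13.", "ENSG000000129",
               "ENSG000000129.12", "ENSG000000129"),
  Ens.ID.2 = c("ENSG000000129.11", "ENSG000000129.12",
               "ENSG000000129.11", "ENSG000000129.12"),
  CORR     = c(0.9953311, 0.9947215, 0.9940735, 0.9947215)
)

# Order-independent key: pmin/pmax put each pair into a fixed order,
# so (EGFR, CD4) and (CD4, EGFR) map to the same key
pair_key <- dt[, .(lo = pmin(Gene1, Gene2), hi = pmax(Gene1, Gene2))]

# Keep only the first row seen for each unordered pair
result <- dt[!duplicated(pair_key)]
print(result)
```

Because `pmin`, `pmax` and `duplicated` are all vectorised, nothing here loops over rows in R, which is why this should scale to millions of rows far better than a row-by-row comparison.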
If you have more than two keys to dedup by, you are probably best off converting to a long file, sorting, converting back to a wide file, and then de-duplicating. Like so:
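dupvars <- c("Gene1","Gene2")
# Sort the key columns within each row via melt/dcast,
# then flag rows whose sorted keys have already been seen
sel <- !duplicated(
  dcast(
    melt(dt[, c(.SD, id=.(.I)), .SDcols=dupvars], id.vars="id")[
      order(id, value), grp := seq_len(.N), by=id],
    id ~ grp
  )[, -1])  # drop the id column before comparing
dt[sel,]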
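The same long/sort/wide idea can be unrolled into separate steps, which may be easier to follow and debug. This is only a sketch on a made-up three-key table: `toy`, its columns `A`, `B`, `C`, `val`, and the intermediate names `long` and `wide` are hypothetical and not from the question.

```r
library(data.table)

# Hypothetical toy table: rows 1-3 hold the same three keys in different orders
toy <- data.table(
  A   = c("x", "y", "z", "p"),
  B   = c("y", "z", "x", "q"),
  C   = c("z", "x", "y", "r"),
  val = 1:4
)
dupvars <- c("A", "B", "C")

# 1. Long format: one row per (original row id, key value)
long <- melt(toy[, c(.(id = .I), .SD), .SDcols = dupvars], id.vars = "id")

# 2. Sort the key values within each original row and number them
setorder(long, id, value)
long[, grp := seq_len(.N), by = id]

# 3. Back to wide format: columns "1", "2", "3" now hold the sorted keys
wide <- dcast(long, id ~ grp, value.var = "value")
wide[, id := NULL]

# 4. Rows with identical sorted keys are duplicates; keep the first of each
sel <- !duplicated(wide)
print(toy[sel])
```

Passing `value.var = "value"` explicitly avoids relying on `dcast` guessing which column to spread.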