R data.table duplicate rows with a pair of columns

Question

data.table is very useful, but I could not find an elegant way to solve the following problem. There are some close answers out there, but none solved my problem. Let's say the below is the data.table object, and I want to filter out duplicate rows based on the gene pairs (Gene1 and Gene2), regardless of the order of the pair.

     Gene1    Gene2     Ens.ID.1              Ens.ID.2             CORR
1:   FOXA1    MYC       ENSG000000129.13.     ENSG000000129.11     0.9953311
2:   EGFR     CD4       ENSG000000129         ENSG000000129.12     0.9947215
3:   CD4      EGFR      ENSG000000129.12      ENSG000000129.11     0.9940735
4:   EGFR     CD4       ENSG000000129         ENSG000000129.12     0.9947215 

If there are such duplicates with respect to Gene1 and Gene2, then I want to get this:

     Gene1    Gene2     Ens.ID.1              Ens.ID.2             CORR
1:   FOXA1    MYC       ENSG000000129.13.     ENSG000000129.11     0.9953311
2:   EGFR     CD4       ENSG000000129         ENSG000000129.12     0.9947215

It is very slow with standard coding over millions of rows. Is there an elegant and fast way of doing this in data.table?

Recommended answer

The linked answer (https://stackoverflow.com/a/25151395/496803) is nearly a duplicate, and so is https://stackoverflow.com/a/25298863/496803, but here goes again, with a slight twist:

dt[!duplicated(data.table(pmin(Gene1,Gene2),pmax(Gene1,Gene2)))]

#   Gene1 Gene2          Ens.ID.1         Ens.ID.2      CORR
#1: FOXA1   MYC ENSG000000129.13. ENSG000000129.11 0.9953311
#2:  EGFR   CD4     ENSG000000129 ENSG000000129.12 0.9947215
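
For anyone who wants to try the one-liner above, here is a minimal reproducible sketch. The table is retyped from the question. The intermediate names `key_dt`, `lo`, `hi` and `res` are introduced here for illustration only, and the sketch assumes the gene columns are character vectors (factor columns would likely need `as.character()` first).

```r
library(data.table)

# Toy data retyped from the question
dt <- data.table(
  Gene1    = c("FOXA1", "EGFR", "CD4", "EGFR"),
  Gene2    = c("MYC", "CD4", "EGFR", "CD4"),
  Ens.ID.1 = c("ENSG000000129.13.", "ENSG000000129",
               "ENSG000000129.12", "ENSG000000129"),
  Ens.ID.2 = c("ENSG000000129.11", "ENSG000000129.12",
               "ENSG000000129.11", "ENSG000000129.12"),
  CORR     = c(0.9953311, 0.9947215, 0.9940735, 0.9947215)
)

# Order-independent key: the alphabetically smaller gene always goes
# in `lo` and the larger one in `hi`, so (EGFR, CD4) and (CD4, EGFR)
# produce the same key row.
key_dt <- dt[, .(lo = pmin(Gene1, Gene2), hi = pmax(Gene1, Gene2))]

# Keep the first row seen for each unordered pair
res <- dt[!duplicated(key_dt)]
print(res)
```

Because `pmin()` and `pmax()` are vectorised over whole columns, no per-row loop is needed, which is what makes this approach suitable for large tables.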

If you have more than two keys to de-duplicate by, you are probably best off converting to a long file, sorting, converting back to a wide file, and then de-duplicating. Like so:

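dupvars <- c("Gene1","Gene2")
# Tag each row with its index, melt to long format, rank the key values
# within each row, then cast back to wide so the keys appear in sorted order.
sel <- !duplicated(
  dcast(
      melt(dt[, c(.SD,id=.(.I)), .SDcols=dupvars], id.vars="id")[
          order(id,value), grp := seq_len(.N), by=id],
      id ~ grp
  )[,-1])
# Keep only the first row of each unordered key combination.
dt[sel,]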
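
The long, sort, wide, de-duplicate idea can also be written out step by step, which may be easier to follow with three or more keys. This is only a sketch of the same technique, not the answer's exact code: the table `dt3`, its columns `K1` to `K3`, and the intermediate names `long` and `wide` are hypothetical, and all key columns are assumed to be character.

```r
library(data.table)

# Hypothetical three-key table; rows 1 and 2 hold the same keys in a
# different order, and so do rows 3 and 4.
dt3 <- data.table(
  K1  = c("x", "z", "x", "y"),
  K2  = c("y", "x", "y", "w"),
  K3  = c("z", "y", "w", "x"),
  val = c(1, 2, 3, 4)
)
dupvars <- c("K1", "K2", "K3")

# 1. Long format: one line per (row id, key value)
long <- melt(dt3[, c(.(id = .I), .SD), .SDcols = dupvars], id.vars = "id")

# 2. Sort the key values within each original row and rank them
setorder(long, id, value)
long[, grp := seq_len(.N), by = id]

# 3. Back to wide: columns "1", "2", "3" now hold each row's sorted keys
wide <- dcast(long, id ~ grp, value.var = "value")

# 4. Flag the first occurrence of each sorted key combination
sel <- !duplicated(wide[, !"id"])
print(dt3[sel])
```

Since `dcast()` returns one row per `id` in ascending order, `sel` lines up with the rows of `dt3` and can be used directly as a row filter.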
