R data.table duplicate rows with a pair of columns


Question


data.table is very useful, but I could not find an elegant way to solve the following problem. There are some close answers out there, but none solved my problem. Let's say the below is the data.table object, and I want to filter duplicate rows based on the gene pair (Gene1 and Gene2), treating the pair as unordered, so (CD4, EGFR) and (EGFR, CD4) count as the same pair.

     Gene1    Gene2     Ens.ID.1              Ens.ID.2             CORR
1:   FOXA1    MYC       ENSG000000129.13.     ENSG000000129.11     0.9953311
2:   EGFR     CD4       ENSG000000129         ENSG000000129.12     0.9947215
3:   CD4      EGFR      ENSG000000129.12      ENSG000000129.11     0.9940735
4:   EGFR     CD4       ENSG000000129         ENSG000000129.12     0.9947215 


If there are such duplicates with respect to Gene1 and Gene2, then I want to get this:

     Gene1    Gene2     Ens.ID.1              Ens.ID.2             CORR
1:   FOXA1    MYC       ENSG000000129.13.     ENSG000000129.11     0.9953311
2:   EGFR     CD4       ENSG000000129         ENSG000000129.12     0.9947215


It is very slow with standard code over millions of rows. Is there an elegant and fast way of doing this in data.table?
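
For anyone who wants to try the answers below, here is a minimal sketch that rebuilds the example table above as a data.table (values copied from the question; the object name dt is simply what the answer code uses):

library(data.table)

# Rebuild the example data shown in the question
dt <- data.table(
  Gene1    = c("FOXA1", "EGFR", "CD4", "EGFR"),
  Gene2    = c("MYC",   "CD4",  "EGFR", "CD4"),
  Ens.ID.1 = c("ENSG000000129.13.", "ENSG000000129", "ENSG000000129.12", "ENSG000000129"),
  Ens.ID.2 = c("ENSG000000129.11", "ENSG000000129.12", "ENSG000000129.11", "ENSG000000129.12"),
  CORR     = c(0.9953311, 0.9947215, 0.9940735, 0.9947215)
)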

Answer


The linked answer (https://stackoverflow.com/a/25151395/496803) is nearly a duplicate, and so is https://stackoverflow.com/a/25298863/496803, but here goes again, with a slight twist:

dt[!duplicated(data.table(pmin(Gene1,Gene2),pmax(Gene1,Gene2)))]

#   Gene1 Gene2          Ens.ID.1         Ens.ID.2      CORR
#1: FOXA1   MYC ENSG000000129.13. ENSG000000129.11 0.9953311
#2:  EGFR   CD4     ENSG000000129 ENSG000000129.12 0.9947215
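
The twist is that pmin() and pmax() put each pair into a canonical order, so (EGFR, CD4) and (CD4, EGFR) produce the same two-column key and duplicated() catches them. A rough sketch of that intermediate key (the column names first/last are just illustrative):

# Canonical pair key: the alphabetically smaller gene always lands in the first column
key <- data.table(first = pmin(dt$Gene1, dt$Gene2), last = pmax(dt$Gene1, dt$Gene2))
key
#    first  last
# 1: FOXA1   MYC
# 2:   CD4  EGFR
# 3:   CD4  EGFR
# 4:   CD4  EGFR
# duplicated(key) is TRUE for rows 3 and 4, so dt[!duplicated(...)] keeps only rows 1 and 2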


If you have more than two key columns to dedupe by, you are probably best off converting to long format, sorting within each row, casting back to wide format, and then de-duplicating. Like so:

dupvars <- c("Gene1","Gene2")

# Melt the key columns to long format (one value per row id), rank the values
# within each original row, then cast back to wide so that every row's key
# values appear in sorted order regardless of which column they started in.
# Dropping the id column leaves only the sorted keys, and duplicated() flags the repeats.
sel <- !duplicated(
  dcast(
      melt(dt[, c(.SD, id = .(.I)), .SDcols = dupvars], id.vars = "id")[
          order(id, value), grp := seq_len(.N), by = id],
      id ~ grp
  )[, -1])
dt[sel, ]
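
For the two-column example above, the intermediate wide table should look roughly like this (the grp columns 1 and 2 hold each row's key values in sorted order):

#    id     1     2
# 1:  1 FOXA1   MYC
# 2:  2   CD4  EGFR
# 3:  3   CD4  EGFR
# 4:  4   CD4  EGFR
# Dropping the id column and calling duplicated() then flags rows 3 and 4,
# so dt[sel, ] returns the same two rows as the pmin()/pmax() one-liner.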

