Deleting reversed duplicates with R

Problem description

I have a data frame in R that contains the gene ids of paralogous genes in Arabidopsis, looking something like this:

gene_x    gene_y
AT1       AT2
AT3       AT4
AT1       AT2
AT1       AT3
AT2       AT1

Here each 'ATx' stands for a gene name.

Now, for downstream analysis, I want to continue with only the unique pairs. Some pairs are simple duplicates and can be removed easily with the duplicated() function. However, the fifth row in the artificial data frame above is also a duplicate, just in reversed order, and it is picked up by neither duplicated() nor unique().

Any ideas on how to remove these rows?
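
The behaviour described in the question can be reproduced with a short check. This is only an illustrative sketch: the data frame is rebuilt with data.frame() rather than read from text, and the names dup_flags and deduped are introduced here for illustration.

```r
# Recreate the example data frame from the question
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# duplicated() compares rows column by column, so only the exact repeat (row 3) is flagged
dup_flags <- duplicated(mydf)
print(dup_flags)

# Row 5 (AT2, AT1) is kept because it differs from row 1 (AT1, AT2) in both columns
deduped <- mydf[!dup_flags, ]
print(deduped)
```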

Recommended answer

mydf <- read.table(text="gene_x    gene_y
AT1       AT2
AT3       AT4
AT1       AT2
AT1       AT3
AT2       AT1", header=TRUE, stringsAsFactors=FALSE)

Here's one strategy using apply, sort, paste, and duplicated:

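# Sort the two IDs within each row, join them into one key, and keep the first row per key.
# A separator is used so that two different pairs cannot collapse into the same string.
mydf[!duplicated(apply(mydf, 1, function(x) paste(sort(x), collapse = "_"))), ]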
  gene_x gene_y
1    AT1    AT2
2    AT3    AT4
4    AT1    AT3
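
To see why this one-liner works, it may help to split it into its intermediate steps. The sketch below is an unpacked version of the same idea, not a different method. The names pair_key, is_repeat and unique_pairs are made up for illustration, and the "_" separator is an assumption of this sketch (it presumes the gene IDs themselves contain no underscore).

```r
# Recreate the example data frame
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# Step 1: sort the two IDs within each row and join them into a single key,
# so that (AT1, AT2) and (AT2, AT1) both become "AT1_AT2"
pair_key <- apply(mydf, 1, function(x) paste(sort(x), collapse = "_"))
print(pair_key)

# Step 2: flag every key that has already appeared in an earlier row
is_repeat <- duplicated(pair_key)

# Step 3: keep only the first occurrence of each unordered pair
unique_pairs <- mydf[!is_repeat, ]
print(unique_pairs)
```

Because the key is built from the sorted IDs, the order in which a pair was written no longer matters, and duplicated() then treats the reversed row like any other repeat.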

And here's a slightly different solution:

mydf[!duplicated(lapply(as.data.frame(t(mydf), stringsAsFactors=FALSE), sort)),]
  gene_x gene_y
1    AT1    AT2
2    AT3    AT4
4    AT1    AT3
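
For larger tables, a vectorized variant of the same idea might be worth trying. The sketch below is not part of the original answer. It assumes both columns are character vectors (not factors) and uses base R's pmin() and pmax() to put each pair into a canonical order before calling duplicated(). The names first_id, second_id and unique_pairs are hypothetical.

```r
# Recreate the example data frame
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# Element-wise smaller and larger ID of each pair (string comparison)
first_id  <- pmin(mydf$gene_x, mydf$gene_y)
second_id <- pmax(mydf$gene_x, mydf$gene_y)

# Rows whose ordered (first_id, second_id) pair was already seen are dropped
unique_pairs <- mydf[!duplicated(data.frame(first_id, second_id)), ]
print(unique_pairs)
```

This avoids the row-by-row apply() call and needs no separator, since the two ordered IDs are compared as separate columns rather than as one pasted string.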
