R: finding duplicates in one column and collapsing in a second column


Problem description

I have a data frame with two columns containing character strings. In one column (named probes) I have duplicated cases (that is, several rows with the same character string). For each case in probes I want to find all the rows containing the same string, and then merge the values of all the corresponding rows in the second column (named genes) into a single row. For example, if I have this structure:

    probes  genes
1   cg00050873  TSPY4
2   cg00061679  DAZ1
3   cg00061679  DAZ4
4   cg00061679  DAZ4

I want to change it to this structure:

    probes  genes
1   cg00050873  TSPY4
2   cg00061679  DAZ1 DAZ4 DAZ4

Obviously there is no problem doing this for a single probe using which, and then paste with collapse:

ind<-which(olap$probes=="cg00061679")
genename<-(olap[ind,2])
genecomb<-paste(genename[1:length(genename)], collapse=" ")

But I'm not sure how to extract the indices of the duplicates in the probes column across the whole data frame. Any ideas?

Thanks in advance

Solution

You can use tapply in base R:

data.frame(probes=unique(olap$probes), 
           genes=tapply(olap$genes, olap$probes, paste, collapse=" "))
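To try this on the sample data from the question, here is a minimal sketch. The way `olap` is built below is an assumption (the question does not show how the data frame was created), and `stringsAsFactors = FALSE` is set only to keep both columns as plain character vectors.

```r
# Rebuild the example data from the question (this construction is assumed).
olap <- data.frame(
  probes = c("cg00050873", "cg00061679", "cg00061679", "cg00061679"),
  genes  = c("TSPY4", "DAZ1", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

# tapply groups genes by probe and pastes each group into one string.
collapsed <- data.frame(
  probes = unique(olap$probes),
  genes  = tapply(olap$genes, olap$probes, paste, collapse = " ")
)
print(collapsed)
```

With this input the probes already first appear in sorted order, so the labels from unique line up with the groups from tapply.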

or use plyr:

library(plyr)
ddply(olap, "probes", summarize, genes = paste(genes, collapse=" "))

UPDATE

It's probably safer in the first version to do this:

tmp <- tapply(olap$genes, olap$probes, paste, collapse=" ")
data.frame(probes=names(tmp), genes=tmp)

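This guards against unique returning the probes in a different order from tapply: unique keeps the order of first appearance, while tapply orders its result by the sorted group labels. Personally I would always use ddply.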
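The ordering concern can be seen with a small sketch. The data below is hypothetical (it is not from the question) and was chosen so that the probes do not first appear in sorted order.

```r
# Hypothetical data: "cg2" appears before "cg1".
olap <- data.frame(
  probes = c("cg2", "cg1", "cg2"),
  genes  = c("B1", "A1", "B2"),
  stringsAsFactors = FALSE
)

# unique keeps the order of first appearance: "cg2" "cg1"
first_seen <- unique(olap$probes)

# tapply orders its result by the sorted group labels: "cg1" "cg2"
tmp <- tapply(olap$genes, olap$probes, paste, collapse = " ")
sorted_labels <- names(tmp)

# Taking the labels from the tapply result keeps each probe
# next to its own collapsed genes.
safe <- data.frame(probes = names(tmp), genes = tmp)
print(safe)
```

Pairing `first_seen` with `tmp` here would attach the genes of "cg1" to "cg2" and vice versa, which is why the labels are taken from `names(tmp)` instead.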
