Duplicate entry pooling while averaging values in neighbouring columns


Problem description

I'm trying to do some complicated indexing while also averaging, pooling, and taking the min and max. To start, here is an example data.frame:

ID      ID2         probe       chrom   strand txStart  txEnd
Rest_3  uc001aah.4  8044649     chr1    0      14361    29370
Rest_4  uc001aah.4  7911309     chr1    0      14361    29370    
Rest_5  uc001aah.4  8171066     chr1    0      14361    29370           
Rest_6  uc001aah.4  8159790     chr1    0      14361    29370   

Rest_17 uc001abw.1  7896761     chr1    0      861120   879961
Rest_18 uc001abx.1  7896761     chr1    0      871151   879961
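
To make the example reproducible, the table above can be rebuilt as a data.frame. This is only a sketch: the name `df` and the choice of integer columns are assumptions, since the question does not show how the data was loaded.

```r
# Rebuild the example table; column types are assumed, not taken from the question.
df <- data.frame(
  ID      = c("Rest_3", "Rest_4", "Rest_5", "Rest_6", "Rest_17", "Rest_18"),
  ID2     = c("uc001aah.4", "uc001aah.4", "uc001aah.4", "uc001aah.4",
              "uc001abw.1", "uc001abx.1"),
  probe   = c(8044649L, 7911309L, 8171066L, 8159790L, 7896761L, 7896761L),
  chrom   = "chr1",   # recycled to all six rows
  strand  = 0L,       # recycled to all six rows
  txStart = c(14361L, 14361L, 14361L, 14361L, 861120L, 871151L),
  txEnd   = c(29370L, 29370L, 29370L, 29370L, 879961L, 879961L),
  stringsAsFactors = FALSE
)
```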

I used duplicated() to find the duplicates in the ID2 column: uc001aah.4 appears 4 times. What I need, and don't know how to do, is to keep only a single entry for uc001aah.4 and pool the entries of the probe column (plus some others) into a single cell (in Excel terms), e.g. 8044649, 7911309, 8171066, 8159790, so that in the end it looks like this:

ID                              ID2         probe                                   chrom   strand txStart  txEnd
Rest_3, Rest_4, Rest_5, Rest_6  uc001aah.4  8044649, 7911309, 8171066, 8159790      chr1    0      14361    29370
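
One detail worth keeping in mind when counting duplicates: base R's duplicated() flags only the second and later occurrences of a value, not the first. The sketch below uses a hypothetical vector `id2` holding the ID2 values from the example to show the difference.

```r
# ID2 values copied from the example table (the vector name is illustrative).
id2 <- c("uc001aah.4", "uc001aah.4", "uc001aah.4", "uc001aah.4",
         "uc001abw.1", "uc001abx.1")

# duplicated() marks repeats only, so the first uc001aah.4 is not flagged.
later_dups <- duplicated(id2)
sum(later_dups)   # 3

# Scanning from both ends flags every member of a duplicated group.
all_dups <- duplicated(id2) | duplicated(id2, fromLast = TRUE)
sum(all_dups)     # 4

# table() gives the number of occurrences per value directly.
counts <- table(id2)
```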

The probe column contains duplicates as well:

ID      ID2         probe       chrom   strand txStart  txEnd
Rest_17 uc001abw.1  7896761     chr1    0      861120   879961
Rest_18 uc001abx.1  7896761     chr1    0      871151   879961

Here I need ID and ID2 to be pooled while taking the minimum of column txStart and the maximum of column txEnd, so that I end up with:

ID                  ID2                     probe       chrom   strand txStart  txEnd
Rest_17, Rest_18    uc001abw.1, uc001abx.1  7896761     chr1    0      861120   879961

I know this is asking a lot, but if you just show me how to do this for the first problem, I'm sure I can figure out how to apply it to the second.

Recommended answer

A solution using data.table:

library(data.table)
# df is the example data.frame from the question
dt <- data.table(df)
dt
#         ID        ID2   probe chrom strand txStart  txEnd
# 1:  Rest_3 uc001aah.4 8044649  chr1      0   14361  29370
# 2:  Rest_4 uc001aah.4 7911309  chr1      0   14361  29370
# 3:  Rest_5 uc001aah.4 8171066  chr1      0   14361  29370
# 4:  Rest_6 uc001aah.4 8159790  chr1      0   14361  29370
# 5: Rest_17 uc001abw.1 7896761  chr1      0  861120 879961
# 6: Rest_18 uc001abx.1 7896761  chr1      0  871151 879961

# step 1: remove duplicate ID2 and concatenate ID and probe.
# Note: this assumes that rows with the same ID2 also share chrom,
# strand, txStart and txEnd. If not, modify this along the lines
# of step 2.
dt.out <- dt[, lapply(.SD, function(x) paste(x, collapse=",")), 
          by=c("ID2", "chrom", "strand", "txStart", "txEnd")]

#           ID2 chrom strand txStart  txEnd                          ID                           probe
# 1: uc001aah.4  chr1      0   14361  29370 Rest_3,Rest_4,Rest_5,Rest_6 8044649,7911309,8171066,8159790
# 2: uc001abw.1  chr1      0  861120 879961                     Rest_17                         7896761
# 3: uc001abx.1  chr1      0  871151 879961                     Rest_18                         7896761

# step 2: remove duplicate probes and concatenate the others; take min(txStart) and max(txEnd)
dt.out <- dt.out[, list(ID = paste(ID, collapse=","), ID2 = paste(ID2, collapse=","),
                        txStart = min(txStart), txEnd = max(txEnd)),
                 by=c("probe", "chrom", "strand")]

#                              probe chrom strand                          ID                   ID2 txStart  txEnd
# 1: 8044649,7911309,8171066,8159790  chr1      0 Rest_3,Rest_4,Rest_5,Rest_6            uc001aah.4   14361  29370
# 2:                         7896761  chr1      0             Rest_17,Rest_18 uc001abw.1,uc001abx.1  861120 879961
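
For readers without data.table, the same two steps can be approximated in base R. This is a swapped-in technique (aggregate() for step 1, split() plus lapply() for step 2), not the answer's method, and it is only a sketch: `step1`, `step2` and the helper `pool_group` are illustrative names, and the data.frame is rebuilt here from the example table so the block runs on its own.

```r
# Example data rebuilt from the question; column types are assumed.
df <- data.frame(
  ID      = c("Rest_3", "Rest_4", "Rest_5", "Rest_6", "Rest_17", "Rest_18"),
  ID2     = c("uc001aah.4", "uc001aah.4", "uc001aah.4", "uc001aah.4",
              "uc001abw.1", "uc001abx.1"),
  probe   = c(8044649L, 7911309L, 8171066L, 8159790L, 7896761L, 7896761L),
  chrom   = "chr1",
  strand  = 0L,
  txStart = c(14361L, 14361L, 14361L, 14361L, 861120L, 871151L),
  txEnd   = c(29370L, 29370L, 29370L, 29370L, 879961L, 879961L),
  stringsAsFactors = FALSE
)

# Step 1: one row per ID2 (plus the columns assumed constant within an ID2),
# pooling ID and probe into comma-separated strings.
step1 <- aggregate(
  df[c("ID", "probe")],
  by  = df[c("ID2", "chrom", "strand", "txStart", "txEnd")],
  FUN = paste, collapse = ","
)

# Hypothetical helper: collapse one group of rows into a single row,
# pooling ID and ID2 and keeping min(txStart) and max(txEnd).
pool_group <- function(g) {
  data.frame(
    probe   = g$probe[1],
    chrom   = g$chrom[1],
    strand  = g$strand[1],
    ID      = paste(g$ID, collapse = ","),
    ID2     = paste(g$ID2, collapse = ","),
    txStart = min(g$txStart),
    txEnd   = max(g$txEnd),
    stringsAsFactors = FALSE
  )
}

# Step 2: group the step 1 result by the pooled probe string, chrom and strand.
groups <- split(step1, list(step1$probe, step1$chrom, step1$strand), drop = TRUE)
step2  <- do.call(rbind, lapply(groups, pool_group))
rownames(step2) <- NULL
```

Row and column order may differ from the data.table output, so it is safer to look rows up by value (for example `step2[step2$probe == "7896761", ]`) than by position.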
