Duplicate entry pooling while averaging values in neighbouring columns

Views: 97  Posted: 2017/3/26 3:10:33  Tags: r, dataframe


Problem description

I am trying to do some fairly involved indexing while also averaging and pooling values and taking minima and maxima. To start, here is an example data.frame:

ID      ID2         probe       chrom   strand txStart  txEnd
Rest_3  uc001aah.4  8044649     chr1    0      14361    29370
Rest_4  uc001aah.4  7911309     chr1    0      14361    29370    
Rest_5  uc001aah.4  8171066     chr1    0      14361    29370           
Rest_6  uc001aah.4  8159790     chr1    0      14361    29370   

Rest_17 uc001abw.1  7896761     chr1    0      861120   879961
Rest_18 uc001abx.1  7896761     chr1    0      871151   879961
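
For anyone who wants to reproduce the problem, the table above could be rebuilt roughly as follows. This is only a sketch: the object name `df` is borrowed from the answer's code, and the column types (integer for probe, strand, txStart and txEnd) are an assumption, since the question does not state them.

```r
# Example data from the question, rebuilt as a data.frame.
# Assumptions: the object is called `df`, ID columns are character,
# and the numeric-looking columns are stored as integers.
df <- data.frame(
  ID      = c("Rest_3", "Rest_4", "Rest_5", "Rest_6", "Rest_17", "Rest_18"),
  ID2     = c("uc001aah.4", "uc001aah.4", "uc001aah.4", "uc001aah.4",
              "uc001abw.1", "uc001abx.1"),
  probe   = c(8044649L, 7911309L, 8171066L, 8159790L, 7896761L, 7896761L),
  chrom   = "chr1",   # length-1 values are recycled to all six rows
  strand  = 0L,
  txStart = c(14361L, 14361L, 14361L, 14361L, 861120L, 871151L),
  txEnd   = c(29370L, 29370L, 29370L, 29370L, 879961L, 879961L),
  stringsAsFactors = FALSE
)

str(df)
```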

I used duplicated() to find the repeated values in the ID2 column: uc001aah.4 appears 4 times. What I need, and do not know how to do, is to keep a single row for uc001aah.4 and pool the entries of the probe column (and some other columns) into a single cell, in Excel terms: 8044649, 7911309, 8171066, 8159790. The result should look like this:

ID                              ID2         probe                                   chrom   strand txStart  txEnd
Rest_3, Rest_4, Rest_5, Rest_6  uc001aah.4  8044649, 7911309, 8171066, 8159790      chr1    0      14361    29370
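
As a side note on the duplicated() step: on its own, duplicated() flags only the second and later occurrences of a value, so the first uc001aah.4 row is not marked. The sketch below uses two standalone vectors copied from the example (the names `id2`, `probe`, `later_copies`, `all_copies` and `pooled` are made up for illustration) to show how every member of a repeated group might be flagged, and how base R's tapply() could pool one column per ID2 value.

```r
# ID2 and probe values copied from the example table
id2   <- c("uc001aah.4", "uc001aah.4", "uc001aah.4", "uc001aah.4",
           "uc001abw.1", "uc001abx.1")
probe <- c(8044649L, 7911309L, 8171066L, 8159790L, 7896761L, 7896761L)

# duplicated() marks only the repeats that follow a first occurrence
later_copies <- duplicated(id2)

# scanning from both ends marks every member of a repeated group
all_copies <- duplicated(id2) | duplicated(id2, fromLast = TRUE)

# pool the probe values of each ID2 into one comma-separated string
pooled <- tapply(probe, id2, paste, collapse = ", ")

print(later_copies)
print(all_copies)
print(pooled)
```

This covers only the single-column case; pooling several columns at once is easier with a grouped operation over the whole table.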

The probe column contains duplicates as well:

ID      ID2         probe       chrom   strand txStart  txEnd
Rest_17 uc001abw.1  7896761     chr1    0      861120   879961
Rest_18 uc001abx.1  7896761     chr1    0      871151   879961

Here I need ID and ID2 to be pooled, taking the minimum of column txStart and the maximum of column txEnd, to end up with:

ID                  ID2                     probe       chrom   strand txStart  txEnd
Rest_17, Rest_18    uc001abw.1, uc001abx.1  7896761     chr1    0      861120   879961

I know this is a lot to ask, but if you show me how to do the first problem, I am sure I can work out how to apply it to the second.

Recommended answer

A solution using data.table:

library(data.table)
dt <- as.data.table(df)  # df is the example data.frame from the question
dt
#         ID        ID2   probe chrom strand txStart  txEnd
# 1:  Rest_3 uc001aah.4 8044649  chr1      0   14361  29370
# 2:  Rest_4 uc001aah.4 7911309  chr1      0   14361  29370
# 3:  Rest_5 uc001aah.4 8171066  chr1      0   14361  29370
# 4:  Rest_6 uc001aah.4 8159790  chr1      0   14361  29370
# 5: Rest_17 uc001abw.1 7896761  chr1      0  861120 879961
# 6: Rest_18 uc001abx.1 7896761  chr1      0  871151 879961

# step 1: remove duplicate ID2 and concatenate ID and probe.
# Note: here I assume that if ID2 is same, then so will be chrom, 
# strand, txStart and txEnd. If not, you can modify this similar 
# to what is in step 2.
dt.out <- dt[, lapply(.SD, function(x) paste(x, collapse=",")), 
          by=c("ID2", "chrom", "strand", "txStart", "txEnd")]

#           ID2 chrom strand txStart  txEnd                          ID                           probe
# 1: uc001aah.4  chr1      0   14361  29370 Rest_3,Rest_4,Rest_5,Rest_6 8044649,7911309,8171066,8159790
# 2: uc001abw.1  chr1      0  861120 879961                     Rest_17                         7896761
# 3: uc001abx.1  chr1      0  871151 879961                     Rest_18                         7896761

# step 2: remove duplicate probe and concatenate others, get min(txStart) and max(txEnd)
dt.out <- dt.out[ ,list(ID=paste(ID, collapse=","), ID2=paste(ID2, collapse=","), 
                       txStart=min(txStart), txEnd=max(txEnd)), 
                       by=c("probe", "chrom", "strand")]

#                              probe chrom strand                          ID                   ID2 txStart  txEnd
# 1: 8044649,7911309,8171066,8159790  chr1      0 Rest_3,Rest_4,Rest_5,Rest_6            uc001aah.4   14361  29370
# 2:                         7896761  chr1      0             Rest_17,Rest_18 uc001abw.1,uc001abx.1  861120 879961
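
The note in step 1 assumes that rows sharing an ID2 also share chrom, strand, txStart and txEnd. If that assumption does not hold, step 1 could be written in the style of step 2, grouping by ID2 alone. The following is a sketch of that variant rather than part of the original answer; the names `step1` and `step2` are made up here, and pasting the unique chrom and strand values is one possible choice for handling conflicts (it turns strand into a character column).

```r
library(data.table)

# Example data from the question (column types are an assumption)
dt <- data.table(
  ID      = c("Rest_3", "Rest_4", "Rest_5", "Rest_6", "Rest_17", "Rest_18"),
  ID2     = c("uc001aah.4", "uc001aah.4", "uc001aah.4", "uc001aah.4",
              "uc001abw.1", "uc001abx.1"),
  probe   = c(8044649L, 7911309L, 8171066L, 8159790L, 7896761L, 7896761L),
  chrom   = "chr1",
  strand  = 0L,
  txStart = c(14361L, 14361L, 14361L, 14361L, 861120L, 871151L),
  txEnd   = c(29370L, 29370L, 29370L, 29370L, 879961L, 879961L)
)

# Step 1 variant: group by ID2 only and summarise every other column
# explicitly, so differing coordinates within an ID2 are still handled.
step1 <- dt[, .(ID      = paste(ID, collapse = ","),
                probe   = paste(unique(probe), collapse = ","),
                chrom   = paste(unique(chrom), collapse = ","),
                strand  = paste(unique(strand), collapse = ","),
                txStart = min(txStart),
                txEnd   = max(txEnd)),
            by = ID2]

# Step 2, same idea as in the answer: collapse rows that share a probe
step2 <- step1[, .(ID      = paste(ID, collapse = ","),
                   ID2     = paste(ID2, collapse = ","),
                   txStart = min(txStart),
                   txEnd   = max(txEnd)),
               by = .(probe, chrom, strand)]

print(step2)
```

On the example data this yields the same two rows as the answer's version; the difference only shows up when rows with the same ID2 disagree in their coordinate columns.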
