R - Split data frame row into two rows
Problem description
I have 2 tables (data and a reference table; toy example below). Both tables have START and END positions, which I'd like to check for overlaps (using something like foverlaps from the data.table package) and then split the values as shown below.
> library(data.table)
> data <- data.table(ID=c(1,2,3), Chrom=c(1,1,2), Start=c(1,500,1000), End=c(900,5000,5000), Probes=c(899,4500,4000))
> Ref.table <- data.table(Chrom=c(1,2), Split=c(1000,2000))
> Ref.table
Chrom Split
    1  1000
    2  2000
> data
ID Chrom Start  End Probes
 1     1     1  900    899
 2     1   500 5000   4500
 3     2  1000 5000   4000
As you can see, ID 1 has no overlap with the reference table, so it would be left alone. IDs 2 and 3, however, I'd like to split based on Ref.table.
The resulting table I'd like to get is:
> result
ID Chrom Start  End Probes
 1     1     1  900    899
 2     1   500 1000    500
 2     1  1001 5000   4000
 3     2  1000 2000   1000
 3     2  2001 5000   3000
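As I'm sure you can see, there are two parts to this:
1. Split each range in two based on a separate table.
2. Divide the # of probes proportionally between the two parts.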
I've been searching for an R package that can do this (split ranges by chromosome arm), but haven't been able to find one that works as shown above. Any links to functions or packages would be appreciated, but I'm also willing to code this myself... with a little help.
So far, I've only been able to use foverlaps to determine whether there are overlaps. Example:
> foverlaps(Ref.table[data[14]$Chrom], data[14], which=TRUE)
   xid yid
1:   1   1
Recommended answer
Here's a possible foverlaps solution (as mentioned in the question).
The first two steps are simple and pretty much idiomatic: add an End column to Ref.table so that each split point becomes an interval that can overlap, then key both data sets by Chrom and the interval columns (in v1.9.5+ you can specify by.x and by.y instead), and simply run foverlaps.
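library(data.table)
# Turn each split point into a zero-width interval [Split, End]
setDT(Ref.table)[, End := Split]
# Key by all columns: Chrom, Split, End
setkey(Ref.table)
setkey(setDT(data), Chrom, Start, End)
# Overlap join; rows of data without a match get NA in Split and End
res <- foverlaps(data, Ref.table)
res
#    Chrom Split  End ID Start i.End Probes
# 1:     1    NA   NA  1     1   900    899
# 2:     1  1000 1000  2   500  5000   4500
# 3:     2  2000 2000  3  1000  5000   4000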
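As a sketch of the by.x variant mentioned above: foverlaps still needs the second table to be keyed, but the first one can be left unkeyed if its join columns are passed through by.x. The example below rebuilds the toy tables so that it runs on its own, and assumes the Probes value for ID 3 is 4000, as in the printed tables.

```r
library(data.table)

# Toy data from the question (Probes for ID 3 assumed to be 4000)
data <- data.table(ID = c(1, 2, 3), Chrom = c(1, 1, 2),
                   Start = c(1, 500, 1000), End = c(900, 5000, 5000),
                   Probes = c(899, 4500, 4000))
Ref.table <- data.table(Chrom = c(1, 2), Split = c(1000, 2000))

# y (Ref.table) must be keyed; its last two key columns are the interval
Ref.table[, End := Split]
setkey(Ref.table, Chrom, Split, End)

# x (data) stays unkeyed; by.x names its grouping and interval columns
res <- foverlaps(data, Ref.table, by.x = c("Chrom", "Start", "End"))
res
```

Because data is not re-sorted by a key here, the rows of res should come back in the original order of data.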
Now that we have the overlaps, we need to increase the data set size according to our matches: rows with a match are duplicated, while rows where is.na(Split) (which means no overlap was found) are kept as they are. I'm not sure whether this part could be done more efficiently.
res2 <- res[, if(is.na(Split)) .SD else rbind(.SD, .SD), by = .(ID, Chrom)]
## Or, if you only have one row per group, maybe
## res2 <- res[, if(is.na(Split)) .SD else .SD[c(1L,1L)], by = .(ID, Chrom)]
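One possible alternative for this duplication step, offered only as a sketch, avoids grouping altogether by repeating row indices with rep(). The table res below is a hand-built copy of the foverlaps output shown earlier, and the variable times is a name introduced here for illustration. Unlike the grouped version, this keeps the original column order of res.

```r
library(data.table)

# Hand-built copy of the foverlaps() result shown above
res <- data.table(Chrom = c(1, 1, 2),
                  Split = c(NA, 1000, 2000),
                  End = c(NA, 1000, 2000),
                  ID = c(1, 2, 3),
                  Start = c(1, 500, 1000),
                  i.End = c(900, 5000, 5000),
                  Probes = c(899, 4500, 4000))

# One copy for unmatched rows, two copies for rows with a split point
times <- ifelse(is.na(res$Split), 1L, 2L)
res2 <- res[rep(seq_len(nrow(res)), times = times)]
res2
```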
Now, the last two steps update the End and Start columns (the End column from data is called i.End after the join), and then recompute the Probes column from the new column values.
res2[!is.na(Split), `:=`(i.End = c(Split[1L], i.End[-1L]),
                         Start = c(Start[-1L], Split[1L] + 1L)),
     by = .(ID, Chrom)]
res2[!is.na(Split), Probes := i.End - Start]
res2
#    ID Chrom Split  End Start i.End Probes
# 1:  1     1    NA   NA     1   900    899
# 2:  2     1  1000 1000   500  1000    500
# 3:  2     1  1000 1000  1001  5000   3999
# 4:  3     2  2000 2000  1000  2000   1000
# 5:  3     2  2000 2000  2001  5000   2999
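Note that Probes is computed here as i.End - Start, which gives 3999 and 2999 for the second pieces rather than the 4000 and 3000 in the question's desired result. If the original probe count should instead be divided in proportion to width, one possible sketch follows. It assumes the first piece gets the share of the width up to the split point and the second piece gets the remainder; the table ov and the columns frac, left and right are hypothetical names used only for illustration.

```r
library(data.table)

# Hypothetical table: one row per original interval that contains a split point
ov <- data.table(ID = c(2, 3),
                 Start = c(500, 1000),
                 End = c(5000, 5000),
                 Split = c(1000, 2000),
                 Probes = c(4500, 4000))

# Share of the original width that lies before the split point
ov[, frac := (Split - Start) / (End - Start)]
# Probes assigned to the first piece, rounded to a whole number
ov[, left := round(Probes * frac)]
# The second piece receives whatever is left, so the total is preserved
ov[, right := Probes - left]
ov
```

Assigning the remainder to the second piece keeps left + right equal to the original Probes value even when rounding occurs.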
(You can remove unwanted columns if you wish)
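As a minimal sketch of that cleanup, the helper columns that came from Ref.table can be dropped and i.End renamed back to End. The table res2 below is a hand-built copy of the output shown above so that the snippet runs on its own.

```r
library(data.table)

# Hand-built copy of the final res2 shown above
res2 <- data.table(ID = c(1, 2, 2, 3, 3),
                   Chrom = c(1, 1, 1, 2, 2),
                   Split = c(NA, 1000, 1000, 2000, 2000),
                   End = c(NA, 1000, 1000, 2000, 2000),
                   Start = c(1, 500, 1001, 1000, 2001),
                   i.End = c(900, 1000, 5000, 2000, 5000),
                   Probes = c(899, 500, 3999, 1000, 2999))

# Drop the two columns contributed by Ref.table
res2[, c("Split", "End") := NULL]
# Restore the original name of the interval end column
setnames(res2, "i.End", "End")
# Put the columns in the order used in the question
setcolorder(res2, c("ID", "Chrom", "Start", "End", "Probes"))
res2
```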