R - Split data frame row into two rows

Question

I have 2 tables (data and reference; toy example below). These tables have START and END positions that I'd like to check for overlaps (using something like foverlaps from the data.table package) and then split the values as shown below.

data <- data.table(ID=c(1,2,3), Chrom=c(1,1,2), Start=c(1,500,1000), End=c(900,5000,5000), Probes=c(899,4500,4000))
Ref.table <- data.table(Chrom=c(1,2), Split=c(1000,2000))

>Ref.table
Chrom    Split
1        1000
2        2000

>data
ID    Chrom    Start    End    Probes
1     1        1        900    899
2     1        500      5000   4500
3     2        1000     5000   4000

As you can see, ID 1 has no overlap with the reference table, so it would be left alone. However, I'd like to split IDs 2 and 3 based on Ref.table.

The resulting table I'd like to get is:

>result
ID    Chrom    Start    End    Probes
1     1        1        900    899
2     1        500      1000   500
2     1        1001     5000   4000
3     2        1000     2000   1000
3     2        2001     5000   3000

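As I'm sure you can see, there are two parts to this:
1. Split the range in two based on a separate table.
2. Divide the number of probes proportionally between the two parts.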

I've been searching for an R package that can do this (split ranges by chromosome arm), but haven't been able to find one that does what is shown above. Any links to functions or packages would be appreciated, but I'm also willing to code this myself... with a little help.

So far, I've only been able to use foverlaps to determine whether there are overlaps. Example:

>foverlaps(Ref.table[data[14]$Chrom], data[14], which=TRUE)
     xid   yid
1:    1     1


Recommended answer

Here's a possible foverlaps solution (as mentioned in the Q).

The first two steps are simple and pretty much idiomatic: add an End column to Ref.table so that we have overlapping intervals, then key both data sets by Chrom and the interval columns (in v 1.9.5+ you can now specify by.x and by.y instead), and simply run foverlaps.

library(data.table)
setDT(Ref.table)[, End := Split]
setkey(Ref.table)
setkey(setDT(data), Chrom, Start, End)
res <- foverlaps(data, Ref.table)
res
#    Chrom Split  End ID Start i.End Probes
# 1:     1    NA   NA  1     1   900    899
# 2:     1  1000 1000  2   500  5000   4500
# 3:     2  2000 2000  3  1000  5000   4000
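As a minimal sketch of the by.x route mentioned above: assuming data.table v1.9.5 or later, the interval columns of data can be passed through by.x instead of keying data, while Ref.table still carries a key. The toy tables are rebuilt here so the snippet runs on its own.

```r
library(data.table)

# Toy tables from the question, rebuilt so this snippet is self-contained
data <- data.table(ID = c(1, 2, 3), Chrom = c(1, 1, 2),
                   Start = c(1, 500, 1000), End = c(900, 5000, 5000),
                   Probes = c(899, 4500, 4000))
Ref.table <- data.table(Chrom = c(1, 2), Split = c(1000, 2000))

# Turn each split point into a zero-width interval [Split, Split]
Ref.table[, End := Split]
# The second table still needs a key; its last two columns are the interval
setkey(Ref.table, Chrom, Split, End)

# by.x names the grouping column and the interval columns of `data`,
# so `data` itself does not have to be keyed
res <- foverlaps(data, Ref.table, by.x = c("Chrom", "Start", "End"))
res
```

Rows of data without a matching split point are kept with NA in Split, as in the keyed version.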

Now that we have the overlaps, we need to increase the data set size according to our matches. We can condition this on is.na(Split) (which means no overlap was found). I'm not sure if this part could be done more efficiently.

res2 <- res[, if(is.na(Split)) .SD else rbind(.SD, .SD), by = .(ID, Chrom)]
## Or, if you only have one row per group, maybe
## res2 <- res[, if(is.na(Split)) .SD else .SD[c(1L,1L)], by = .(ID, Chrom)]
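One possible vectorized alternative, offered as a sketch rather than a benchmarked improvement: repeat row indices with rep() so that unmatched rows appear once and matched rows twice. It assumes at most one split point per row, and the res table is typed in by hand from the output above so the snippet runs on its own.

```r
library(data.table)

# Hand-built copy of the foverlaps() result shown above
res <- data.table(
  Chrom  = c(1, 1, 2),
  Split  = c(NA, 1000, 2000),
  End    = c(NA, 1000, 2000),
  ID     = c(1, 2, 3),
  Start  = c(1, 500, 1000),
  i.End  = c(900, 5000, 5000),
  Probes = c(899, 4500, 4000)
)

# One copy for rows with no split point, two copies otherwise
times <- ifelse(is.na(res$Split), 1L, 2L)
res2 <- res[rep(seq_len(nrow(res)), times)]
res2
```

The column order differs from the grouped version (ID and Chrom are not moved to the front), but the later steps refer to columns by name, so they should work unchanged.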

Now, the last two steps will update the End and Start columns and then the Probes column according to the new column values

res2[!is.na(Split), `:=`(i.End = c(Split[1L], i.End[-1L]),
                         Start = c(Start[-1L], Split[1L] + 1L)), 
     by = .(ID, Chrom)]
res2[!is.na(Split), Probes := i.End - Start]
res2
#    ID Chrom Split  End Start i.End Probes
# 1:  1     1    NA   NA     1   900    899
# 2:  2     1  1000 1000   500  1000    500
# 3:  2     1  1000 1000  1001  5000   3999
# 4:  3     2  2000 2000  1000  2000   1000
# 5:  3     2  2000 2000  2001  5000   2999
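The Probes values above are simply i.End - Start. If the probes should instead be divided proportionally, as the question asks, one way to read "proportionally" is to weight each ID's original probe count by the length of each piece. The sketch below makes that assumption; the pieces table and the Total and Len columns are hypothetical names introduced only for illustration.

```r
library(data.table)

# Pieces after the split, plus each ID's original probe count (Total)
pieces <- data.table(
  ID    = c(1, 2, 2, 3, 3),
  Start = c(1, 500, 1001, 1000, 2001),
  End   = c(900, 1000, 5000, 2000, 5000),
  Total = c(899, 4500, 4500, 4000, 4000)
)

# Length of each piece, then its share of the ID's original probes
pieces[, Len := End - Start]
pieces[, Probes := round(Total * Len / sum(Len)), by = ID]
pieces$Probes
# 899  500 4000 1000 3000
```

For this toy data the result matches the table requested in the question. In general, rounding each piece independently may leave the pieces summing to slightly more or less than the original count, so a remainder adjustment may be needed.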

(You can remove unwanted columns if you wish)
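A short sketch of that cleanup, assuming the goal is the five-column layout from the question. The res2 table is typed in by hand from the output above so the snippet runs on its own.

```r
library(data.table)

# Hand-built copy of the final res2 shown above
res2 <- data.table(
  ID     = c(1, 2, 2, 3, 3),
  Chrom  = c(1, 1, 1, 2, 2),
  Split  = c(NA, 1000, 1000, 2000, 2000),
  End    = c(NA, 1000, 1000, 2000, 2000),
  Start  = c(1, 500, 1001, 1000, 2001),
  i.End  = c(900, 1000, 5000, 2000, 5000),
  Probes = c(899, 500, 3999, 1000, 2999)
)

# Drop the helper columns that came from Ref.table
res2[, c("Split", "End") := NULL]
# Give the interval end its original name back
setnames(res2, "i.End", "End")
# Restore the column order used in the question
setcolorder(res2, c("ID", "Chrom", "Start", "End", "Probes"))
res2
```

All three calls modify res2 by reference, so no reassignment is needed.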
