折叠相交区域 [英] Collapse intersecting regions

查看:59
本文介绍了折叠相交区域的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试找到一种方法来折叠具有相交范围的行(由开始"和停止"列表示),并将折叠后的值记录到新列中.例如,我有以下数据框:

I am trying to find a way to collapse rows with intersecting ranges, denoted by "start" and "stop" columns, and record the collapsed values into new columns. For example I have this data frame:

my.df<- data.frame(chrom=c(1,1,1,1,14,16,16), name=c("a","b","c","d","e","f","g"), start=as.numeric(c(0,70001,70203,70060, 40004, 50000872, 50000872)), stop=as.numeric(c(71200,71200,80001,71051, 42004, 50000890, 51000952)))


chrom name  start   stop
 1    a        0    71200
 1    b    70001    71200
 1    c    70203    80001
 1    d    70060    71051
14    e    40004    42004
16    f 50000872 50000890
16    g 50000872 51000952

我正在尝试查找重叠范围,并在开始"和停止"中记录折叠的重叠行所覆盖的最大范围,并记录折叠的行的名称,所以我会得到以下信息:

And I am trying to find the overlapping ranges and record the biggest range covered by the collapsed overlapping rows in "start" and "stop" and the names of the collapsed rows, so I would get this:

chrom start   stop      name
 1    70001    80001    a,b,c,d
14    40004    42004    e
16    50000872 51000952 f,g

我想我可以像这样使用IRanges软件包:

I think I could use the packages IRanges like this:

library(IRanges)
ranges <- split(IRanges(my.df$start, my.df$stop), my.df$chrom)

但是然后我很难获取折叠列:我已经尝试过findOvarlaps,但是

But then I have trouble getting the collapsed columns: I have tried with findOvarlaps but this

ov <- findOverlaps(ranges, ranges, type="any")

但是我认为这是不对的.

but I don't think this is right.

任何帮助将不胜感激.

推荐答案

IRanges是此类工作的理想人选.无需使用chrom变量.

IRanges is a good candidate for such job. No need to use chrom variable.

ir <- IRanges(my.df$start, my.df$stop)
## I create a new grouping variable Note the use of reduce here(performance issue)
my.df$group2 <- subjectHits(findOverlaps(ir, reduce(ir)))
# chrom name    start     stop group2
# 1     1    a    70001    71200      2
# 2     1    b    70203    80001      2
# 3     1    c    70060    71051      2
# 4    14    d    40004    42004      1
# 5    16    e 50000872 50000890      3
# 6    16    f 50000872 51000952      3

新的group2变量是范围指示器.现在使用data.table,我无法将数据转换为所需的输出:

The new group2 variable is the range indicator. Now using data.table I can't transform my data to the desired output:

library(data.table)
DT <- as.data.table(my.df)
DT[, list(start=min(start),stop=max(stop),
         name=list(name),chrom=unique(chrom)),
               by=group2]

# group2    start     stop  name chrom
# 1:      2    70001    80001 a,b,c     1
# 2:      1    40004    42004     d    14
# 3:      3 50000872 51000952   e,f    16

PS:此处的折叠变量名称不是字符串,而是因数的列表.这比使用粘贴的协作角色更有效且更易于访问.

PS: the collapsed variable name here is not string but a list of factor. This is more efficient and easier to access than a collapased character using paste for example.

编辑,在对OP进行了说明之后,我将通过chrom创建组变量.我的意思是现在为每个色度组调用Iranges代码.我稍加修改您的数据,以创建同一条染色体的间隔组.

EDIT after OP clarification, I will create the group variable by chrom. I mean the Iranges code now is called for each chrom group. I slightly modify your data, to create group of intervals the same chromosome.

my.df<- data.frame(chrom=c(1,1,1,1,14,16,16), 
                   name=c("a","b","c","d","e","f","g"),
                   start=as.numeric(c(0,3000,70203,70060, 40004, 50000872, 50000872)), 
                   stop=as.numeric(c(1,5000,80001,71051, 42004, 50000890, 51000952)))

library(data.table)
DT <- as.data.table(my.df)

## find interval for each chromsom
DT[,group := { 
      ir <-  IRanges(start, stop);
       subjectHits(findOverlaps(ir, reduce(ir)))
      },by=chrom]

## Now I group by group and chrom 
DT[, list(start=min(start),stop=max(stop),name=list(name),chrom=unique(chrom)),
   by=list(group,chrom)]

  group chrom    start     stop name chrom
1:     1     1        0        1    a     1
2:     2     1     3000     5000    b     1
3:     3     1    70060    80001  c,d     1
4:     1    14    40004    42004    e    14
5:     1    16 50000872 51000952  f,g    16

这篇关于折叠相交区域的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆