使用 data.table 函数 foverlaps 查找两个表中重叠范围的交集 [英] Find the intersection of overlapping ranges in two tables using data.table function foverlaps

查看:17
本文介绍了使用 data.table 函数 foverlaps 查找两个表中重叠范围的交集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想使用 foverlaps 来查找两个床文件的相交范围,并将任何包含重叠范围的行折叠成一行.在下面的示例中,我有两个包含基因组范围的表.这些表格被称为床"文件,它们在染色体中具有从零开始的坐标和从一开始的特征结束位置.例如,START=9、STOP=20 被解释为跨越基数 10 到 20,包括 10 到 20.这些床文件可以包含数百万行.无论提供要相交的两个文件的顺序如何,该解决方案都需要给出相同的结果.

I would like to use foverlaps to find the intersecting ranges of two bed files, and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are called "bed" files that have zero-based start coordinates and one-based ending positions of features in chromosomes. For example, START=9, STOP=20 is interpreted to span bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution would need to give the same result, regardless of the order in which the two files to be intersected are provided.

第一个表

> table1
   CHROMOSOME START STOP
1:          1     1   10
2:          1    20   50
3:          1    70  130
4:          X     1   20
5:          Y     5  200

第二张桌子

> table2
   CHROMOSOME START STOP
1:          1     5   12
2:          1    15   55
3:          1    60   65
4:          1   100  110
5:          1   130  131
6:          X    60   80
7:          Y     1   15
8:          Y    10   50

我在想新的 foverlaps 函数可能是一种非常快速的方法,可以在这两个表中找到相交范围以生成如下所示的表:

I was thinking that the new foverlaps function could be a very fast way to find the intersecting ranges in these two table to produce a table that would look like:

结果表:

> resultTable
   CHROMOSOME START STOP
1:          1     5   10
2:          1    20   50
3:          1   100  110
4:          Y     5   50  

这可能吗,或者在 data.table 中有更好的方法吗?

Is that possible, or is there a better way to do that in data.table?

我还想首先确认在一个表中,对于任何给定的 CHROMOSOME,STOP 坐标不与下一行的起始坐标重叠.例如,CHROMOSOME Y:1-15 和 CHROMOSOME Y:10-50 需要折叠为 CHROMOSOME Y:1-50(参见第二个表第 7 行和第 8 行).这不应该是这种情况,但该功能可能应该检查这一点.下面是如何折叠潜在重叠的真实示例:

I'd also like to first confirm that in one table, for any given CHROMOSOME, the STOP coordinate does not overlap with the start coordinate of the next row. For example, CHROMOSOME Y:1-15 and CHROMOSOME Y:10-50 would need to be collapsed to CHROMOSOME Y:1-50 (see Second Table Rows 7 and 8). This should not be the case, but the function should probably check for that. A real life example of how potential overlaps should be collapsed is below:

   CHROM  START   STOP
1:     1 721281 721619
2:     1 721430 721906
3:     1 721751 722042

期望的输出:

   CHROM  START   STOP
1:     1 721281 722042

创建示例表的功能如下:

Functions to create example tables are as follows:

table1 <- data.table(
   CHROMOSOME = as.character(c("1","1","1","X","Y")) ,
   START = c(1,20,70,1,5) ,
   STOP = c(10,50,130,20,200)
)

table2 <- data.table(
   CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")) ,
   START = c(5,15,60,100,130,60,1,10) ,
   STOP = c(12,55,65,110,131,80,15,50)
 )

推荐答案

@Seth 使用 data.table foverlaps 函数提供了最快的解决交叉重叠问题的方法.然而,这个解决方案没有考虑到输入的床文件可能有重叠的范围,需要减少到单个区域.@Martin Morgan 通过使用 GenomicRanges 包的解决方案解决了这个问题,该包同时进行了相交和范围缩小.但是,Martin 的解决方案没有使用 foverlaps 函数.@Arun 指出,表中不同行的重叠范围目前无法使用 foverlaps.感谢提供的答案,以及对 stackoverflow 的一些额外研究,我想出了这个混合解决方案.

@Seth provided the fastest way to solve the problem of intersection overlaps using the data.table foverlaps function. However, this solution did not take into account the fact that the input bed files may have overlapping ranges that needed to be reduced into single regions. @Martin Morgan solved that with his solution using the GenomicRanges package, that did both the intersecting and range reducing. However, Martin's solution didn't use the foverlaps function. @Arun pointed out that the overlapping ranges in different rows within a table was not currently possible using foverlaps. Thanks to the answers provided, and some additional research on stackoverflow, I came up with this hybrid solution.

在每个文件中创建没有重叠区域的示例 BED 文件.

Create example BED files without overlapping regions within each file.

chr <- c(1:22,"X","Y","MT")

#bedA contains 5 million rows
bedA <- data.table(
    CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
    START = rep(as.integer(seq(1,200000000,1000)),25),
    STOP = rep(as.integer(seq(500,200000000,1000)),25),
    key = c("CHROM","START","STOP")
    )

#bedB contains 500 thousand rows
bedB <- data.table(
  CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
  START = rep(as.integer(seq(200,200000000,10000)),25),
  STOP = rep(as.integer(seq(600,200000000,10000)),25),
  key = c("CHROM","START","STOP")
)

现在创建一个新的bed文件,其中包含bedA和bedB中的相交区域.

Now create a new bed file containing the intersecting regions in bedA and bedB.

#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))

user  system elapsed 
1.25    0.02    1.37 

#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))

user  system elapsed 
12.95    0.06   13.04 

identical(tmpA,tmpB)
[1] TRUE

现在,修改 bedA 和 bedB 使其包含重叠区域:

Now, modify bedA and bedB such that they contain overlapping regions:

#Create overlapping ranges
makeOverlaps <-  as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]

使用 foverlaps 或 GenomicRanges 函数测试具有重叠范围的床文件相交的时间.

Test time to intersect bed files with overlapping ranges using either the foverlaps or GenomicRanges fucntions.

#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))

user  system elapsed 
1.83    0.05    1.89 

#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))

user  system elapsed 
12.95    0.04   12.99 

identical(tmpC,tmpD)
[1] TRUE

获胜者:foverlaps

使用的功能

这是基于 foverlaps 的函数,只有在存在重叠范围(使用 rowShift 函数检查)时才会调用 GenomicRanges 函数 (reduceBed.GenomicRanges).

This is the function based upon foverlaps, and will only call the GenomicRanges function (reduceBed.GenomicRanges) if there are overlapping ranges (which are checked for using the rowShift function).

intersectBedFiles.foverlaps <- function(bed1,bed2) {
  require(data.table)
  bedKey <- c("CHROM","START","STOP")
  if(nrow(bed1)>nrow(bed2)) {
    bed <- foverlaps(bed1, bed2, nomatch = 0)
  } else {
    bed <- foverlaps(bed2, bed1, nomatch = 0)
  }
  bed[, START := pmax(START, i.START)]
  bed[, STOP := pmin(STOP, i.STOP)]
  bed[, `:=`(i.START = NULL, i.STOP = NULL)]
  if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
  if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = T)) {
    bed <- reduceBed.GenomicRanges(bed)
  }
  return(bed)
}

rowShift <- function(x, shiftLen = 1L) {
  #Note this function was described in this thread:
  #http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r<1] <- NA
  return(x[r])
}

reduceBed.GenomicRanges <- function(bed) {
  setnames(bed,colnames(bed),bedKey)
  if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
  grBed <- makeGRangesFromDataFrame(bed,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  grBed <- reduce(grBed)
  grBed <- data.table(
    CHROM=as.character(seqnames(grBed)),
    START=start(grBed),
    STOP=end(grBed),
    key = c("CHROM","START","STOP"))
  return(grBed)
}

此函数严格使用 GenomicRanges 包,产生相同的结果,但比 foverlaps 函数慢约 10 倍.

This function strictly used the GenomicRanges package, produces the same result, but is about 10 fold slower that the foverlaps funciton.

intersectBedFiles.GR <- function(bed1,bed2) {
  require(data.table)
  require(GenomicRanges)
  bed1 <- makeGRangesFromDataFrame(bed1,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  bed2 <- makeGRangesFromDataFrame(bed2,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  grMerge <- suppressWarnings(intersect(bed1,bed2))
  resultTable <- data.table(
    CHROM=as.character(seqnames(grMerge)),
    START=start(grMerge),
    STOP=end(grMerge),
    key = c("CHROM","START","STOP"))
  return(resultTable)
}

使用 IRanges 的附加比较

我找到了使用 IRanges 折叠重叠区域的解决方案,但它比 GenomicRanges 慢 10 倍以上.

I found a solution to collapse overlapping regions using IRanges but it is more than 10 fold slower than GenomicRanges.

reduceBed.IRanges <- function(bed) {
  bed.tmp <- bed
  bed.tmp[,group := { 
      ir <-  IRanges(START, STOP);
      subjectHits(findOverlaps(ir, reduce(ir)))
    }, by=CHROM]
  bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM), 
              START=min(START), 
              STOP=max(STOP)),
       by=list(group,CHROM)]
  setkeyv(bed.tmp,bedKey)
  bed[,group := NULL]
  return(bed.tmp[, -(1:2)])
}


system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))

user  system elapsed 
10.86    0.01   10.89 

system.time(bedD.reduced <- reduceBed.IRanges(bedC))

user  system elapsed 
137.12    0.14  137.58 

identical(bedC.reduced,bedD.reduced)
[1] TRUE

这篇关于使用 data.table 函数 foverlaps 查找两个表中重叠范围的交集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆