使用data.table函数foverlaps查找两个表中重叠范围的交集 [英] Find the intersection of overlapping ranges in two tables using data.table function foverlaps

查看:104
本文介绍了使用data.table函数foverlaps查找两个表中重叠范围的交集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想使用 foverlaps 查找两个床文件的相交范围,并将包含重叠范围的任何行折叠为单行.在下面的示例中,我有两个具有基因组范围的表.这些表称为床"文件,这些文件在染色体中具有从零开始的坐标和从一开始的特征的结束位置.例如,START = 9,STOP = 20被解释为跨越10到20的底数(包括10和20).这些床文件可以包含数百万行.无论提供两个要相交的文件的顺序如何,解决方案都需要给出相同的结果.

I would like to use foverlaps to find the intersecting ranges of two bed files, and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are called "bed" files that have zero-based start coordinates and one-based ending positions of features in chromosomes. For example, START=9, STOP=20 is interpreted to span bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution would need to give the same result, regardless of the order in which the two files to be intersected are provided.

第一张桌子

> table1
   CHROMOSOME START STOP
1:          1     1   10
2:          1    20   50
3:          1    70  130
4:          X     1   20
5:          Y     5  200

第二张表

> table2
   CHROMOSOME START STOP
1:          1     5   12
2:          1    15   55
3:          1    60   65
4:          1   100  110
5:          1   130  131
6:          X    60   80
7:          Y     1   15
8:          Y    10   50

我在想,新的foverlaps函数可能是在这两个表中找到相交范围以生成如下表的快速方法:

I was thinking that the new foverlaps function could be a very fast way to find the intersecting ranges in these two table to produce a table that would look like:

结果表:

> resultTable
   CHROMOSOME START STOP
1:          1     5   10
2:          1    20   50
3:          1   100  110
4:          Y     5   50  

有可能吗?或者在data.table中有更好的方法吗?

Is that possible, or is there a better way to do that in data.table?

我还要首先确认在一个表中,对于任何给定的染色体,STOP坐标与下一行的起始坐标不重叠.例如,需要将CHROMOSOME Y:1-15和CHROMOSOME Y:10-50折叠为CHROMOSOME Y:1-50(请参见第二表第7和8行).事实并非如此,但该功能可能应该对此进行检查.现实生活中如何重叠潜在重叠的示例如下:

I'd also like to first confirm that in one table, for any given CHROMOSOME, the STOP coordinate does not overlap with the start coordinate of the next row. For example, CHROMOSOME Y:1-15 and CHROMOSOME Y:10-50 would need to be collapsed to CHROMOSOME Y:1-50 (see Second Table Rows 7 and 8). This should not be the case, but the function should probably check for that. A real life example of how potential overlaps should be collapsed is below:

   CHROM  START   STOP
1:     1 721281 721619
2:     1 721430 721906
3:     1 721751 722042

所需的输出:

   CHROM  START   STOP
1:     1 721281 722042

创建示例表的功能如下:

Functions to create example tables are as follows:

table1 <- data.table(
   CHROMOSOME = as.character(c("1","1","1","X","Y")) ,
   START = c(1,20,70,1,5) ,
   STOP = c(10,50,130,20,200)
)

table2 <- data.table(
   CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")) ,
   START = c(5,15,60,100,130,60,1,10) ,
   STOP = c(12,55,65,110,131,80,15,50)
 )

推荐答案

@Seth提供了使用data.table foverlaps函数来解决交集重叠问题的最快方法.但是,此解决方案未考虑输入床文件可能具有重叠范围的事实,需要将这些范围缩小为单个区域. @Martin Morgan通过使用GenomicRanges软件包的解决方案解决了这一问题,该方法既减少了相交又减少了范围.但是,马丁的解决方案没有使用保险杠功能. @Arun指出,当前无法使用foverlaps来实现表中不同行中的重叠范围.多亏了提供的答案以及对stackoverflow的一些其他研究,我才想到了这种混合解决方案.

@Seth provided the fastest way to solve the problem of intersection overlaps using the data.table foverlaps function. However, this solution did not take into account the fact that the input bed files may have overlapping ranges that needed to be reduced into single regions. @Martin Morgan solved that with his solution using the GenomicRanges package, that did both the intersecting and range reducing. However, Martin's solution didn't use the foverlaps function. @Arun pointed out that the overlapping ranges in different rows within a table was not currently possible using foverlaps. Thanks to the answers provided, and some additional research on stackoverflow, I came up with this hybrid solution.

创建示例BED文件,每个文件中没有重叠区域.

Create example BED files without overlapping regions within each file.

chr <- c(1:22,"X","Y","MT")

#bedA contains 5 million rows
bedA <- data.table(
    CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
    START = rep(as.integer(seq(1,200000000,1000)),25),
    STOP = rep(as.integer(seq(500,200000000,1000)),25),
    key = c("CHROM","START","STOP")
    )

#bedB contains 500 thousand rows
bedB <- data.table(
  CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
  START = rep(as.integer(seq(200,200000000,10000)),25),
  STOP = rep(as.integer(seq(600,200000000,10000)),25),
  key = c("CHROM","START","STOP")
)

现在创建一个新的bed文件,其中包含bedA和bedB中的相交区域.

Now create a new bed file containing the intersecting regions in bedA and bedB.

#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))

user  system elapsed 
1.25    0.02    1.37 

#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))

user  system elapsed 
12.95    0.06   13.04 

identical(tmpA,tmpB)
[1] TRUE

现在,修改bedA和bedB使其包含重叠区域:

Now, modify bedA and bedB such that they contain overlapping regions:

#Create overlapping ranges
makeOverlaps <-  as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]

使用foverlaps或GenomicRanges功能测试与重叠范围的床文件相交的时间.

Test time to intersect bed files with overlapping ranges using either the foverlaps or GenomicRanges fucntions.

#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))

user  system elapsed 
1.83    0.05    1.89 

#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))

user  system elapsed 
12.95    0.04   12.99 

identical(tmpC,tmpD)
[1] TRUE

获胜者:圈盖

已使用的功能

这是基于翻转的函数,只有在存在重叠范围(已使用rowShift函数进行检查)时,才会调用GenomicRanges函数(reduceBed.GenomicRanges).

This is the function based upon foverlaps, and will only call the GenomicRanges function (reduceBed.GenomicRanges) if there are overlapping ranges (which are checked for using the rowShift function).

intersectBedFiles.foverlaps <- function(bed1,bed2) {
  require(data.table)
  bedKey <- c("CHROM","START","STOP")
  if(nrow(bed1)>nrow(bed2)) {
    bed <- foverlaps(bed1, bed2, nomatch = 0)
  } else {
    bed <- foverlaps(bed2, bed1, nomatch = 0)
  }
  bed[, START := pmax(START, i.START)]
  bed[, STOP := pmin(STOP, i.STOP)]
  bed[, `:=`(i.START = NULL, i.STOP = NULL)]
  if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
  if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = T)) {
    bed <- reduceBed.GenomicRanges(bed)
  }
  return(bed)
}

rowShift <- function(x, shiftLen = 1L) {
  #Note this function was described in this thread:
  #http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r<1] <- NA
  return(x[r])
}

reduceBed.GenomicRanges <- function(bed) {
  setnames(bed,colnames(bed),bedKey)
  if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
  grBed <- makeGRangesFromDataFrame(bed,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  grBed <- reduce(grBed)
  grBed <- data.table(
    CHROM=as.character(seqnames(grBed)),
    START=start(grBed),
    STOP=end(grBed),
    key = c("CHROM","START","STOP"))
  return(grBed)
}

此函数严格使用GenomicRanges软件包,产生相同的结果,但比起foverlaps函数慢大约10倍.

This function strictly used the GenomicRanges package, produces the same result, but is about 10 fold slower that the foverlaps funciton.

intersectBedFiles.GR <- function(bed1,bed2) {
  require(data.table)
  require(GenomicRanges)
  bed1 <- makeGRangesFromDataFrame(bed1,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  bed2 <- makeGRangesFromDataFrame(bed2,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  grMerge <- suppressWarnings(intersect(bed1,bed2))
  resultTable <- data.table(
    CHROM=as.character(seqnames(grMerge)),
    START=start(grMerge),
    STOP=end(grMerge),
    key = c("CHROM","START","STOP"))
  return(resultTable)
}

使用IRanges的其他比较

我找到了使用IRanges折叠重叠区域的解决方案,但是它比GenomicRanges慢10倍以上.

I found a solution to collapse overlapping regions using IRanges but it is more than 10 fold slower than GenomicRanges.

reduceBed.IRanges <- function(bed) {
  bed.tmp <- bed
  bed.tmp[,group := { 
      ir <-  IRanges(START, STOP);
      subjectHits(findOverlaps(ir, reduce(ir)))
    }, by=CHROM]
  bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM), 
              START=min(START), 
              STOP=max(STOP)),
       by=list(group,CHROM)]
  setkeyv(bed.tmp,bedKey)
  bed[,group := NULL]
  return(bed.tmp[, -(1:2)])
}


system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))

user  system elapsed 
10.86    0.01   10.89 

system.time(bedD.reduced <- reduceBed.IRanges(bedC))

user  system elapsed 
137.12    0.14  137.58 

identical(bedC.reduced,bedD.reduced)
[1] TRUE

这篇关于使用data.table函数foverlaps查找两个表中重叠范围的交集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆