是否可以使用R data.table函数foverlaps来查找两个表中的重叠范围的交集? [英] Is it possible to use the R data.table funcion foverlaps to find the intersection of overlapping ranges in two tables?

查看:987
本文介绍了是否可以使用R data.table函数foverlaps来查找两个表中的重叠范围的交集?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想使用foverlaps来查找两个床文件的相交范围,并将包含重叠范围的所有行折叠到单个行中。在下面的示例中,我有两个基因组范围的表。这些表称为床文件,其具有基于零的开始坐标和基于一个的染色体中的特征的结束位置。例如,START = 9,STOP = 20被解释为跨越基数10到20,包括10和20。这些床文件可以包含数百万行。



第一个表

 > table1 
CHROMOSOME START STOP
1:1 1 10
2:1 20 50
3:1 70 130
4:X 1 20
5: Y 5 200

第二表

 > table2 
CHROMOSOME START STOP
1:1 5 12
2:1 15 55
3:1 60 65
4:1 100 110
5: 1 130 131
6:X 60 80
7:Y 1 15
8:Y 10 50

我认为新的foverlaps函数可以是一个非常快速的方法来找到这两个表中的相交范围,以产生一个表如下:



结果表:

 > resultTable 
CHROMOSOME START STOP
1:1 5 10
2:1 20 50
3:1 100 110
4:Y 5 50

这是可能的,还是在data.table中有更好的方法?



我也想首先确认在一个表中,对于任何给定的CHROMOSOME,停止坐标不与下一行的开始坐标重叠。例如,染色体Y:1-15和染色体Y:10-50将需要被萎缩为染色体Y:1-50(参见第二表行7和8)。这不应该是这样,但是函数应该可以检查。关于潜在重叠应如何崩溃的现实生活示例如下:



CHROM START STOP
1:1 721281 721619
2:1 721430 721906
3:1 721751 722042



所需输出:



CHROM START STOP
1 :1 721281 722042



创建示例表的函数如下:

  table1 < -  data.table(
CHROMOSOME = ascharacter(c(1,1,1,X,Y)),
START = c(1,20,70,1,5),
STOP = c(10,50,130,20,200)


table2< - data.table(
CHROMOSOME = as.character(c(1,1,1,1,1,X,Y,Y)),
START = c(5,15,60,100,130,60,1,10),
STOP = c(12,55,65,110,131,80,15,50)

p>

解决方案

@Seth提供了使用data.table foverlaps函数解决交叉重叠问题的最快方法。然而,该解决方案没有考虑到输入床文件可能具有需要被缩减为单个区域的重叠范围的事实。 @Martin Morgan解决了他的解决方案使用GenomicRanges软件包,做了交叉和范围减少。然而,Martin的解决方案没有使用foverlaps函数。 @Arun指出,表中不同行的重叠范围当前不可能使用foverlaps。由于提供的答案和一些额外的stackoverflow研究,我想出了这个混合解决方案。



创建示例BED文件,在每个文件中没有重叠区域。 >

  chr < -  c(1:22,X,Y,MT)

#bedA包含500万行
bedA < - data.table(
CHROM = as.vector(sapply(chr,function(x)rep(x,200000))),
START = rep(as.integer(seq(1,200000000,1000)),25),
STOP = rep(as.integer(seq(500,200000000,1000)),25),
key = c(CHROM,START,STOP)


#bedB包含50万行
bedB START = rep(as.integer(seq(200,200000000,10000)), b $ b STOP = rep(as.integer(seq(600,200000000,10000)),25),
key = c(CHROM,START,STOP)

现在创建一个包含bedA和bedB中交叉区域的新床文件。

 #此解决方案使用foverlaps 
system.time(tmpA < - intersectBedFiles.foverlaps(bedA,bedB))

用户系统已经过
1.25 0.02 1.37

#此解决方案使用GenomicRanges
system.time(tmpB < - intersectBedFiles.GR(bedA,bedB))

用户系统已过
12.95 0.06 13.04

相同(tmpA,tmpB)
[1] TRUE

现在,修改bedA和bedB,使其包含重叠区域:

 #创建重叠范围
makeOverlaps< - as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC < - bedA [,STOP:= STOP + makeOverlaps,by = CHROM]
bedD < - bedB [,STOP:= STOP + makeOverlaps,by = CHROM]

使用foverlaps或GenomicRanges函数测试时间,以交叉重叠范围的床文件。

  #此解决方案使用foverlaps来找到交集,然后在结果上运行GenomicRanges 
system.time(tmpC < - intersectBedFiles.foverlaps(bedC,bedD))

用户系统已过
1.83 0.05 1.89

#此解决方案使用GenomicRanges
system.time(tmpD < - intersectBedFiles.GR(bedC,bedD))

用户系统已过
12.95 0.04 12.99

identical(tmpC,tmpD)
[1] TRUE

胜利者: foverlaps



使用的功能
$ b

这是基于foverlaps的函数,如果有重叠范围(使用rowShift函数检查),它将只调用GenomicRanges函数(reduceBed.GenomicRanges)。

  intersectBedFiles.foverlaps<  -  function(bed1,bed2){
require(data.table)
bedKey&如果(nrow(bed1)> nrow(bed2)){
bed< - foverlaps(bed1,bed2,nomatch = 0),c(CHROM,START,STOP
} else {
bed< - foverlaps(bed2,bed1,nomatch = 0)
}
bed [,START:= pmax(START,i.START)]
bed [,STOP:= pmin(STOP,i.STOP)]
bed [,`:=`(i.START = NULL,i.STOP = NULL)]
if setkeyv(bed,bedKey)
if(any(bed [,STOP + 1> = rowShift(START),by = CHROM] [,V1],na.rm = T)){
bed< - reductionBed.GenomicRanges(bed)
}
return(bed)
}

rowShift& (x,shiftLen = 1L){
#注意这个函数在这个线程中描述:
#http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous -row-in-an-r-data-table-calculation
r < - (1L + shiftLen):( length(x)+ shiftLen)
r [r <1] < - NA
return(x [r])
}

reduceBed.GenomicRanges < - function(bed){
setnames(bed,colnames(bed),bedKey)
if(!same(key(bed),bedKey))setkeyv(bed,bedKey)
grBed< - makeGRangesFromDataFrame(bed,
seqnames.field =CHROM,start.field = START,end.field =STOP)
grBed< - reduce(grBed)
grBed< - data.table(
CHROM = as.character(seqnames(grBed)) ,
START = start(grBed),
STOP = end(grBed),
key = c(CHROM,START,STOP))
return grBed)
}

此函数严格使用GenomicRanges包,产生相同的结果,是foverlaps函数的10倍。

  intersectBedFiles.GR<  -  function(bed1,bed2){
require(data.table)
require(GenomicRanges)
bed1< - makeGRangesFromDataFrame(bed1,
seqnames.field =CHROM,start.field =START =STOP)
bed2< - makeGRangesFromDataFrame(bed2,
seqnames.field =CHROM,start.field =START,end.field =STOP)
grMerge < - suppressWarnings(intersect(bed1,bed2))
resultTable< - data.table(
CHROM = as.character(seqnames(grMerge)),
START = start(grMerge) ,
STOP = end(grMerge),
key = c(CHROM,START,STOP))
return(resultTable)
}



使用IRanges的额外比较



我发现一个解决方案使用IRanges折叠重叠区域,但它比GenomicRanges慢10多倍。

  reduceBed.IRanges<  -  function(bed){
bed.tmp< - bed
bed.tmp [,group:= {
ir < - IRanges(START,STOP);
subjectHits(findOverlaps(ir,reduce(ir)))
} by = CHROM]
bed.tmp < - bed.tmp [,list(CHROM = unique
START = min(START),
STOP = max(STOP)),
by = list(group,CHROM)]
setkeyv(bed.tmp,bedKey)
bed [,group:= NULL]
return(bed.tmp [, - c(1:2),with = F])
}

$ b b system.time(bedC.reduced < - reductionBed.GenomicRanges(bedC))

用户系统已经过
10.86 0.01 10.89

system.time减少<-reducedBed.IRanges(bedC))

用户系统已经过
137.12 0.14 137.58

相同(bedC.reduced,bedD.reduced)
[1] TRUE


I would like to use foverlaps to find the intersecting ranges of two bed files, and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are called "bed" files that have zero-based start coordinates and one-based ending positions of features in chromosomes. For example, START=9, STOP=20 is interpreted to span bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution would need to give the same result, regardless of the order in which the two files to be intersected are provided.

First Table

> table1
   CHROMOSOME START STOP
1:          1     1   10
2:          1    20   50
3:          1    70  130
4:          X     1   20
5:          Y     5  200

Second Table

> table2
   CHROMOSOME START STOP
1:          1     5   12
2:          1    15   55
3:          1    60   65
4:          1   100  110
5:          1   130  131
6:          X    60   80
7:          Y     1   15
8:          Y    10   50

I was thinking that the new foverlaps function could be a very fast way to find the intersecting ranges in these two table to produce a table that would look like:

Result Table:

> resultTable
   CHROMOSOME START STOP
1:          1     5   10
2:          1    20   50
3:          1   100  110
4:          Y     5   50  

Is that possible, or is there a better way to do that in data.table?

I'd also like to first confirm that in one table, for any given CHROMOSOME, the STOP coordinate does not overlap with the start coordinate of the next row. For example, CHROMOSOME Y:1-15 and CHROMOSOME Y:10-50 would need to be collapsed to CHROMOSOME Y:1-50 (see Second Table Rows 7 and 8). This should not be the case, but the function should probably check for that. A real life example of how potential overlaps should be collapsed is below:

CHROM START STOP 1: 1 721281 721619 2: 1 721430 721906 3: 1 721751 722042

Desired output:

CHROM START STOP 1: 1 721281 722042

Functions to create example tables are as follows:

table1 <- data.table(
   CHROMOSOME = as.character(c("1","1","1","X","Y")) ,
   START = c(1,20,70,1,5) ,
   STOP = c(10,50,130,20,200)
)

table2 <- data.table(
   CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")) ,
   START = c(5,15,60,100,130,60,1,10) ,
   STOP = c(12,55,65,110,131,80,15,50)
 )

解决方案

@Seth provided the fastest way to solve the problem of intersection overlaps using the data.table foverlaps function. However, this solution did not take into account the fact that the input bed files may have overlapping ranges that needed to be reduced into single regions. @Martin Morgan solved that with his solution using the GenomicRanges package, that did both the intersecting and range reducing. However, Martin's solution didn't use the foverlaps function. @Arun pointed out that the overlapping ranges in different rows within a table was not currently possible using foverlaps. Thanks to the answers provided, and some additional research on stackoverflow, I came up with this hybrid solution.

Create example BED files without overlapping regions within each file.

chr <- c(1:22,"X","Y","MT")

#bedA contains 5 million rows
bedA <- data.table(
    CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
    START = rep(as.integer(seq(1,200000000,1000)),25),
    STOP = rep(as.integer(seq(500,200000000,1000)),25),
    key = c("CHROM","START","STOP")
    )

#bedB contains 500 thousand rows
bedB <- data.table(
  CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
  START = rep(as.integer(seq(200,200000000,10000)),25),
  STOP = rep(as.integer(seq(600,200000000,10000)),25),
  key = c("CHROM","START","STOP")
)

Now create a new bed file containing the intersecting regions in bedA and bedB.

#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))

user  system elapsed 
1.25    0.02    1.37 

#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))

user  system elapsed 
12.95    0.06   13.04 

identical(tmpA,tmpB)
[1] TRUE

Now, modify bedA and bedB such that they contain overlapping regions:

#Create overlapping ranges
makeOverlaps <-  as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]

Test time to intersect bed files with overlapping ranges using either the foverlaps or GenomicRanges fucntions.

#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))

user  system elapsed 
1.83    0.05    1.89 

#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))

user  system elapsed 
12.95    0.04   12.99 

identical(tmpC,tmpD)
[1] TRUE

The winner: foverlaps!

FUNCTIONS USED

This is the function based upon foverlaps, and will only call the GenomicRanges function (reduceBed.GenomicRanges) if there are overlapping ranges (which are checked for using the rowShift function).

intersectBedFiles.foverlaps <- function(bed1,bed2) {
  require(data.table)
  bedKey <- c("CHROM","START","STOP")
  if(nrow(bed1)>nrow(bed2)) {
    bed <- foverlaps(bed1, bed2, nomatch = 0)
  } else {
    bed <- foverlaps(bed2, bed1, nomatch = 0)
  }
  bed[, START := pmax(START, i.START)]
  bed[, STOP := pmin(STOP, i.STOP)]
  bed[, `:=`(i.START = NULL, i.STOP = NULL)]
  if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
  if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = T)) {
    bed <- reduceBed.GenomicRanges(bed)
  }
  return(bed)
}

rowShift <- function(x, shiftLen = 1L) {
  #Note this function was described in this thread:
  #http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r<1] <- NA
  return(x[r])
}

reduceBed.GenomicRanges <- function(bed) {
  setnames(bed,colnames(bed),bedKey)
  if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
  grBed <- makeGRangesFromDataFrame(bed,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  grBed <- reduce(grBed)
  grBed <- data.table(
    CHROM=as.character(seqnames(grBed)),
    START=start(grBed),
    STOP=end(grBed),
    key = c("CHROM","START","STOP"))
  return(grBed)
}

This function strictly used the GenomicRanges package, produces the same result, but is about 10 fold slower that the foverlaps funciton.

intersectBedFiles.GR <- function(bed1,bed2) {
  require(data.table)
  require(GenomicRanges)
  bed1 <- makeGRangesFromDataFrame(bed1,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  bed2 <- makeGRangesFromDataFrame(bed2,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  grMerge <- suppressWarnings(intersect(bed1,bed2))
  resultTable <- data.table(
    CHROM=as.character(seqnames(grMerge)),
    START=start(grMerge),
    STOP=end(grMerge),
    key = c("CHROM","START","STOP"))
  return(resultTable)
}

An additional comparison using IRanges

I found a solution to collapse overlapping regions using IRanges but it is more than 10 fold slower than GenomicRanges.

reduceBed.IRanges <- function(bed) {
  bed.tmp <- bed
  bed.tmp[,group := { 
      ir <-  IRanges(START, STOP);
      subjectHits(findOverlaps(ir, reduce(ir)))
    }, by=CHROM]
  bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM), 
              START=min(START), 
              STOP=max(STOP)),
       by=list(group,CHROM)]
  setkeyv(bed.tmp,bedKey)
  bed[,group := NULL]
  return(bed.tmp[,-c(1:2),with=F])
}


system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))

user  system elapsed 
10.86    0.01   10.89 

system.time(bedD.reduced <- reduceBed.IRanges(bedC))

user  system elapsed 
137.12    0.14  137.58 

identical(bedC.reduced,bedD.reduced)
[1] TRUE

这篇关于是否可以使用R data.table函数foverlaps来查找两个表中的重叠范围的交集?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆