基于范围重叠处理输入文件 [英] Processing the input file based on range overlap

查看:23
本文介绍了基于范围重叠处理输入文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个巨大的输入文件(其代表性示例如下所示为 input):

I have a huge input file (a representative sample of which is shown below as input):

> input
           CT1           CT2           CT3
1 chr1:200-400  chr1:250-450  chr1:400-800
2 chr1:800-970  chr2:200-500  chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400

我想通过遵循一些规则(如下所述)来处理它,以便我得到一个 output 像:

I want to process it by following some rules (described below) so that I get an output like:

 > output
              CT1 CT2 CT3
chr1:200-400    1   1   0
chr1:800-970    1   0   0
chr2:300-700    1   1   0
chr1:250-450    1   1   0
chr2:200-500    1   1   0
chr2:600-1000   0   1   1
chr1:400-800    0   0   1
chr1:700-870    0   1   1
chr2:700-1400   0   1   1

规则:获取每个索引(在这种情况下,第一个索引是 chr1:200-400),看看它是否与其他列中的值显着重叠.我的意思是范围至少有 50% 的重叠.如果是,在它所在的那一列下面写1,如果不是,写0.

Rules: Take every index (the first in this case is chr1:200-400), see if it overlap significantly with values in other column. From significantly I mean atleast 50% overlap of the range. If yes, write 1 below that column in which it exists, if not write 0.

现在我解释一下我是如何得到输出表的.从我们的输入中,我们获取第一个索引 input[1,1],即 chr1:200-400.由于它存在于第 1 列中,我们将在其下方写 1.现在我们将检查此范围或重叠范围是否存在于任何其他列中.此值仅与第二列 (CT2) 的第一个值 (chr1:250-450) 重叠.现在我们检查这种重叠是否显着.我们知道范围是 200 (chr1:200-400),与第二列值 (chr1:250-450) 的重叠是 150 (250-400).由于 150 的重叠大于原始范围 (200-400 = 200) OR 重叠范围 (250-450 = 200) 的一半(原始范围的 50% = 100).我们认为它是重叠的,并且在 CT2 列下方也分配了 1.由于该范围不与 CT3 中的任何值重叠,因此我们在 CT3 下方写入 0.对于输出的第 9 行也是如此.chr2:700-1400CT1 中不存在,所以在它下面写 0.对于 CT2,它与 chr2:600-1000 重叠.这里原来的范围是700(chr2:700-1400),一半是350.与CT2chr2:700-1000重叠是300(超出实际范围 chr2:600-1000).现在这个 300 的重叠不大于实际范围 700 的一半(CT3 的 chr2:700-1400),但它大于重叠范围 400 的一半(CT2 的 chr2:600-1000).因此,我们将其视为重叠并在 CT2 下方写 1.由于这个范围实际上存在于 CT3 中,所以我们在其下方也写了 1.

Now I explain how I got the output table. From our input we take 1st index input[1,1] which is chr1:200-400. As it exists in column 1 we will write 1 below it. Now we will check if this or an overlapping range exists in any other column. This value overlaps only with the first value (chr1:250-450) of the second column (CT2). Now we check if this overlap is significant or not. We know that range is 200 (chr1:200-400), the overlap with 2nd column value (which is chr1:250-450) is 150 (250-400). As this overlap of 150 is greater than the half (50% of original range = 100) of original range (200-400 = 200) OR overlapping range (250-450 = 200). We consider it an overlap and assign 1 below the column CT2 as well. As this range does not overlap with any value in CT3, we write 0 below CT3. Similarly for row 9 of output. chr2:700-1400 doesn't exist in CT1 so write 0 below it. For CT2 it overlaps with chr2:600-1000. The original range here is 700 (chr2:700-1400), half of it is 350. The overlap with chr2:700-1000 of CT2 is 300 (out of actual range chr2:600-1000). Now this overlap of 300 is not greater than the half of actual range 700 (chr2:700-1400 of CT3) but it is greater than the half of overlapping range 400 (chr2:600-1000 of CT2). Therefore, we consider it an overlap and write 1 below CT2. As this range actually exists in CT3, we write 1 below it as well.

这里是输入和输出的dput:

Here are the dput of input and output:

> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400", 
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450", 
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800", 
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1", 
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L, 
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L, 
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400", 
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500", 
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))

推荐答案

要做到这一点,需要许多步骤,以及 data.table 包中的一些概念,最值得注意的是非 equi 连接.我在整个过程中都对代码进行了注释,如果您想了解更多信息,建议您逐步运行它:

To do this requires many steps, and a number of concepts from the data.table package, most notably non-equi joins. I've commented the code throughout, and recommend running it step by step if you want more insight:

library(data.table)

input <- structure(list(CT1 = structure(1:3, .Label = 
  c("chr1:200-400", "chr1:800-970", "chr2:300-700"), class = 
  "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450", 
  "chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = 
  structure(1:3, .Label = c("chr1:400-800", "chr1:700-870", 
  "chr2:700-1400"), class = "factor")), .Names = c("CT1", 
  "CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))

output <- structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L),
  CT2 = c(1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 
  0L, 0L, 1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = 
  "data.frame", row.names = c("chr1:200-400", "chr1:800-970", 
  "chr2:300-700", "chr1:250-450", "chr2:200-500", "chr2:600-1000",
  "chr1:400-800", "chr1:700-870", "chr2:700-1400"))

# Builds a data.table by breaking a string like "chr1:300-700" into 
# three columns: chr, start, and end.
split_genomic_range <- function(str) {
  chr <- gsub(":.*", "", str)
  start <- gsub("-.*", "", gsub(".*:", "", str))
  end <- gsub(".*-", "", str)

  start <- as.numeric(start)
  end <- as.numeric(end)

  return(data.table(chr=chr, start=start, end=end))
}

# First break the input data.table into three new tables - we will need
# to perform non-equi joins of the index table (column CT1 in input) to
# the tables built from the other two columns.
ct1 <- split_genomic_range(input$CT1)
ct2 <- split_genomic_range(input$CT2)
ct3 <- split_genomic_range(input$CT3)

# Create an index table with all genomic ranges, then check for 
# overlaps in each of the three tables created from the input
# columns:
index_table <- unique(rbind(ct1, ct2, ct3))

# Returns entries from the index_table if they overlap > 50% any 
# entries in the lookup table or vice-versa
get_overlapping_ranges <- function(index_table, lookup_table) {
  # This function does two non-equi joins. First, it checks whether 
  # any entries in the index_table have a 50% overlap with any 
  # entries in  the lookup table. Second, it does the reverse, and  
  # checks whether any entries in the lookup_table have a 50% overlap 
  # with any entries in the index_table. This is required due to 
  # differing window sizes:
  # e.g. 0-20 significantly overlaps 10-100, but 10-100 does not 
  # significantly overlap 0-20.

  # We will need to create a "middle" column for each genomic range.
  # We will need to create copies of each table to do this, otherwise
  # they will end up with this new column as a side effect of the 
  # function call.
  index_copy <- copy(index_table)
  lookup_copy <- copy(lookup_table)

  index_copy[, middle := start + (end - start) * 0.5]
  lookup_copy[, middle := start + (end - start) * 0.5]

  # In the index_copy we will also need to create dummy columns for
  # the check. We need to do this so we can find the appropriate 
  # genomic window from the index table when we do the second  
  # non-equi join, otherwise the start and end columns will be 
  # clobbered. 
  index_copy[, start_index := start]
  index_copy[, end_index := end]

  # If the middle of a genomic range in the index table falls within 
  # a genomic range in the lookup table, then that genomic entry from 
  # the index table has a significant overlap with the corresponding 
  # in the lookup table
  index_overlaps <- index_copy[lookup_table, 
    on=.(chr, middle >= start, middle <= end),
    nomatch=0]

  # Test the reverse: does any entry in the lookup table 
  # significantly  overlap with any of the genomic windows in the 
  # index table?
  lookup_overlaps <- index_copy[lookup_copy,
    on=.(chr, start_index <= middle, end_index >= middle),
    nomatch=0]

  # Remove extra columns created by the non-equi join:
  index_overlaps <- index_overlaps[,.(chr, start, end)]
  lookup_overlaps <- lookup_overlaps[,.(chr, start, end)]

  # Combine results and remove any duplicates that arise, e.g. 
  # because a genomic window in the index_table significantly 
  # overlaps with multiple genomic windows in the lookup table, or 
  # because the overlap is significant in both comparisons (i.e. 
  # where both windows are the same size).
  overlaps <- unique(rbind(index_overlaps, lookup_overlaps))

  return(overlaps)
}

ranges_in_ct1 <- get_overlapping_ranges(index_table, ct1)
ranges_in_ct2 <- get_overlapping_ranges(index_table, ct2)
ranges_in_ct3 <- get_overlapping_ranges(index_table, ct3)

# Combine results, indicating which column each genomic range was 
# found to overlap:
overlaps <- rbind(
  CT1=ranges_in_ct1, CT2=ranges_in_ct2, CT3=ranges_in_ct3,
  idcol="input_column"
) 

# Recombine the chr, start, and end columns to the original format:
overlaps[, genomic_window := paste0(chr, ":", start, "-", end)]
overlaps[, c("chr", "start", "end") := NULL]

# Convert to the wide format, so that each input column either has a  
# 1 or 0 if the genomic window overlaps with 50% any other found in 
# that column
overlaps <- dcast(overlaps, genomic_window ~ input_column, 
                  fun.aggregate = length)

# Convert back to a data.frame:
overlaps <- as.data.frame(overlaps)
rownames(overlaps) <- overlaps$genomic_window
overlaps <- overlaps[,-1]

# Reorder so we can double check against the desired output:
overlaps <- overlaps[rownames(output),]
print(overlaps)

这将生成(几乎)您提供的相同输出:

This will generate (almost) the same output you've provided:

              CT1 CT2 CT3
chr1:200-400    1   1   0
chr1:800-970    1   0   0
chr2:300-700    1   1   0
chr1:250-450    1   1   0
chr2:200-500    1   1   0
chr2:600-1000   0   1   1
chr1:400-800    0   0   1
chr1:700-870    0   0   1
chr2:700-1400   0   1   1

唯一的区别是 chr1:700-870 在 CT2 列中有一个 0:这是因为它实际上不与 CT2 中的任何基因组窗口重叠,染色体 1 上唯一的另一个窗口是 chr1:250-450.

The only difference is that chr1:700-870 has a 0 in the CT2 column: this is because it actually doesn't overlap any of the genomic windows in CT2, the only other window on chromosome 1 was chr1:250-450.

这篇关于基于范围重叠处理输入文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆