Efficiently merging two data frames on a non-trivial criteria

Question

Answering this question last night, I spent a good hour trying to find a solution that didn't grow a data.frame in a for loop, without any success, so I'm curious if there's a better way to go about this problem.

The general case of the problem boils down to this:

  • Merge two data.frames
  • Entries in either data.frame can have 0 or more matching entries in the other.
  • We only care about entries that have 1 or more matches across both.
  • The match function is complex involving multiple columns in both data.frames

For a concrete example I will use similar data to the linked question:

genes <- data.frame(gene       = letters[1:5], 
                    chromosome = c(2,1,2,1,3),
                    start      = c(100, 100, 500, 350, 321),
                    end        = c(200, 200, 600, 400, 567))
markers <- data.frame(marker = 1:10,
                   chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                   position   = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

And our complex matching function:

# matching criteria, applies to a single entry from each data.frame
isMatch <- function(marker, gene) {
  return(
    marker$chromosome == gene$chromosome & 
    marker$position >= (gene$start - 10) &
    marker$position <= (gene$end + 10)
  )
}

The output should look like an sql INNER JOIN of the two data.frames, for entries where isMatch is TRUE. I've tried to construct the two data.frames so that there can be 0 or more matches in the other data.frame.
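To pin down the expected rows, the INNER JOIN can be written literally in base R: merge on `chromosome`, then keep the rows that pass the position test. This is only a reference sketch for small inputs, since the intermediate `candidates` table (a name introduced here for illustration) holds every gene/marker pair sharing a chromosome and can get large.

```r
genes <- data.frame(gene       = letters[1:5],
                    chromosome = c(2, 1, 2, 1, 3),
                    start      = c(100, 100, 500, 350, 321),
                    end        = c(200, 200, 600, 400, 567))
markers <- data.frame(marker     = 1:10,
                      chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                      position   = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

# Equi-join on chromosome: one row per gene/marker pair on the same chromosome
candidates <- merge(genes, markers, by = "chromosome")

# Keep only the pairs whose position falls within 10 of the gene's range
keep <- candidates$position >= candidates$start - 10 &
        candidates$position <= candidates$end + 10
joined <- candidates[keep, c("gene", "chromosome", "start", "end", "marker", "position")]

# Order by gene, then marker, and reset the row names for readability
joined <- joined[order(joined$gene, joined$marker), ]
rownames(joined) <- NULL
joined
```

For the example data this keeps 6 of the 16 candidate pairs.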

The solution I came up with is as follows:

joined <- data.frame()
for (i in seq_len(nrow(genes))) {
   # This repeated subsetting returns the same results as `isMatch` applied across
   # the `markers` data.frame for each entry in `genes`.
   matches <- markers[which(markers$chromosome == genes[i, "chromosome"]),]
   matches <- matches[which(matches$position >= (genes[i, "start"] - 10)),]
   matches <- matches[which(matches$position <= (genes[i, "end"] + 10)),]
   # matches may now be 0 or more rows, which we want to repeat the gene for:
   if(nrow(matches) != 0) {
     joined <- rbind(joined, cbind(genes[i,], matches[,c("marker", "position")]))
   }
}

Which gives the result:

   gene chromosome start end marker position
1     a          2   100 200      3       96
2     a          2   100 200      4      206
3     b          1   100 200      1      105
4     b          1   100 200      5      150
5     b          1   100 200      9      120
51    e          3   321 567      6      400

This is quite an ugly and clunky solution, but everything else I tried failed:

  • Using apply gave me a list where each element was a matrix, with no way to rbind them.
  • I can't specify the dimensions of joined up front, because I don't know how many rows I will need in the end.
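One common way around both points, offered here as a sketch rather than a definitive fix, is to `lapply` over row indices so that each piece stays a data.frame, then bind all the pieces once with `do.call(rbind, ...)`. Nothing is grown inside the loop and the final row count never has to be known in advance. The names `pieces` and `hit` are illustrative.

```r
genes <- data.frame(gene       = letters[1:5],
                    chromosome = c(2, 1, 2, 1, 3),
                    start      = c(100, 100, 500, 350, 321),
                    end        = c(200, 200, 600, 400, 567))
markers <- data.frame(marker     = 1:10,
                      chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                      position   = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

# One list element per gene: a data.frame of its matches, or NULL if none
pieces <- lapply(seq_len(nrow(genes)), function(i) {
  g <- genes[i, ]
  hit <- markers$chromosome == g$chromosome &
         markers$position >= g$start - 10 &
         markers$position <= g$end + 10
  if (!any(hit)) return(NULL)
  # The single gene row is recycled across all of its matching markers
  cbind(g, markers[hit, c("marker", "position")])
})

# Drop the genes without matches, then bind everything in a single call
pieces <- Filter(Negate(is.null), pieces)
joined <- do.call(rbind, pieces)
rownames(joined) <- NULL
joined
```

This still scans `markers` once per gene, so it tidies up the loop without changing its overall cost.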

I'm sure I will come up with a problem of this general form in the future. So what's the correct way to solve this kind of problem?

Recommended answer

A data.table solution: a rolling join satisfies the first inequality, and a vector scan then filters on the second. The join on the first inequality will have more rows than the final result (and therefore may run into memory issues), but it will be smaller than the straight-up merge in this answer.

library(data.table)

genes_start <- as.data.table(genes)
## create the start bound as a separate column to join to
genes_start[,`:=`(start_bound = start - 10)]
setkey(genes_start, chromosome, start_bound)

markers <- as.data.table(markers)
setkey(markers, chromosome, position)

new <- genes_start[
    ##join genes to markers
    markers, 
    ##rolling the last key column of genes_start (start_bound) forward
    ##to match the last key column of markers (position)
    roll = Inf, 
    ##inner join
    nomatch = 0
##rolling join leaves positions column from markers
##with the column name from genes_start (start_bound)
##now vector scan to fulfill the other criterion
][start_bound <= end + 10]
##change names and column order to match desired result in question
setnames(new,"start_bound","position")
setcolorder(new,c("chromosome","gene","start","end","marker","position"))
#    chromosome gene start end marker position
# 1:          1    b   100 200      1      105
# 2:          1    b   100 200      9      120
# 3:          1    b   100 200      5      150
# 4:          2    a   100 200      3       96
# 5:          2    a   100 200      4      206
# 6:          3    e   321 567      6      400

One could do a double join, but as it involves re-keying the data table before the second join, I don't think that it will be faster than the vector scan solution above.

##makes a copy of the genes object and keys it by end
genes_end <- as.data.table(genes)
genes_end[,`:=`(end_bound = end + 10, start = NULL, end = NULL)]
setkey(genes_end, chromosome, gene, end_bound)

## as before, wrapped in a similar join (but rolling backwards this time)
new_2 <- genes_end[
    setkey(
        genes_start[
        markers, 
        roll = Inf, 
        nomatch = 0
    ], chromosome, gene, start_bound), 
    roll = -Inf, 
    nomatch = 0
]
setnames(new_2, "end_bound", "position")
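
Both joins above handle one bound at a time. Newer data.table releases also support non-equi joins through the `on` argument (introduced in 1.9.8, as far as I know), which can express both inequalities in a single join. The following is a sketch of that different technique, not the approach used in this answer. It assumes a data.table version with non-equi join support, and the names `genes_dt`, `markers_dt`, and `new_3` are made up for illustration.

```r
library(data.table)

genes_dt <- data.table(gene       = letters[1:5],
                       chromosome = c(2, 1, 2, 1, 3),
                       start      = c(100, 100, 500, 350, 321),
                       end        = c(200, 200, 600, 400, 567))
markers_dt <- data.table(marker     = 1:10,
                         chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                         position   = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

## widen each gene's range by 10 on both sides
genes_dt[, `:=`(start_bound = start - 10, end_bound = end + 10)]

new_3 <- genes_dt[
    markers_dt,
    ## each condition reads: column of genes_dt, operator, column of markers_dt
    on = .(chromosome, start_bound <= position, end_bound >= position),
    ## inner join: drop markers that fall inside no gene
    nomatch = 0L,
    ## pick the output columns explicitly; i.position is the marker's position
    .(gene, chromosome, start, end, marker, position = i.position)
]
new_3
```

Unlike the rolling join, this returns every gene whose widened range contains a marker, which would matter if gene ranges on the same chromosome overlapped. For the example data it produces the same six rows.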
