Efficiently merging two data frames on a non-trivial criteria
Answering this question last night, I spent a good hour trying to find a solution that didn't grow a data.frame in a for loop, without any success, so I'm curious if there's a better way to go about this problem.
The general case of the problem boils down to this:
- Merge two data.frames.
- Entries in either data.frame can have 0 or more matching entries in the other.
- We only care about entries that have 1 or more matches across both.
- The match function is complex, involving multiple columns in both data.frames.
For a concrete example I will use similar data to the linked question:
genes <- data.frame(gene = letters[1:5],
chromosome = c(2,1,2,1,3),
start = c(100, 100, 500, 350, 321),
end = c(200, 200, 600, 400, 567))
markers <- data.frame(marker = 1:10,
chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
position = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))
And our complex matching function:
# matching criteria, applies to a single entry from each data.frame
isMatch <- function(marker, gene) {
  return(
    marker$chromosome == gene$chromosome &
    marker$position >= (gene$start - 10) &
    marker$position <= (gene$end + 10)
  )
}
The output should look like an SQL INNER JOIN of the two data.frames, for entries where isMatch is TRUE.
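The INNER JOIN semantics described here can be sketched directly in base R by evaluating isMatch on every (gene, marker) pair. This is only an illustration under stated assumptions: it redefines isMatch with the column spelled position, the names hits, idx and joined_outer are made up for the example, and it builds a full nrow(genes) x nrow(markers) logical matrix, so it is only reasonable for small inputs.

```r
genes <- data.frame(gene = letters[1:5],
                    chromosome = c(2, 1, 2, 1, 3),
                    start = c(100, 100, 500, 350, 321),
                    end = c(200, 200, 600, 400, 567))
markers <- data.frame(marker = 1:10,
                      chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                      position = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

# Vectorised matching criteria: works on equally long sets of rows
isMatch <- function(marker, gene) {
  marker$chromosome == gene$chromosome &
    marker$position >= (gene$start - 10) &
    marker$position <= (gene$end + 10)
}

# Logical matrix: rows index genes, columns index markers
hits <- outer(seq_len(nrow(genes)), seq_len(nrow(markers)),
              function(g, m) isMatch(markers[m, ], genes[g, ]))

# (gene row, marker row) index pairs where isMatch is TRUE
idx <- which(hits, arr.ind = TRUE)
idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]

# Repeat each gene once per matching marker, as an inner join would
joined_outer <- cbind(genes[idx[, "row"], ],
                      markers[idx[, "col"], c("marker", "position")])
rownames(joined_outer) <- NULL
```

Genes or markers without any match simply never appear in idx, which is what makes this an inner join rather than a left or right join.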
I've tried to construct the two data.frames so that there can be 0 or more matches in the other data.frame.
The solution I came up with is as follows:
joined <- data.frame()
for (i in seq_len(nrow(genes))) {
  # This repeated subsetting returns the same results as `isMatch` applied across
  # the `markers` data.frame for each entry in `genes`.
  matches <- markers[which(markers$chromosome == genes[i, "chromosome"]), ]
  matches <- matches[which(matches$position >= (genes[i, "start"] - 10)), ]
  matches <- matches[which(matches$position <= (genes[i, "end"] + 10)), ]
  # matches may now be 0 or more rows, which we want to repeat the gene for:
  if (nrow(matches) != 0) {
    joined <- rbind(joined, cbind(genes[i, ], matches[, c("marker", "position")]))
  }
}
Giving the results:
gene chromosome start end marker position
1 a 2 100 200 3 96
2 a 2 100 200 4 206
3 b 1 100 200 1 105
4 b 1 100 200 5 150
5 b 1 100 200 9 120
51 e 3 321 567 6 400
This is quite an ugly and clunky solution, but anything else I tried was met with failure:
- Use of apply gave me a list where each element was a matrix, with no way to rbind them.
- I can't specify the dimensions of joined first, because I don't know how many rows I will need in the end.
I'm sure I will come across a problem of this general form again in the future. So what's the correct way to solve this kind of problem?
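One way around both obstacles listed above, sketched here in base R, is to collect one data.frame per gene in a list with lapply and bind the pieces once at the end with do.call(rbind, ...). Nothing is grown inside a loop and the final row count does not need to be known in advance. The names pieces and joined_list are made up for this example, and the matching condition is written inline rather than taken from the original isMatch.

```r
genes <- data.frame(gene = letters[1:5],
                    chromosome = c(2, 1, 2, 1, 3),
                    start = c(100, 100, 500, 350, 321),
                    end = c(200, 200, 600, 400, 567))
markers <- data.frame(marker = 1:10,
                      chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                      position = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

# One list element per gene: a data.frame of its matches, or NULL if none
pieces <- lapply(seq_len(nrow(genes)), function(i) {
  keep <- markers$chromosome == genes$chromosome[i] &
    markers$position >= genes$start[i] - 10 &
    markers$position <= genes$end[i] + 10
  if (!any(keep)) return(NULL)
  # The single gene row is recycled across all matching marker rows
  cbind(genes[i, ], markers[keep, c("marker", "position")])
})

# Drop the genes without matches, then bind everything in a single call
pieces <- Filter(Negate(is.null), pieces)
joined_list <- do.call(rbind, pieces)
rownames(joined_list) <- NULL
```

Because each list element is already a data.frame rather than a matrix, rbind can combine them directly.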
A data.table solution: a rolling join to fulfill the first inequality, followed by a vector scan to satisfy the second inequality. The join on the first inequality will have more rows than the final result (and may therefore run into memory issues), but it will be smaller than the straight-up merge in this answer.
require(data.table)
genes_start <- as.data.table(genes)
## create the start bound as a separate column to join to
genes_start[, `:=`(start_bound = start - 10)]
setkey(genes_start, chromosome, start_bound)
markers <- as.data.table(markers)
setkey(markers, chromosome, position)
new <- genes_start[
  ## join genes to markers
  markers,
  ## rolling the last key column of genes_start (start_bound) forward
  ## to match the last key column of markers (position)
  roll = Inf,
  ## inner join
  nomatch = 0
  ## rolling join leaves positions column from markers
  ## with the column name from genes_start (start_bound)
  ## now vector scan to fulfill the other criterion
][start_bound <= end + 10]
## change names and column order to match desired result in question
setnames(new, "start_bound", "position")
setcolorder(new, c("chromosome", "gene", "start", "end", "marker", "position"))
# chromosome gene start end marker position
# 1: 1 b 100 200 1 105
# 2: 1 b 100 200 9 120
# 3: 1 b 100 200 5 150
# 4: 2 a 100 200 3 96
# 5: 2 a 100 200 4 206
# 6: 3 e 321 567 6 400
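For comparison, the straight-up merge that the rolling join is measured against can be sketched in base R: join on chromosome alone, then filter the candidate rows on both inequalities. The names candidates and joined_merge are made up for this example. The intermediate table holds every gene/marker pair that shares a chromosome, which is where the extra memory goes.

```r
genes <- data.frame(gene = letters[1:5],
                    chromosome = c(2, 1, 2, 1, 3),
                    start = c(100, 100, 500, 350, 321),
                    end = c(200, 200, 600, 400, 567))
markers <- data.frame(marker = 1:10,
                      chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                      position = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

# Equi-join on chromosome only: all gene/marker pairs per chromosome
candidates <- merge(genes, markers, by = "chromosome")

# Vector scan for both inequalities on the (larger) intermediate result
in_range <- candidates$position >= candidates$start - 10 &
  candidates$position <= candidates$end + 10
joined_merge <- candidates[in_range, ]
joined_merge <- joined_merge[order(joined_merge$gene, joined_merge$marker), ]
rownames(joined_merge) <- NULL
```

On this toy data the intermediate table has 16 rows against 6 in the final result; the gap grows with the number of genes and markers per chromosome.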
One could do a double join, but as it involves re-keying the data table before the second join, I don't think that it will be faster than the vector scan solution above.
## makes a copy of the genes object and keys it by end
genes_end <- as.data.table(genes)
genes_end[, `:=`(end_bound = end + 10, start = NULL, end = NULL)]
setkey(genes_end, chromosome, gene, end_bound)
## as before, wrapped in a similar join (but rolling backwards this time)
new_2 <- genes_end[
  setkey(
    genes_start[
      markers,
      roll = Inf,
      nomatch = 0
    ], chromosome, gene, start_bound),
  roll = -Inf,
  nomatch = 0
]
setnames(new_2, "end_bound", "position")
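As a further option not used in the answer above, more recent data.table releases (1.9.8 and later, as far as I know) accept inequalities in the on argument, so both bounds can be expressed in a single non-equi join. The sketch below assumes such a version is installed; genes_dt, markers_dt and new_3 are names made up for this example.

```r
library(data.table)

genes_dt <- data.table(gene = letters[1:5],
                       chromosome = c(2, 1, 2, 1, 3),
                       start = c(100, 100, 500, 350, 321),
                       end = c(200, 200, 600, 400, 567))
markers_dt <- data.table(marker = 1:10,
                         chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                         position = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

# Padded bounds as separate columns, leaving start and end untouched
genes_dt[, `:=`(start_bound = start - 10, end_bound = end + 10)]

# Non-equi inner join: same chromosome and
# start_bound <= position <= end_bound
new_3 <- genes_dt[markers_dt,
                  on = .(chromosome,
                         start_bound <= position,
                         end_bound >= position),
                  nomatch = 0L]
# Note: in the result, the start_bound and end_bound columns carry the
# marker's position value rather than the original gene bounds.
```

No keys need to be set for this form, since the join columns are given explicitly through on.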