R,在向量化范围内加入 [英] R, join within a range vectorised
问题描述
我正在尝试加入两个数据集,其中一个数据集中的变量(或沿基因组的位置)适合第二个数据集的范围(基因起始/终止位置).但是,位置不是唯一的,而是嵌套在其他列(染色体)中.基因起始/终止位置也是如此.我的目标是将每个位置与相应的注释和效果链接起来.
I'm trying to join two datasets where a variable (or position along a genome) in one dataset fits within a range in the second (gene start/stop position). However, positions are not unique, but nested within an additional column (chromosome). The same goes for the gene start/stop positions. My goal is to link each position with the corresponding annotation and effect.
例如:
library(sqldf)
set.seed(100)
a <- data.frame(
annotation = sample(c("this", "that", "other"), 3, replace=TRUE),
start = seq(1, 30, 10),
chr = sample(1:3, 3, replace=TRUE)
)
a$stop <- a$start + 10
b <- data.frame(
chr = sample(1:3, 3, replace=TRUE),
position = sample(1:15, 3, replace=TRUE),
effect = sample(c("high", "low"), 3, replace=TRUE)
)
SQL内部联接使我成为其中的一部分:
An SQL inner join gets me part of the way there:
df<-sqldf("SELECT a.start, a.stop, a.annotation, b.effect, b.position
FROM a, b
inner JOIN a b on(b.position >= a.start and b.position <= a.stop);")
但是,这不能解释每个染色体上位置的重复. 我在将其包装到循环或应用函数中时遇到了概念上的麻烦.
But this doesn't account for the repetition of position per chromosome. I'm having conceptual trouble wrapping this into a loop or apply function.
我不拘泥于SQL,这只是我以前解决一个简单问题的方式.我也不确定是否需要增加索引列,因为我有成千上万的染色体值.
I'm not wedded to SQL, it's just how I tackled a simpler problem previously. I'm also not sure that making an additional index column is appropriate since I have thousands of chromosome values.
我想要的输出如下所示:
My desired output would look like the following:
df$chr<-c("NA","2","2")
start stop annotation effect position chr
1 1 11 this high 3 NA
2 1 11 this high 10 NA
3 11 21 this low 14 2
将每个position
放置在正确的chr
上的start
和stop
点之间,或者给出NA
的位置,其中chr
上的点均不匹配.
Where each position
has been placed between the start
and stop
points on the correct chr
, or given NA
where no points on a chr
match.
推荐答案
我认为这是您追求的目标
I think this is what you're after:
sqldf(
"Select start, stop, annotation, effect, position,
case when a.chr = b.chr then a.chr else NULL end as chr
from b left join a
on b.position between a.start and a.stop
"
)
# start stop annotation effect position chr
# 1 1 11 this high 3 NA
# 2 1 11 this high 10 NA
# 3 11 21 this low 14 2
这篇关于R,在向量化范围内加入的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!