R，在向量化范围内加入 [英] R, join within a range vectorised

查看：89 发布时间：2020/5/9 0:35:50 r merge dplyr sqldf

本文介绍了R，在向量化范围内加入的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在尝试加入两个数据集，其中一个数据集中的变量(或沿基因组的位置)适合第二个数据集的范围(基因起始/终止位置).但是，位置不是唯一的，而是嵌套在其他列(染色体)中.基因起始/终止位置也是如此.我的目标是将每个位置与相应的注释和效果链接起来.

I'm trying to join two datasets where a variable (or position along a genome) in one dataset fits within a range in the second (gene start/stop position). However, positions are not unique, but nested within an additional column (chromosome). The same goes for the gene start/stop positions. My goal is to link each position with the corresponding annotation and effect.

例如:

library(sqldf)
set.seed(100)
a <- data.frame(
    annotation = sample(c("this", "that", "other"), 3, replace=TRUE),
    start = seq(1, 30, 10),
    chr = sample(1:3, 3, replace=TRUE)
  )
a$stop <- a$start + 10
b <- data.frame(
    chr = sample(1:3, 3, replace=TRUE),
    position = sample(1:15, 3, replace=TRUE),
    effect = sample(c("high", "low"), 3, replace=TRUE)
  )

SQL内部联接使我成为其中的一部分:

An SQL inner join gets me part of the way there:

df<-sqldf("SELECT a.start, a.stop, a.annotation, b.effect, b.position
    FROM a, b
    inner JOIN a b on(b.position >= a.start and b.position <= a.stop);")

但是，这不能解释每个染色体上位置的重复. 我在将其包装到循环或应用函数中时遇到了概念上的麻烦.

But this doesn't account for the repetition of position per chromosome. I'm having conceptual trouble wrapping this into a loop or apply function.

我不拘泥于SQL，这只是我以前解决一个简单问题的方式.我也不确定是否需要增加索引列，因为我有成千上万的染色体值.

I'm not wedded to SQL, it's just how I tackled a simpler problem previously. I'm also not sure that making an additional index column is appropriate since I have thousands of chromosome values.

我想要的输出如下所示:

My desired output would look like the following:

    df$chr<-c("NA","2","2")
      start stop annotation effect position chr
1     1   11       this   high        3  NA
2     1   11       this   high       10  NA
3    11   21       this    low       14   2

将每个position放置在正确的chr上的start和stop点之间，或者给出NA的位置，其中chr上的点均不匹配.

Where each position has been placed between the start and stop points on the correct chr, or given NA where no points on a chr match.

R，在向量化范围内加入 [英] R, join within a range vectorised

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

R，在向量化范围内加入 [英] R, join within a range vectorised

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭