在 R 中按范围合并 - 应用循环 [英] Merge by Range in R - Applying Loops

查看:33
本文介绍了在 R 中按范围合并 - 应用循环的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在这里发布了一个问题:R 中的匹配范围合并关于合并两个文件基于一个文件中的数字落入第二个文件中的范围.到目前为止,我还没有成功地拼凑代码来实现这一点.我遇到的问题是我使用的代码逐行比较文件.这是一个问题,因为 1.) 一个文件比另一个文件长得多,以及 2.) 我需要在较长文件中的每个范围对中扫描较短文件中的行 - 而不仅仅是同一行中的范围.

I posted a question here: Matched Range Merge in R about merging two files based on a number in one file falling into a range in the second file. Thus far, I have been unsuccessful in piecing together code to accomplish this. The issue I am having is that the code I'm using compares the files line by line. This is a problem because 1.) One file is much longer than the other file, and 2.) I need the lines in the shorter file to be scanned through every range pair in the longer file - not just the range in the same row.

我一直在使用原始问题中发布的函数,我觉得应该有一种方法可以将其应用于更通用的循环,将第一个文件中的每一行与第二个文件中的每一行进行比较,但是我还没有弄清楚.如果有人有任何建议,我将不胜感激.

I have been working with the functions posted in the original question, and I feel like there should be a way to apply it to a more general loop that compares every line in the first file to each line in the second file, but I haven't figured it out yet. If anyone has any suggestions, I would appreciate it.

**** 已编辑.

数据的性质是这样的:每个范围不一定是唯一的,尽管大多数是.它们的大小也不相同,有些完全属于其他.findInterval 因此会产生错误,因为无法按非降序"顺序对范围进行排序.

The nature of the data is this: Each range is not necessarily unique, although most are. They are also not of equal size, and some fall completely within others. findInterval therefore produces an error, because the ranges cannot be sorted in order to fall in "non-decending" order.

这是每个数据框的前 6 行:

Here are first 6 lines of each data frame:

file1test <- data.frame(SNP=c("rs2343", "rs211", "rs754", "rs854", "rs343", "rs626"), BP=c(860269, 369640, 861822, 367934, 706940, 717244))


file2 <- data.frame(Gene=c("E613", "E92", "E49", "E3543", "E11", "E233"), BP_start=c(367640, 621059, 721320, 860260, 861322, 879584), BP_end = c(368634, 622053, 722513, 879955, 879533, 894689))

因此,如您所见,第 5 行的范围位于第 4 行的范围内,第一个文件中的两个 SNP 位于第 4 行的范围内,但只有一个位于第 4 行的范围内第二行.

So, as you can see, the range on the 5th line lies within the range on the 4th line, and two SNPs from the first file fall within the range on the 4th line, but only one falls within the range on the second line.

包含 SNP 的第一个文件只有约 400 行.然而,包含范围的第二个文件大约有 20K.我想作为输出生成的是一个数据框,其中包含第一个文件(SNP)中的行,其中 BP 属于第二个文件中的 BP 范围.如果一个 SNP 落入两个范围,那么它就会出现两次,以此类推.

The first file, which contains the SNPs, has only ~400 lines. However the second file, containing the ranges, has about 20K. What I would like to produce as an output is a data frame containing the lines from the first file (the SNPs) with BPs that fall into the BP range in the second file. If a SNP falls into two ranges, then it would appear twice, etc.

推荐答案

Bioconductor 中的 GenomicRanges 包专为此类操作而设计.使用例如 read.delim 读取您的数据,以便

The GenomicRanges package in Bioconductor is designed for this type of operation. Read your data in with, e.g., read.delim so that

con <- textConnection("SNP     BP
rs064   12292
rs319   345367
rs285   700042")
snps <- read.delim(con, head=TRUE, sep="")

con <- textConnection("Gene    BP_start    BP_end
E613    345344      363401
E92     694501      705408
E49     362370      368340") ## missing trailing digit on BP_end??
genes <- read.delim(con, head=TRUE, sep="")

然后从每个创建IRanges"

then create 'IRanges' out of each

library(IRanges)
isnps <- with(snps, IRanges(BP, width=1, names=SNP))
igenes <- with(genes, IRanges(BP_start, BP_end, names=Gene))

(注意坐标系,IRanges 期望 start 和 end 包含在范围内;此外,当 end = start - 1 时,end >= start 期望 0 宽度范围).然后找到与基因('subject')重叠的SNP(IRanges术语中的'query')

(pay attention to coordinate systems, IRanges expects start and end to be included in the range; also, end >= start expect for 0-width ranges when end = start - 1). Then find the SNPs ('query' in IRanges terminology) that overlap the genes ('subject')

olaps <- findOverlaps(isnps, igenes)

两个 SNP 重叠

> queryHits(olaps)
[1] 2 3

它们重叠了第一个和第二个基因

and they overlap the first and second genes

> subjectHits(olaps)
[1] 1 2

如果一个查询与多个基因重叠,它会在 queryHits 中重复(反之亦然).然后您可以将您的数据框合并为

If a query overlapped multiple genes, it would have been repeated in queryHits (and vice versa). You could then merge your data frames as

> cbind(snps[queryHits(olaps),], genes[subjectHits(olaps),])
    SNP     BP Gene BP_start BP_end
2 rs319 345367 E613   345344 363401
3 rs285 700042  E92   694501 705408

通常基因和 SNP 具有染色体和链('+'、'-' 或 '*' 表示该链不重要)信息,并且您希望在这些上下文中进行重叠;您可以创建 'GRanges'(基因组范围),而不是创建 'IRanges' 实例,随后的簿记将由您负责

Usually genes and SNPs have chromosome and strand ('+', '-', or '*' to indicate that strand isn't important) information, and you'd want to do overlaps in the context of these; instead of creating 'IRanges' instances, you'd create 'GRanges' (genomic ranges) and the subsequent book-keeping would be taken care of for you

library(GenomicRanges)
isnps <-
    with(snps, GRanges("chrA", IRanges(BP, width=1, names=SNP), "*")
igenes <-
    with(genes, GRanges("chrA", IRanges(BP_start, BP_end, names=Gene), "+"))

这篇关于在 R 中按范围合并 - 应用循环的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆