Merge by Range in R - Applying Loops


Question

I posted a question here: Matched Range Merge in R, about merging two files based on whether a number in one file falls into a range in the second file. So far I have been unable to piece together code that does this. The problem is that the code I'm using compares the files line by line. That fails because (1) one file is much longer than the other, and (2) I need each line in the shorter file to be checked against every range in the longer file, not just the range in the same row.

I have been working with the functions posted in the original question, and I feel there should be a way to apply them in a more general loop that compares every line in the first file to every line in the second file, but I haven't figured it out yet. Any suggestions would be appreciated.

**** Edited.

The nature of the data is this: the ranges are not necessarily unique, although most are. They are also not of equal size, and some fall completely within others. findInterval therefore produces an error, because the ranges cannot be sorted into the "non-descending" order it requires.

Here are the first 6 lines of each data frame:

file1test <- data.frame(SNP=c("rs2343", "rs211", "rs754", "rs854", "rs343", "rs626"), BP=c(860269, 369640, 861822, 367934, 706940, 717244))


file2 <- data.frame(Gene=c("E613", "E92", "E49", "E3543", "E11", "E233"), BP_start=c(367640, 621059, 721320, 860260, 861322, 879584), BP_end = c(368634, 622053, 722513, 879955, 879533, 894689))

So, as you can see, the range on the 5th line lies within the range on the 4th line. Two SNPs from the first file fall within the range on the 4th line, but only one of them also falls within the range on the 5th line.

The first file, which contains the SNPs, has only ~400 lines, whereas the second file, which contains the ranges, has about 20K. The output I would like is a data frame containing the lines from the first file (the SNPs) whose BP falls into a BP range in the second file. If a SNP falls into two ranges, it should appear twice, and so on.

Recommended answer

The GenomicRanges package in Bioconductor is designed for this type of operation. Read your data in with, e.g., read.delim, so that you have two data frames:

con <- textConnection("SNP     BP
rs064   12292
rs319   345367
rs285   700042")
snps <- read.delim(con, head=TRUE, sep="")

con <- textConnection("Gene    BP_start    BP_end
E613    345344      363401
E92     694501      705408
E49     362370      368340") ## missing trailing digit on BP_end??
genes <- read.delim(con, head=TRUE, sep="")

Then create 'IRanges' out of each:

library(IRanges)
isnps <- with(snps, IRanges(BP, width=1, names=SNP))
igenes <- with(genes, IRanges(BP_start, BP_end, names=Gene))

(Pay attention to coordinate systems: IRanges expects both start and end to be included in the range. Also, end >= start is required, except for 0-width ranges, where end = start - 1.) Then find the SNPs ('query' in IRanges terminology) that overlap the genes ('subject'):

olaps <- findOverlaps(isnps, igenes)

Two SNPs overlap

> queryHits(olaps)
[1] 2 3

and they overlap the first and second genes

> subjectHits(olaps)
[1] 1 2
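
As a cross-check that does not need Bioconductor, the same inclusive test (BP_start <= BP <= BP_end) can be sketched in base R by comparing every SNP with every gene via outer(). This is a different technique from findOverlaps and is shown only for illustration; the names `inside` and `hits` are made up here. With ~400 SNPs and ~20K ranges the logical matrix would have about 8 million cells, which should be manageable, though it will not scale as well as IRanges.

```r
# Same toy data as above, built directly as data frames
snps <- data.frame(SNP = c("rs064", "rs319", "rs285"),
                   BP  = c(12292, 345367, 700042),
                   stringsAsFactors = FALSE)
genes <- data.frame(Gene     = c("E613", "E92", "E49"),
                    BP_start = c(345344, 694501, 362370),
                    BP_end   = c(363401, 705408, 368340),
                    stringsAsFactors = FALSE)

# inside[i, j] is TRUE when SNP i lies within gene j (endpoints included)
inside <- outer(snps$BP, genes$BP_start, ">=") &
          outer(snps$BP, genes$BP_end,   "<=")

# One row per (SNP, gene) pair; the columns are named "row" and "col"
hits <- which(inside, arr.ind = TRUE)
hits[, "row"]  # SNP indices, the analogue of queryHits(olaps)
hits[, "col"]  # gene indices, the analogue of subjectHits(olaps)
```

On this data the indices agree with the findOverlaps result shown above: SNPs 2 and 3, genes 1 and 2.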

If a query overlapped multiple genes, it would be repeated in queryHits (and likewise a gene overlapped by several SNPs would be repeated in subjectHits). You could then merge your data frames as

> cbind(snps[queryHits(olaps),], genes[subjectHits(olaps),])
    SNP     BP Gene BP_start BP_end
2 rs319 345367 E613   345344 363401
3 rs285 700042  E92   694501 705408
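
Applied to the six-row sample from the question, the same steps might look like the sketch below. It assumes the IRanges package is installed and reuses the question's `file1test` and `file2`; `merged` is just an illustrative name. Because rs754 lies inside both E3543 and E11, it appears twice in the result, which is the behaviour the question asks for. SNPs that fall into no range are dropped.

```r
library(IRanges)

# Sample data from the question
file1test <- data.frame(
  SNP = c("rs2343", "rs211", "rs754", "rs854", "rs343", "rs626"),
  BP  = c(860269, 369640, 861822, 367934, 706940, 717244),
  stringsAsFactors = FALSE)
file2 <- data.frame(
  Gene     = c("E613", "E92", "E49", "E3543", "E11", "E233"),
  BP_start = c(367640, 621059, 721320, 860260, 861322, 879584),
  BP_end   = c(368634, 622053, 722513, 879955, 879533, 894689),
  stringsAsFactors = FALSE)

# Width-1 ranges for the SNPs, closed [start, end] ranges for the genes
isnps  <- with(file1test, IRanges(BP, width = 1, names = SNP))
igenes <- with(file2, IRanges(BP_start, BP_end, names = Gene))

# Every (SNP, gene) pair where the SNP position lies inside the gene range
olaps <- findOverlaps(isnps, igenes)

# One output row per overlapping pair
merged <- cbind(file1test[queryHits(olaps), ], file2[subjectHits(olaps), ])
rownames(merged) <- NULL
merged
```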

Usually genes and SNPs come with chromosome and strand information ('+', '-', or '*' to indicate that strand isn't important), and you'd want to compute overlaps in that context. Instead of creating 'IRanges' instances, you'd create 'GRanges' (genomic ranges), and the subsequent book-keeping would be taken care of for you:

library(GenomicRanges)
isnps <-
    with(snps, GRanges("chrA", IRanges(BP, width=1, names=SNP), "*"))
igenes <-
    with(genes, GRanges("chrA", IRanges(BP_start, BP_end, names=Gene), "+"))
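
To illustrate that book-keeping, here is a minimal sketch in which each table carries a chromosome column. The `chr` column and its values are assumptions made up for this example (the data in the question has no such column), the object names `gsnps`, `ggenes` and `merged` are illustrative, and the GenomicRanges package must be installed. SNPs are given strand "*" so that they can match genes on either strand.

```r
library(GenomicRanges)

# Toy data from above, plus a hypothetical 'chr' column (illustration only)
snps <- data.frame(SNP = c("rs064", "rs319", "rs285"),
                   BP  = c(12292, 345367, 700042),
                   chr = c("chr1", "chr1", "chr2"),
                   stringsAsFactors = FALSE)
genes <- data.frame(Gene     = c("E613", "E92", "E49"),
                    BP_start = c(345344, 694501, 362370),
                    BP_end   = c(363401, 705408, 368340),
                    chr      = c("chr1", "chr1", "chr2"),
                    stringsAsFactors = FALSE)

gsnps  <- with(snps,
               GRanges(chr, IRanges(BP, width = 1, names = SNP), "*"))
ggenes <- with(genes,
               GRanges(chr, IRanges(BP_start, BP_end, names = Gene), "+"))

# An overlap is reported only when the chromosome matches as well
olaps <- findOverlaps(gsnps, ggenes)

merged <- cbind(snps[queryHits(olaps), ],
                genes[subjectHits(olaps), c("Gene", "BP_start", "BP_end")])
rownames(merged) <- NULL
merged
```

In this made-up layout rs285 no longer matches E92: its position lies inside the E92 range, but the two sit on different chromosomes, so only the rs319/E613 pair remains.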
