Comparing multiple columns in different data sets to find values within range R

Problem description

I have two datasets. One, called domain (d), has general information about genes; the other is a table called mutation (m). Both tables have a column called Gene.name, which I'll use for the lookup. The two datasets do not have the same number of columns or rows.

I want to go through all the data in the file mutation and check whether the value in column Gene.name also exists in the file domain. If it does, I want to check whether the value in column Mutation lies between the columns "Start" and "End" (it can be equal to Start or End). If it does, I want to print it out to a new table with the merged columns: Gene.name, Mutation, and the domain information. If the gene doesn't exist in domain, ignore it.

So this is what I have so far:

d<-read.table("domains.txt")

d
Gene.name  Domain                 Start  End
ABCF1      low_complexity_region  2      13
DKK1       low_complexity_region  25     39
ABCF1      AAA                    328    532
F2         coiled_coil_region     499    558

m<-read.table("mutations.tx")

m
Gene.name  Mutation
ABCF1      10
DKK1       21
ABCF1      335
xyz        15
F2         499

newfile<-m[, list(new=findInterval(d(c(d$Start, d$End)),by'=Gene.Name']

My code isn't working, and after reading a lot of different questions and answers I'm even more confused. Any help would be great.

I'd like my final data to look like this:

Gene.name  Mutation  Domain
DKK1       21        low_complexity_region
ABCF1      335       AAA
F2         499       coiled_coil_region

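Solution

A merge and subset should get you there (though I think your intended result doesn't match your description of what you want: DKK1's mutation at 21 falls outside its 25 to 39 domain, while ABCF1's mutation at 10 falls inside 2 to 13):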

result <- merge(d,m,by="Gene.name")
result[with(result,Mutation >= Start & Mutation <= End),]

#  Gene.name                Domain Start End Mutation
#1     ABCF1 low_complexity_region     2  13       10
#4     ABCF1                   AAA   328 532      335
#6        F2    coiled_coil_region   499 558      499
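
For anyone who wants to try this without the input files, here is a minimal self-contained sketch of the same merge-and-subset idea. It assumes the tables look like the ones printed in the question, so they are rebuilt inline with data.frame() rather than read from disk; the names in_range and hits are only illustrative. It also trims the result to the three columns the question asked for.

```r
# Rebuild the example tables inline (values copied from the question);
# in practice these would come from read.table() on the real files.
d <- data.frame(
  Gene.name = c("ABCF1", "DKK1", "ABCF1", "F2"),
  Domain    = c("low_complexity_region", "low_complexity_region",
                "AAA", "coiled_coil_region"),
  Start     = c(2, 25, 328, 499),
  End       = c(13, 39, 532, 558),
  stringsAsFactors = FALSE
)
m <- data.frame(
  Gene.name = c("ABCF1", "DKK1", "ABCF1", "xyz", "F2"),
  Mutation  = c(10, 21, 335, 15, 499),
  stringsAsFactors = FALSE
)

# Inner join on Gene.name: every mutation is paired with every domain of
# the same gene, and genes present in only one table (e.g. "xyz") drop out.
result <- merge(d, m, by = "Gene.name")

# Keep pairs where the mutation lies inside the domain, endpoints included.
in_range <- result$Mutation >= result$Start & result$Mutation <= result$End

# Keep only the columns requested in the question and reset the row names.
hits <- result[in_range, c("Gene.name", "Mutation", "Domain")]
rownames(hits) <- NULL
print(hits)
```

Because merge() pairs each mutation with every domain of the same gene before filtering, the intermediate table can get large when genes have many domains and many mutations; for data of the size shown here that should not matter.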
