Comparing multiple columns in different data sets to find values within range R
Question
I have two datasets. One, called domain (d), has general information about genes; the other is a table called mutation (m). Both tables have a column called Gene.name, which I'll use for the lookup. The two datasets do not have the same number of columns or rows.
I want to go through all the data in the file mutation and check whether the value in column Gene.name also exists in the file domain. If it does, I want to check whether the value in column Mutation lies between the columns "Start" and "End" (it can be equal to Start or End). If it does, I want to write it to a new table with the merged columns: Gene.name, Mutation, and the domain information. If it doesn't exist, ignore it.
So this is what I have so far:
d <- read.table("domains.txt", header = TRUE)
d
Gene.name Domain Start End
ABCF1 low_complexity_region 2 13
DKK1 low_complexity_region 25 39
ABCF1 AAA 328 532
F2 coiled_coil_region 499 558
m <- read.table("mutations.tx", header = TRUE)
m
Gene.name Mutation
ABCF1 10
DKK1 21
ABCF1 335
xyz 15
F2 499
newfile<-m[, list(new=findInterval(d(c(d$Start, d$End)),by'=Gene.Name']
My code isn't working, and after reading a lot of different questions and answers I'm even more confused. Any help would be great.
I'd like my final data to look like this:
Gene.name Mutation Domain
DKK1 21 low_complexity_region
ABCF1 335 AAA
F2 499 coiled_coil_region
Recommended answer
A merge and subset should get you there (though I think your intended result doesn't match your description of what you want):
result <- merge(d,m,by="Gene.name")
result[with(result,Mutation >= Start & Mutation <= End),]
# Gene.name Domain Start End Mutation
#1 ABCF1 low_complexity_region 2 13 10
#4 ABCF1 AAA 328 532 335
#6 F2 coiled_coil_region 499 558 499
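For anyone who wants to try this without the input files, here is a minimal self-contained sketch of the same merge-and-subset approach. It assumes the two tables are built inline with data.frame() from the values shown in the question; the names in_range and final are illustrative and not part of the original answer. It also trims the result to the three columns the question asks for.

```r
# Inline copies of the example tables from the question
d <- data.frame(
  Gene.name = c("ABCF1", "DKK1", "ABCF1", "F2"),
  Domain    = c("low_complexity_region", "low_complexity_region",
                "AAA", "coiled_coil_region"),
  Start     = c(2, 25, 328, 499),
  End       = c(13, 39, 532, 558),
  stringsAsFactors = FALSE
)
m <- data.frame(
  Gene.name = c("ABCF1", "DKK1", "ABCF1", "xyz", "F2"),
  Mutation  = c(10, 21, 335, 15, 499),
  stringsAsFactors = FALSE
)

# Pair every mutation with every domain of the same gene.
# merge() is an inner join by default, so "xyz" (no domain) is dropped.
result <- merge(d, m, by = "Gene.name")

# Keep only pairs where the mutation falls inside the domain, bounds included
in_range <- result$Mutation >= result$Start & result$Mutation <= result$End

# Reduce to the columns requested in the question (names are illustrative)
final <- result[in_range, c("Gene.name", "Mutation", "Domain")]
rownames(final) <- NULL
final
```

With this data, DKK1 drops out because its mutation at 21 lies outside the 25 to 39 domain, while ABCF1 at 10 is kept because it lies inside 2 to 13. That is where the requested output and the written description disagree.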