Non-equi joins - comparing two data frames in R
Problem description
I would like to filter a data frame based on the values present in a second data frame.
For example, I want to match the rows of the first data frame whose "BP" value is greater than the "start_pos" value and smaller than the "end_pos" value of a row in the second data frame, or just smaller than its "end_pos" value.
I need to repeat this procedure for every row of the second data frame. Currently I do this with a for loop, but I would like to do it in a single command.
Data frame 1
CHR BP
29 836019
29 4417047
29 7589996
29 11052921
29 14009294
29 33174196
Data frame 2
start_pos end_pos gene_id
19774 19899 ENSBTAG00000046619
34627 35558 ENSBTAG00000006858
69695 71121 ENSBTAG00000039257
83323 84281 ENSBTAG00000035349
124849 179713 ENSBTAG00000001753
264298 264843 ENSBTAG00000005540
# Assumes `interval` and an initially empty `output_genes` are defined before the loop.
for (j in seq_len(nrow(tmp_markers))) {
  temp_out_markers <- tmp_markers[j, ]
  bp <- tmp_markers[j, "BP"]
  start <- tmp_gene[, "start_pos"]
  end <- tmp_gene[, "end_pos"]
  # Keep genes that contain the marker or either edge of its window,
  # or that lie entirely between the marker and a window edge.
  tmp_search <- tmp_gene[which(
    (bp >= start & bp <= end) |
    (bp + interval >= start & bp + interval <= end) |
    (bp + interval >= start & bp + interval >= end & bp <= start) |
    (bp - interval <= end & bp - interval >= start) |
    (bp - interval <= end & bp - interval <= start & bp >= end)
  ), ]
  if (nrow(tmp_search) > 0) {
    # Repeat the single marker row once per matching gene.
    temp_out <- cbind(temp_out_markers[rep(1, nrow(tmp_search)), ], tmp_search)
    temp_out[, "Distance_from_gene_start"] <- temp_out[, "BP"] - temp_out[, "start_pos"]
    temp_out[, "Distance_from_gene_end"] <- temp_out[, "BP"] - temp_out[, "end_pos"]
    output_genes <- rbind(temp_out, output_genes)
  }
}
In the end, I want a data frame containing all the rows that fall within the tested intervals.
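The loop's conditions amount to asking whether a gene overlaps the window [BP - interval, BP + interval], so one possible single-command form is a data.table non-equi join. The following is only a sketch under assumptions: the value of `interval` is made up for illustration (the question does not state it), and the `win_start` / `win_end` column names are introduced here and are not part of the original data.

```r
library(data.table)

# Sample data copied from the question.
tmp_markers <- data.table(
  CHR = 29L,
  BP  = c(836019L, 4417047L, 7589996L, 11052921L, 14009294L, 33174196L)
)
tmp_gene <- data.table(
  start_pos = c(19774L, 34627L, 69695L, 83323L, 124849L, 264298L),
  end_pos   = c(19899L, 35558L, 71121L, 84281L, 179713L, 264843L),
  gene_id   = c("ENSBTAG00000046619", "ENSBTAG00000006858",
                "ENSBTAG00000039257", "ENSBTAG00000035349",
                "ENSBTAG00000001753", "ENSBTAG00000005540")
)

interval <- 700000L  # hypothetical window size, not given in the question

# Window around each marker (column names are assumptions of this sketch).
tmp_markers[, `:=`(win_start = BP - interval, win_end = BP + interval)]

# A gene overlaps a window when start_pos <= win_end and end_pos >= win_start.
# The x. prefix retrieves the gene's own coordinates, because the join
# columns in the result otherwise carry the values from tmp_markers.
hits <- tmp_gene[tmp_markers,
  on = .(start_pos <= win_end, end_pos >= win_start),
  nomatch = NULL,
  .(CHR, BP, gene_id, start_pos = x.start_pos, end_pos = x.end_pos)
]

hits[, `:=`(
  Distance_from_gene_start = BP - start_pos,
  Distance_from_gene_end   = BP - end_pos
)]
```

With this made-up window size, only the first marker (BP 836019) is close enough to any gene in the sample, and it matches the last two genes in the table.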
Recommended answer
Thank you very much!
I ended up with this solution, and it works very well.
foverlaps(tmp_gene, tmp_markers,
          by.x = c("start_pos", "end_pos"),
          by.y = key(tmp_markers),
          nomatch = 0)
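The call above relies on setup the answer does not show: `foverlaps()` needs both inputs to be data.tables, and `tmp_markers` must be keyed on two columns describing an interval. A minimal sketch of what that setup might look like follows; the `interval` value and the `win_start` / `win_end` column names are assumptions made for illustration, not taken from the answer.

```r
library(data.table)

# Sample data copied from the question.
tmp_markers <- data.table(
  CHR = 29L,
  BP  = c(836019L, 4417047L, 7589996L, 11052921L, 14009294L, 33174196L)
)
tmp_gene <- data.table(
  start_pos = c(19774L, 34627L, 69695L, 83323L, 124849L, 264298L),
  end_pos   = c(19899L, 35558L, 71121L, 84281L, 179713L, 264843L),
  gene_id   = c("ENSBTAG00000046619", "ENSBTAG00000006858",
                "ENSBTAG00000039257", "ENSBTAG00000035349",
                "ENSBTAG00000001753", "ENSBTAG00000005540")
)

interval <- 700000L  # hypothetical window size, not given in the source

# Turn each marker position into an interval (assumed column names).
tmp_markers[, `:=`(win_start = BP - interval, win_end = BP + interval)]

# foverlaps() requires y to be keyed; the last two key columns are the interval.
setkey(tmp_markers, win_start, win_end)

# Default type = "any" keeps every gene that overlaps a marker window.
overlaps <- foverlaps(tmp_gene, tmp_markers,
                      by.x = c("start_pos", "end_pos"),
                      by.y = key(tmp_markers),
                      nomatch = NULL)

overlaps[, `:=`(
  Distance_from_gene_start = BP - start_pos,
  Distance_from_gene_end   = BP - end_pos
)]
```

The sample gene table has no chromosome column, so this sketch matches on position only. If both tables carried a chromosome column, it could be placed ahead of the interval columns in the key and in `by.x` to restrict matches to the same chromosome.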
Cheers.