Non-equi joins - comparing two data frames in R

Problem description

I would like to filter a data frame based on the values present in a second data frame.

For example, I want to match the rows of the first data frame whose "BP" value is greater than a "start_pos" value and smaller than the corresponding "end_pos" value in the second data frame, or just smaller than "end_pos".

I need to repeat this procedure for every row of the second data frame. Currently I do this with a for loop, but I would like to do it in a single command.

Data frame 1

CHR       BP
 29   836019
 29  4417047
 29  7589996
 29 11052921
 29 14009294
 29 33174196

Data frame 2

start_pos end_pos            gene_id
    19774   19899 ENSBTAG00000046619
    34627   35558 ENSBTAG00000006858
    69695   71121 ENSBTAG00000039257
    83323   84281 ENSBTAG00000035349
   124849  179713 ENSBTAG00000001753
   264298  264843 ENSBTAG00000005540

# Assumes `interval` (window half-width in bp) is already defined and that
# `output_genes` starts as NULL or an empty data frame.
for (j in seq_len(nrow(tmp_markers))) {
  temp_out_markers <- tmp_markers[j, ]
  bp <- tmp_markers[j, "BP"]
  gene_start <- tmp_gene[, "start_pos"]
  gene_end <- tmp_gene[, "end_pos"]
  tmp_search <- tmp_gene[which(
    (bp >= gene_start & bp <= gene_end) |                                      # marker inside gene
    (bp + interval >= gene_start & bp + interval <= gene_end) |                # upper window edge inside gene
    (bp + interval >= gene_start & bp + interval >= gene_end & bp <= gene_start) |  # gene inside [bp, bp + interval]
    (bp - interval <= gene_end & bp - interval >= gene_start) |                # lower window edge inside gene
    (bp - interval <= gene_end & bp - interval <= gene_start & bp >= gene_end) # gene inside [bp - interval, bp]
  ), ]
  if (nrow(tmp_search) > 0) {
    # Repeat the single marker row once per matching gene
    temp_out <- cbind(temp_out_markers[rep(1, nrow(tmp_search)), ], tmp_search)
    temp_out[, "Distance_from_gene_start"] <- temp_out[, "BP"] - temp_out[, "start_pos"]
    temp_out[, "Distance_from_gene_end"] <- temp_out[, "BP"] - temp_out[, "end_pos"]
    output_genes <- rbind(temp_out, output_genes)
  }
}

In the end, I want a data frame containing all the rows that fall within the intervals I tested.
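
Read as a whole, the conditions in the loop appear to reduce to one test: the window [BP - interval, BP + interval] overlaps [start_pos, end_pos]. Under that reading, a data.table non-equi join can express the loop in a single call. The sketch below is only an illustration: the `interval` value of 600000 and the helper columns `win_start` and `win_end` are assumptions, not part of the original question.

```r
library(data.table)

# Sample data from the question
tmp_markers <- data.table(
  CHR = 29,
  BP  = c(836019, 4417047, 7589996, 11052921, 14009294, 33174196)
)
tmp_gene <- data.table(
  start_pos = c(19774, 34627, 69695, 83323, 124849, 264298),
  end_pos   = c(19899, 35558, 71121, 84281, 179713, 264843),
  gene_id   = c("ENSBTAG00000046619", "ENSBTAG00000006858", "ENSBTAG00000039257",
                "ENSBTAG00000035349", "ENSBTAG00000001753", "ENSBTAG00000005540")
)

interval <- 600000  # hypothetical window half-width, chosen for illustration

# Hypothetical helper columns holding the search window around each marker
tmp_markers[, `:=`(win_start = BP - interval, win_end = BP + interval)]

# Two intervals overlap when start_pos <= win_end and end_pos >= win_start
output_genes <- tmp_gene[
  tmp_markers,
  on = .(start_pos <= win_end, end_pos >= win_start),
  nomatch = NULL,
  .(CHR, BP, start_pos = x.start_pos, end_pos = x.end_pos, gene_id)
]

# Same distance columns as in the loop
output_genes[, Distance_from_gene_start := BP - start_pos]
output_genes[, Distance_from_gene_end := BP - end_pos]
```

In a non-equi join, the join columns of the result take their values from the marker table, so the `x.` prefix is used to keep the original gene coordinates.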

Recommended answer

Thank you very much!

I ended up with this solution, and it works very well.

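library(data.table)
# tmp_markers must be a keyed data.table whose last two key columns hold the interval start and end
foverlaps(tmp_gene, tmp_markers, by.x = c("start_pos", "end_pos"),
          by.y = key(tmp_markers), nomatch = 0)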
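
The answer does not show how `tmp_markers` was keyed. The following is a minimal sketch of one possible setup, using the sample data from the question; the window columns `win_start` and `win_end` and the `interval` value of 600000 are assumptions made for illustration only.

```r
library(data.table)

# Sample data from the question
tmp_markers <- data.table(
  CHR = 29,
  BP  = c(836019, 4417047, 7589996, 11052921, 14009294, 33174196)
)
tmp_gene <- data.table(
  start_pos = c(19774, 34627, 69695, 83323, 124849, 264298),
  end_pos   = c(19899, 35558, 71121, 84281, 179713, 264843),
  gene_id   = c("ENSBTAG00000046619", "ENSBTAG00000006858", "ENSBTAG00000039257",
                "ENSBTAG00000035349", "ENSBTAG00000001753", "ENSBTAG00000005540")
)

interval <- 600000  # hypothetical window half-width, chosen for illustration

# foverlaps() needs an interval (two columns) on both sides, so each marker
# gets a hypothetical window around its position
tmp_markers[, `:=`(win_start = BP - interval, win_end = BP + interval)]

# The second table must be keyed; the last two key columns are the interval
setkey(tmp_markers, win_start, win_end)

hits <- foverlaps(tmp_gene, tmp_markers,
                  by.x = c("start_pos", "end_pos"),
                  by.y = key(tmp_markers),
                  nomatch = 0)

# Distance columns as computed in the question's loop
hits[, Distance_from_gene_start := BP - start_pos]
hits[, Distance_from_gene_end := BP - end_pos]
```

With the default `type = "any"`, a gene is returned whenever any part of it falls inside a marker's window, which seems to match what the loop in the question tests.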

Cheers.
