Subset only those rows whose intervals do not fall within another data.frame


Question

How can I compare two data frames (test and control) of unequal length and remove rows from test when all three of the following hold: i) test$chr == control$chr, ii) test$start and test$end lie within the range of control$start and control$end, and iii) test$CNA and control$CNA are the same?

test =

    R_level  logp   chr  start  end   CNA   Gene
    2        7.079  11   1159   1360  gain  Recl,Bcl
    11       2.4    12   6335   6345  loss  Pekg
    3        19     13   7180   7229  loss  Sox1

control =

    R_level  logp   chr  start  end   CNA   Gene
    2        5.9    11   1100   1400  gain  Recl,Bcl
    2        3.46   11   1002   1345  gain  Trp1
    2        6.4    12   6705   6845  gain  Pekg
    4        7      13   6480   8129  loss  Sox1

The result should look something like this:

result =

    R_level  logp   chr  start  end   CNA   Gene
    11       2.4    12   6335   6345  loss  Pekg

Solution

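Here's one way, using foverlaps() from data.table.

library(data.table) # needs v1.9.4+
dt1 <- as.data.table(test)
dt2 <- as.data.table(control)
# foverlaps() needs the second table keyed, with the interval columns (start, end) last
setkey(dt2, chr, CNA, start, end)

# type="within" keeps pairs where the test interval lies inside a control interval
# with the same chr and CNA; which=TRUE returns row indices instead of joined rows
olaps <- foverlaps(dt1, dt2, nomatch=0L, which=TRUE, type="within")
#    xid yid
# 1:   1   2
# 2:   3   4

# xid holds the rows of dt1 that matched, so drop them
dt1[!olaps$xid]
#    R_level logp chr start  end  CNA Gene
# 1:      11  2.4  12  6335 6345 loss Pekg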

Read ?foverlaps and see the examples section for more info.
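
As a cross-check on what the overlap join is doing, the three criteria can also be written out directly in base R. This is only a minimal sketch, not part of the original answer: the data frames are typed in by hand from the tables in the question, and is_covered is a name made up for illustration. It compares every test row against every control row, so it is only sensible for small inputs.

```r
# Example data copied by hand from the tables in the question.
test <- data.frame(
  R_level = c(2, 11, 3),
  logp    = c(7.079, 2.4, 19),
  chr     = c(11, 12, 13),
  start   = c(1159, 6335, 7180),
  end     = c(1360, 6345, 7229),
  CNA     = c("gain", "loss", "loss"),
  Gene    = c("Recl,Bcl", "Pekg", "Sox1"),
  stringsAsFactors = FALSE
)
control <- data.frame(
  R_level = c(2, 2, 2, 4),
  logp    = c(5.9, 3.46, 6.4, 7),
  chr     = c(11, 11, 12, 13),
  start   = c(1100, 1002, 6705, 6480),
  end     = c(1400, 1345, 6845, 8129),
  CNA     = c("gain", "gain", "gain", "loss"),
  Gene    = c("Recl,Bcl", "Trp1", "Pekg", "Sox1"),
  stringsAsFactors = FALSE
)

# is_covered (hypothetical name): TRUE for a test row when at least one
# control row has the same chr, the same CNA, and an interval that
# fully contains the test interval.
is_covered <- vapply(seq_len(nrow(test)), function(i) {
  any(control$chr == test$chr[i] &
      control$CNA == test$CNA[i] &
      control$start <= test$start[i] &
      control$end >= test$end[i])
}, logical(1))

# Keep only the test rows that no control row covers.
result <- test[!is_covered, ]
print(result)
```

Because the filter uses a logical vector rather than negative indices, it also behaves sensibly when no test row is covered: every row is kept.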

Alternatively, you can use the GenomicRanges package. However, you might have to filter on CNA yourself after finding the overlapping regions (AFAICT).
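
A rough sketch of how the GenomicRanges route might look is given here. It is an illustration under assumptions rather than the answerer's code: it needs the Bioconductor packages GenomicRanges and IRanges installed, the data frames are typed in by hand from the question, and the names gr_test, gr_ctrl, same_cna and drop are invented for the example. The range objects carry only chr, start and end, so the CNA comparison is done afterwards on the hit pairs.

```r
library(GenomicRanges)  # Bioconductor; also provides IRanges()

# Example data copied by hand from the tables in the question.
test <- data.frame(
  R_level = c(2, 11, 3),
  logp    = c(7.079, 2.4, 19),
  chr     = c(11, 12, 13),
  start   = c(1159, 6335, 7180),
  end     = c(1360, 6345, 7229),
  CNA     = c("gain", "loss", "loss"),
  Gene    = c("Recl,Bcl", "Pekg", "Sox1"),
  stringsAsFactors = FALSE
)
control <- data.frame(
  R_level = c(2, 2, 2, 4),
  logp    = c(5.9, 3.46, 6.4, 7),
  chr     = c(11, 11, 12, 13),
  start   = c(1100, 1002, 6705, 6480),
  end     = c(1400, 1345, 6845, 8129),
  CNA     = c("gain", "gain", "gain", "loss"),
  Gene    = c("Recl,Bcl", "Trp1", "Pekg", "Sox1"),
  stringsAsFactors = FALSE
)

# Build range objects from chr/start/end only.
gr_test <- GRanges(seqnames = as.character(test$chr),
                   ranges   = IRanges(start = test$start, end = test$end))
gr_ctrl <- GRanges(seqnames = as.character(control$chr),
                   ranges   = IRanges(start = control$start, end = control$end))

# type = "within": test range lies entirely inside a control range
# on the same sequence (chr).
hits <- findOverlaps(gr_test, gr_ctrl, type = "within")

# Filter the hit pairs on CNA, which the ranges know nothing about.
same_cna <- test$CNA[queryHits(hits)] == control$CNA[subjectHits(hits)]
drop <- unique(queryHits(hits)[same_cna])

# Use a logical mask so an empty `drop` keeps all rows.
result <- test[!(seq_len(nrow(test)) %in% drop), ]
print(result)
```

For data this small the extra dependency is hard to justify, but range objects of this kind are likely to be convenient if the intervals are already being handled with Bioconductor tools.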
