Looking for ranges in dataframe values


Problem description

I have 2 dataframes:

> access
     V1     V2     V3
1 chr10 136122 136533
2 chr10 179432 179769
3 chr10 182988 183371
4 chr10 224234 224489
5 chr10 237693 237958

and

> peaks
     V1     V2     V3
1 chr10 126122 126533
2 chr10 179450 179730
3 chr10 182788 183350
4 chr10 224244 224500
5 chr10 237695 237950

Columns V2 and V3 are the start and end of regions (ranges) in both dataframes. I want to compare the rows of the peaks dataframe against the access dataframe, where access$V1 == peaks$V1, and categorize each peak by whether it falls in a range (region) of the access dataframe. For the peaks dataframe:

  • The region in the 1st row doesn't exist in the access dataframe, so it will be assigned category U.

  • The 2nd row of peaks falls within a range in the access dataframe (2nd row), so it will be assigned category B.

  • The 3rd row of peaks doesn't fall completely within a region, but it partly overlaps the region in the 3rd row of access, so it will be assigned category A.

  • The 4th row of peaks also doesn't fall completely within a region, as it ends 11 positions after the end of the region in row 4 of access, so it will also be in category A.

  • The 5th row falls within a region, so it will be in category B.
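The category rules in the list above come down to two interval tests: "shares at least one position" and "lies entirely inside". Here is a minimal base R sketch of those tests. The helper `classify_peak` and the vector names are hypothetical and exist only for this illustration. It assumes closed intervals and checks each peak against every access region on the same chromosome.

```r
# Hypothetical helper: classify one peak against all access regions.
classify_peak <- function(chr, start, end, acc_chr, acc_start, acc_end) {
  same_chr <- acc_chr == chr
  overlaps <- same_chr & start <= acc_end & end >= acc_start  # any shared position
  within   <- same_chr & start >= acc_start & end <= acc_end  # entirely inside
  if (any(within)) "B" else if (any(overlaps)) "A" else "U"
}

# Data copied from the question
acc_chr   <- rep("chr10", 5)
acc_start <- c(136122, 179432, 182988, 224234, 237693)
acc_end   <- c(136533, 179769, 183371, 224489, 237958)
pk_chr    <- rep("chr10", 5)
pk_start  <- c(126122, 179450, 182788, 224244, 237695)
pk_end    <- c(126533, 179730, 183350, 224500, 237950)

category <- mapply(
  classify_peak, pk_chr, pk_start, pk_end,
  MoreArgs = list(acc_chr = acc_chr, acc_start = acc_start, acc_end = acc_end),
  USE.NAMES = FALSE
)
print(category)
```

This compares every peak with every access region, which is fine for five rows but slow for large inputs.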

Expected output:

> newdf   
     V1     V2     V3 V4
1 chr10 126122 126533  U
2 chr10 179450 179730  B
3 chr10 182788 183350  A
4 chr10 224244 224500  A
5 chr10 237695 237950  B

Here is the dput output of the input dataframes:

> dput(peaks)
structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "chr10", class = "factor"), 
    V2 = c(126122L, 179450L, 182788L, 224244L, 237695L), V3 = c(126533L, 
    179730L, 183350L, 224500L, 237950L)), .Names = c("V1", "V2", 
"V3"), class = "data.frame", row.names = c(NA, -5L))

> dput(access)
structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "chr10", class = "factor"), 
    V2 = c(136122L, 179432L, 182988L, 224234L, 237693L), V3 = c(136533L, 
    179769L, 183371L, 224489L, 237958L)), .Names = c("V1", "V2", 
"V3"), class = "data.frame", row.names = c(NA, -5L))

Edit:

My new access df looks like this, and I now also want to append its last column to my final output df:

> access
     V1     V2     V3  V4
1 chr10 136122 136533  found
2 chr10 179432 179769  notFound
3 chr10 182988 183371  found
4 chr10 224234 224489  found
5 chr10 237693 237958  notFound

So now there is one extra condition: if a peak overlaps a region in access, the V4 value of that access row should also be appended as a new column in the final df. If no region is found, the value defaults to notFound. The final output will therefore be:

> newdf   
     V1     V2     V3 V4 V5
1 chr10 126122 126533  U notFound
2 chr10 179450 179730  B notFound
3 chr10 182788 183350  A found
4 chr10 224244 224500  A found
5 chr10 237695 237950  B notFound

Here the value of V5 in row 1 is notFound because no matching region was found. In the remaining rows, the V5 values come from the modified access df.

Solution

Here's another (straightforward) solution using the recently implemented non-equi joins, available in the current development version of data.table, v1.9.7. See the data.table installation instructions.

library(data.table) # v1.9.7+
setDT(access)
setDT(peaks)[, V4 := "U"]                              # no overlap
peaks[access, V4 := "A", on=.(V1, V2 <= V3, V3 >= V2)] # any overlap
peaks[access, V4 := "B", on=.(V1, V2 >= V2, V3 <= V3)] # completely within
#       V1     V2     V3 V4
# 1: chr10 126122 126533  U
# 2: chr10 179450 179730  B
# 3: chr10 182788 183350  A
# 4: chr10 224244 224500  A
# 5: chr10 237695 237950  B

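First, add a new column V4 to peaks, set to "U" for every row. Then replace the value with "A" in the rows that have any kind of overlap; this also marks the rows that are completely within a region. Finally, perform another conditional join, this time matching only the rows that are completely within a region, and replace their value with "B". In each `on=` condition, the left-hand column comes from peaks and the right-hand column from access.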
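The same update-join pattern might also cover the V5 column requested in the question's edit, though the answer above does not address it. The sketch below is an extension under stated assumptions: V5 starts as "notFound", and the "any overlap" condition decides which access row supplies its V4 value. The `i.` prefix refers to a column of the table in the `i` position (here `access`). If a peak overlapped several access regions, the last match would win.

```r
library(data.table)

# Data copied from the question (access includes the new V4 column)
peaks <- data.table(
  V1 = "chr10",
  V2 = c(126122L, 179450L, 182788L, 224244L, 237695L),
  V3 = c(126533L, 179730L, 183350L, 224500L, 237950L)
)
access <- data.table(
  V1 = "chr10",
  V2 = c(136122L, 179432L, 182988L, 224234L, 237693L),
  V3 = c(136533L, 179769L, 183371L, 224489L, 237958L),
  V4 = c("found", "notFound", "found", "found", "notFound")
)

peaks[, V5 := "notFound"]                                 # default when no region overlaps
peaks[access, V5 := i.V4, on = .(V1, V2 <= V3, V3 >= V2)] # copy V4 from the overlapping access row
print(peaks)
```

For the sample data this yields the V5 values shown in the question's expected output.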


Note that a foverlaps() solution would work just as well (it also comes from the data.table package). However, the new non-equi joins fit well with the [.data.table syntax, which allows aggregating, adding, or updating columns while joining.
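For comparison, here is a rough sketch of what the foverlaps() route could look like. It is one possible arrangement, not code from the original answer. It assumes `access` is keyed on (V1, V2, V3), which foverlaps() requires of its second argument, and it uses `which = TRUE` to get the indices of matching peaks rows.

```r
library(data.table)

peaks <- data.table(
  V1 = "chr10",
  V2 = c(126122L, 179450L, 182788L, 224244L, 237695L),
  V3 = c(126533L, 179730L, 183350L, 224500L, 237950L)
)
access <- data.table(
  V1 = "chr10",
  V2 = c(136122L, 179432L, 182988L, 224234L, 237693L),
  V3 = c(136533L, 179769L, 183371L, 224489L, 237958L)
)

# foverlaps() needs the second table keyed; the last two key columns are the interval
setkey(access, V1, V2, V3)
cols <- c("V1", "V2", "V3")

# Row indices of peaks that overlap / lie within some access region
any_idx    <- foverlaps(peaks, access, by.x = cols, type = "any",
                        which = TRUE, nomatch = 0L)$xid
within_idx <- foverlaps(peaks, access, by.x = cols, type = "within",
                        which = TRUE, nomatch = 0L)$xid

peaks[, V4 := "U"]           # no overlap
peaks[any_idx, V4 := "A"]    # any overlap
peaks[within_idx, V4 := "B"] # completely within
print(peaks)
```

Unlike the update joins above, this computes the index vectors first and assigns in separate steps.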
