Looking for ranges in dataframe values
Problem description
I have 2 dataframes:
> access
V1 V2 V3
1 chr10 136122 136533
2 chr10 179432 179769
3 chr10 182988 183371
4 chr10 224234 224489
5 chr10 237693 237958
and
> peaks
V1 V2 V3
1 chr10 126122 126533
2 chr10 179450 179730
3 chr10 182788 183350
4 chr10 224244 224500
5 chr10 237695 237950
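Columns V2 and V3 are the start and end of regions (ranges) in both dataframes. I want to keep those rows in the peaks dataframe for which access$V1 == peaks$V1 and which fall in the ranges (regions) of the access dataframe. For example, the new dataframe will look like this:
- The 1st row of peaks has no matching region in the access dataframe, so it is assigned category U.
- The 2nd row of peaks falls within the range in the 2nd row of access, so it is assigned category B.
- The 3rd row of peaks does not fall completely within a region, but it overlaps the region in the 3rd row of access, so it is assigned category A.
- The 4th row of peaks also does not fall completely within a region, since it ends 11 positions after the end of the region in row 4 of access; it is also category A.
- The 5th row of peaks falls within the region, so it is category B.
Expected output: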
> newdf
V1 V2 V3 V4
1 chr10 126122 126533 U
2 chr10 179450 179730 B
3 chr10 182788 183350 A
4 chr10 224244 224500 A
5 chr10 237695 237950 B
Here is the dput of the input dataframes:
> dput(peaks)
structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "chr10", class = "factor"),
V2 = c(126122L, 179450L, 182788L, 224244L, 237695L), V3 = c(126533L,
179730L, 183350L, 224500L, 237950L)), .Names = c("V1", "V2",
"V3"), class = "data.frame", row.names = c(NA, -5L))
> dput(access)
structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "chr10", class = "factor"),
V2 = c(136122L, 179432L, 182988L, 224234L, 237693L), V3 = c(136533L,
179769L, 183371L, 224489L, 237958L)), .Names = c("V1", "V2",
"V3"), class = "data.frame", row.names = c(NA, -5L))
Edit
My new access df looks like this, and now I also want to append its last column to my final output df:
> access
V1 V2 V3 V4
1 chr10 136122 136533 found
2 chr10 179432 179769 notFound
3 chr10 182988 183371 found
4 chr10 224234 224489 found
5 chr10 237693 237958 notFound
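So now there is one extra condition: if a row in access falls in a peaks range, then its value in V4 is also appended as a new column in the final df; if no region is found, the value defaults to notFound. Therefore, the final output will be: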
> newdf
V1 V2 V3 V4 V5
1 chr10 126122 126533 U notFound
2 chr10 179450 179730 B notFound
3 chr10 182788 183350 A found
4 chr10 224244 224500 A found
5 chr10 237695 237950 B notFound
Here, in row1$V5 the value is notFound because this region was not found; in the remaining cases the values in V5 come from the modified access df.
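Here's another (straightforward) solution using the non-equi joins implemented recently and available in the current development version of data.table, v1.9.7 (see the data.table installation instructions):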
require(data.table) # v1.9.7+
setDT(access)
setDT(peaks)[, V4 := "U"] # no overlap
peaks[access, V4 := "A", on=.(V1, V2 <= V3, V3 >= V2)] # any overlap
peaks[access, V4 := "B", on=.(V1, V2 >= V2, V3 <= V3)] # completely within
#       V1     V2     V3 V4
# 1: chr10 126122 126533  U
# 2: chr10 179450 179730  B
# 3: chr10 182788 183350  A
# 4: chr10 224244 224500  A
# 5: chr10 237695 237950  B
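This adds a new column V4 to peaks, which is all "U". Then replace those rows where there's any kind of overlap with "A". That set also contains all rows that are completely "within". Then perform a conditional join once more, but this time only for completely within, and replace with "B".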
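The same three-step logic can also be sketched in base R, which may help if data.table is not available. This is only an illustrative sketch, not the answer's method: classify_peak() is a hypothetical helper, V1 is stored as character rather than factor for simplicity, and V5 is taken from the first overlapping access row, which is an assumption about how multiple overlaps should be resolved.

```r
# Input data from the question (access includes the V4 column from the Edit)
access <- data.frame(
  V1 = "chr10",
  V2 = c(136122L, 179432L, 182988L, 224234L, 237693L),
  V3 = c(136533L, 179769L, 183371L, 224489L, 237958L),
  V4 = c("found", "notFound", "found", "found", "notFound"),
  stringsAsFactors = FALSE
)
peaks <- data.frame(
  V1 = "chr10",
  V2 = c(126122L, 179450L, 182788L, 224244L, 237695L),
  V3 = c(126533L, 179730L, 183350L, 224500L, 237950L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: classify one peak against all access regions.
# Returns the category (U/A/B) and the access label (default "notFound").
classify_peak <- function(chr, start, end, regions) {
  same_chr <- regions$V1 == chr
  overlaps <- same_chr & start <= regions$V3 & end >= regions$V2  # any overlap
  within   <- same_chr & start >= regions$V2 & end <= regions$V3  # fully inside
  category <- if (any(within)) "B" else if (any(overlaps)) "A" else "U"
  # Assumption: use the label of the first overlapping access row
  label <- if (any(overlaps)) regions$V4[which(overlaps)[1]] else "notFound"
  c(category, label)
}

# mapply() returns a 2 x nrow(peaks) character matrix here
res <- mapply(classify_peak, peaks$V1, peaks$V2, peaks$V3,
              MoreArgs = list(regions = access), USE.NAMES = FALSE)

newdf <- peaks
newdf$V4 <- res[1, ]
newdf$V5 <- res[2, ]
print(newdf)
```

This loops over peaks row by row, so it is fine for small inputs but will be much slower than a join-based approach on large tables.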
Note that the foverlaps() solution would work just fine as well (it also comes from the data.table package). But the new non-equi joins fit well with the [.data.table syntax, which allows aggregating, adding, or updating columns while joining.
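For comparison, a minimal sketch of how the foverlaps() route might look is given below. It assumes the data.table package is installed, stores V1 as character rather than factor, and uses the variable names any_hit and within_hit purely for illustration; it is not taken from the original answer.

```r
library(data.table)

access <- data.table(
  V1 = "chr10",
  V2 = c(136122L, 179432L, 182988L, 224234L, 237693L),
  V3 = c(136533L, 179769L, 183371L, 224489L, 237958L)
)
peaks <- data.table(
  V1 = "chr10",
  V2 = c(126122L, 179450L, 182788L, 224244L, 237695L),
  V3 = c(126533L, 179730L, 183350L, 224500L, 237950L)
)

# foverlaps() requires the second table to be keyed; the last two key
# columns are treated as the interval start and end.
setkey(access, V1, V2, V3)

# With which = TRUE and mult = "first", the result is one index per row of
# peaks: the first matching row of access, or NA when nothing matches.
any_hit <- foverlaps(peaks, access, by.x = c("V1", "V2", "V3"),
                     type = "any", mult = "first", which = TRUE)
within_hit <- foverlaps(peaks, access, by.x = c("V1", "V2", "V3"),
                        type = "within", mult = "first", which = TRUE)

# Fully inside takes priority over partial overlap; no match gives "U"
peaks[, V4 := ifelse(!is.na(within_hit), "B",
              ifelse(!is.na(any_hit), "A", "U"))]
print(peaks)
```

Compared with the non-equi join version, this needs two separate overlap queries plus an explicit update step, which illustrates why updating columns while joining is convenient.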