两个数据框，如果值在第二个 df 的范围内，则从一个 df 连接 ID 和位置 [英] Two data frames, join ID's and position from one df if value is in the range of 2nd df

查看：16 发布时间：2021/11/16 23:20:13 r dataframe dplyr apply

本文介绍了两个数据框，如果值在第二个 df 的范围内，则从一个 df 连接 ID 和位置的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

从昨天开始我一直在努力解决我的问题......我认为，最好的解释是当我向您展示示例时，所以我有两个 df 的 x 和 y，它们看起来像这样:

Since yesterday I've been trying to solve my problem... I think, the best explanation will be when I show you the example, so I have two df's x and y and they look like this:

ID_x <- c('x1','x2','x3','x4','x5','x6','x7','x8','x9','x10')
chr_x <- c('chr1','chr1','chr1','chr1','chr1','chr1','chr1','chr1','chrX','chrY')
start <- c(1,2,3,4,50,60,70,80,9,100)
stop <- c(10,20,30,40,500,600,700,800,95,1000)
x <- data.frame(ID_x,chr_x,start,stop)

> x
   ID_x chr_x start stop
1    x1  chr1     1   10
2    x2  chr1     2   20
3    x3  chr1     3   30
4    x4  chr1     4   40
5    x5  chr1    50  500
6    x6  chr1    60  600
7    x7  chr1    70  700
8    x8  chr1    80  800
9    x9  chrX     9   90
10  x10  chrY   100 1000

第二个:

ID_y <- c('y1','y2','y3','y4','y5','y6','y7','y8','y9','y10')
chr_y <- c('chr1','chr1','chr1','chr1','chr2','chr2','chr3','chr3','chrX','chrY')
pos_y <- c(8,9,15,31,41,2,81,8,91,8)
y <- data.frame(ID_y,chr_y,pos_y)

> y
   ID_y chr_y pos_y
1    y1  chr1     8
2    y2  chr1     9
3    y3  chr1    15
4    y4  chr1    31
5    y5  chr2    41
6    y6  chr2     2
7    y7  chr3    81
8    y8  chr3     8
9    y9  chrX    91
10  y10  chrY     8

现在我想根据范围(开始、停止)和 chr 将行从第二个 df (y) 映射到第一个 df (x).我想列出从 y df (如果它们满足条件)到 df x 的每一行的所有 id.没有这样的 id_y，然后不适用.所以预期的结果:

Now I'd like to map rows from 2nd df (y) to first df (x) based on range (start, stop) and chr. I'd like to list all id's from y df (if they fulfill the criteria) to each row of df x. Of there's no such id_y, then NA. So the expectd result:

    ID_x chr_x start stop ID_y          pos_y
1    x1  chr1     1   10  y1,y2         8,9
2    x2  chr1     2   20  y1,y2,y3      8,9,15
3    x3  chr1     3   30  y1,y2,y3,y4   8,9,15,31
4    x4  chr1     4   40  y1,y2,y3,y4   8,9,15,31
5    x5  chr1    50  500  NA            NA
6    x6  chr1    60  600  NA            NA
7    x7  chr1    70  700  NA            NA
8    x8  chr1    80  800  NA            NA
9    x9  chrX     9   95  y9            91  
10  x10  chrY   100 1000  NA            NA

乍一看我以为是小菜一碟，但是当我试图解决这个问题时...我尝试了几种方法...嵌套 apply() 和 ifelse() 给了我一个巨大的矩阵，里面有很多 NA ......我发现了类似的问题和解决方案这里但我的原始数据集很大，df x 有 134748 行，df y 有 865859 行.我遇到了内存分配错误，即使我试图用一条染色体来做到这一点.

At first glance I thought that it's piece of cake, but when i tried to solve this... I tried several approaches... Nested apply() with ifelse() was giving me a huge matrix with a lots of NA's... I found similar problem and solution here but my original data set is huge, df x has 134748 rows and df y has 865859 rows. I had the memory allocation error, even when I tried to do this with one chromosome.

你能给我一个提示如何做到这一点吗?

Could you please give me a hint how to do this?

谢谢，亚当

推荐答案

使用 data.table 连接应该有更有效的解决方案，但这里是使用 的一种方法dplyr.我们首先通过"chr"列对x和y进行left_join，然后filter 位于 start 和 stop 之间的行，为每个 chr_x、start 和stop 最后 right_join 与 x 以使用 NA 获得缺失的范围.

There should be a more efficient solution to this using data.table joins but here is one way to do it using dplyr. We first do a left_join of x and y by "chr" columns, then filter rows which is between start and stop, create a comma-separated string for each chr_x, start and stop and finally right_join with x to get missing range with NA.

library(dplyr)

left_join(x, y, by = c("chr_x" = "chr_y")) %>%
   filter(data.table::between(pos_y, start, stop)) %>%
   group_by(chr_x, start, stop) %>%
   summarise(ID_y = toString(ID_y), pos_y = toString(pos_y)) %>%
   right_join(x)


#   chr_x start  stop ID_y           pos_y        ID_x 
#   <chr> <dbl> <dbl> <chr>          <chr>        <fct>
# 1 chr1      1    10 y1, y2         8, 9         x1   
# 2 chr1      2    20 y1, y2, y3     8, 9, 15     x2   
# 3 chr1      3    30 y1, y2, y3     8, 9, 15     x3   
# 4 chr1      4    40 y1, y2, y3, y4 8, 9, 15, 31 x4   
# 5 chr1     50   500 NA             NA           x5   
# 6 chr1     60   600 NA             NA           x6   
# 7 chr1     70   700 NA             NA           x7   
# 8 chr1     80   800 NA             NA           x8   
# 9 chrX      9    95 y9             91           x9   
#10 chrY    100  1000 NA             NA           x10

这篇关于两个数据框，如果值在第二个 df 的范围内，则从一个 df 连接 ID 和位置的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

两个数据框，如果值在第二个 df 的范围内，则从一个 df 连接 ID 和位置 [英] Two data frames, join ID's and position from one df if value is in the range of 2nd df

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

两个数据框，如果值在第二个 df 的范围内，则从一个 df 连接 ID 和位置 [英] Two data frames, join ID&#39;s and position from one df if value is in the range of 2nd df

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

两个数据框，如果值在第二个 df 的范围内，则从一个 df 连接 ID 和位置 [英] Two data frames, join ID's and position from one df if value is in the range of 2nd df

登录关闭