Merging two data frames, both with coordinates, based on the closest location


Problem description

I have one large data frame (~130000 rows) containing local variables and another large data frame (~7000 rows) containing the density of a species. Both have x and y coordinates, but these coordinates don't always match, e.g.:

df1 <- data.frame(X = c(2,4,1,2,5), Y = c(6,7,8,9,8), V1 = c("A", "B", "C", "D", "E"), V2 = c("G", "H", "I", "J", "K"))

And:

df2 <- data.frame(X = c(2,4,6), Y = c(5,9,7), Dens = c(12, 17, 10))

I would like to add a column to df1 containing the density (Dens) from df2 if there is a point reasonably close by. If there is no point close by, I would like it to show up as NA, e.g.:

X Y   V1   V2    Dens
2 6   A    G      12
4 7   B    H      NA     
1 8   C    I      17
2 9   D    J      NA
5 8   E    K      10

Solution

First, let's write a function that finds the closest point in df2 for a single row of df1. Here I'm using the squared Euclidean distance, i.e. (x1 - x2)^2 + (y1 - y2)^2, which ranks points in the same order as the true distance. If you have lat/lon you might want to change it to a more suitable formula:

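mydist <- function(row){
  # squared distance from this df1 row to every point in df2
  dists <- (row[["X"]] - df2$X)^2 + (row[["Y"]] - df2$Y)^2
  # nearest df2 row (the first one on ties) plus its squared distance
  return(cbind(df2[which.min(dists),], distance = min(dists)))
}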
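For lat/lon data, one common choice is the haversine great-circle distance. The following is only a sketch under stated assumptions: X is taken to be longitude and Y latitude, both in decimal degrees, and the Earth is treated as a sphere of radius 6371 km. The names haversine_km and mydist_geo are hypothetical helpers, not part of the original answer.

```r
# Hypothetical helper: great-circle distance in kilometres (haversine formula).
# Assumes decimal degrees and a spherical Earth with radius 6371 km.
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Same shape as mydist(), but the reference data frame is passed in
# explicitly and the distance column is in kilometres, not squared units.
mydist_geo <- function(row, ref) {
  dists <- haversine_km(row[["X"]], row[["Y"]], ref$X, ref$Y)
  cbind(ref[which.min(dists), ], distance = min(dists))
}
```

With this variant, any threshold applied later would be expressed in kilometres rather than in squared coordinate units.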

Once you have this, you just need to apply it to each row of df1 with lapply and bind the result to df1:

z <- cbind(df1, do.call(rbind, lapply(1:nrow(df1), function(x) mydist(df1[x,])))) 

For your test data, the output looks like:

   X Y V1 V2 X Y Dens distance
1  2 6  A  G 2 5   12        1
2  4 7  B  H 4 9   17        4
3  1 8  C  I 2 5   12       10
21 2 9  D  J 4 9   17        4
22 5 8  E  K 4 9   17        2

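You can then do something like this to set Dens to NA wherever the nearest point is farther away than your threshold (the distance column holds squared distances, so the threshold is in squared units):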

threshold <- 5
z$Dens[z$distance > threshold] <- NA

   X Y V1 V2 X Y Dens distance
1  2 6  A  G 2 5   12        1
2  4 7  B  H 4 9   17        4
3  1 8  C  I 2 5   NA       10
21 2 9  D  J 4 9   17        4
22 5 8  E  K 4 9   17        2
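The version above builds one small data frame per row of df1 and binds them all together. A possible variant, sketched here on the question's test data, records only the index of the nearest df2 row and then indexes df2 once. It uses the same squared distance and the same threshold of 5 as above; the names nearest and sq_dist are introduced purely for illustration.

```r
df1 <- data.frame(X = c(2, 4, 1, 2, 5), Y = c(6, 7, 8, 9, 8),
                  V1 = c("A", "B", "C", "D", "E"),
                  V2 = c("G", "H", "I", "J", "K"))
df2 <- data.frame(X = c(2, 4, 6), Y = c(5, 9, 7), Dens = c(12, 17, 10))

# For each row of df1, the index of the nearest row of df2
# (squared Euclidean distance, ties resolved to the first match).
nearest <- vapply(seq_len(nrow(df1)), function(i) {
  which.min((df1$X[i] - df2$X)^2 + (df1$Y[i] - df2$Y)^2)
}, integer(1))

# Squared distance to that nearest point, computed in one vectorised step.
sq_dist <- (df1$X - df2$X[nearest])^2 + (df1$Y - df2$Y[nearest])^2

threshold <- 5  # squared-distance cutoff
df1$Dens <- ifelse(sq_dist > threshold, NA, df2$Dens[nearest])
```

This adds Dens directly to df1 and avoids the duplicated X and Y columns that cbind produces, at the cost of not keeping the matched df2 coordinates.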

Your actual data is very large (a simulated data set of the same size takes about 10 minutes on my computer). If possible, you should merge first, then run this only on the rows that are not exact matches (see dplyr::anti_join).
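A rough sketch of that merge-first idea, using base R merge(..., all.x = TRUE) rather than dplyr::anti_join so that it runs without extra packages. Several things here are assumptions: the row (4, 7, Dens = 20) in df2 is invented so that an exact match exists, df2 is assumed to have no duplicated coordinates and no missing Dens values, and exact matching on coordinates is assumed to be reliable (it can fail for floating-point coordinates that differ by rounding).

```r
df1 <- data.frame(X = c(2, 4, 1, 2, 5), Y = c(6, 7, 8, 9, 8),
                  V1 = c("A", "B", "C", "D", "E"),
                  V2 = c("G", "H", "I", "J", "K"))
# df2 from the question plus one invented row (4, 7) for illustration.
df2 <- data.frame(X = c(2, 4, 6, 4), Y = c(5, 9, 7, 7),
                  Dens = c(12, 17, 10, 20))

# Left join on exact coordinates; rows of df1 without a match get Dens = NA.
# merge() sorts by the key columns, so the row order may differ from df1.
merged <- merge(df1, df2, by = c("X", "Y"), all.x = TRUE)

# Run the nearest-point search only on the rows that are still unmatched.
todo <- which(is.na(merged$Dens))
threshold <- 5  # squared-distance cutoff
for (i in todo) {
  d <- (merged$X[i] - df2$X)^2 + (merged$Y[i] - df2$Y)^2
  if (min(d) <= threshold) merged$Dens[i] <- df2$Dens[which.min(d)]
}
```

How much time this saves depends on how many rows of the real data match exactly; if few do, most rows still go through the slower search.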
