Merging two data frames, both with coordinates based on the closest location
Problem description
I have one large data frame (~130000 rows) containing local variables and another large data frame (~7000 rows) containing the density of a species. Both have x and y coordinates, but these coordinates don't always match. For example:
df1 <- data.frame(X = c(2,4,1,2,5), Y = c(6,7,8,9,8), V1 = c("A", "B", "C", "D", "E"), V2 = c("G", "H", "I", "J", "K"))
And:
df2 <- data.frame(X = c(2,4,6), Y = c(5,9,7), Dens = c(12, 17, 10))
I would like to add a column to df1 containing the density (Dens) from df2 if there is a point reasonably close by. If there is no point close by, I would like it to show up as NA. For example:
X Y V1 V2 Dens
2 6 A G 12
4 7 B H NA
1 8 C I 17
2 9 D J NA
5 8 E K 10
First, let's write a function to find the closest point in df2 for a single row of df1. Here I'm using the squared Euclidean distance, i.e. (x1 - x2)^2 + (y1 - y2)^2. If you have lat/lon you might want to change it to a more suitable formula:
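mydist <- function(row){
  # squared distance from this df1 row to every point in df2
  dists <- (row[["X"]] - df2$X)^2 + (row[["Y"]] - df2$Y)^2
  # closest df2 row, plus its squared distance to the df1 row
  return(cbind(df2[which.min(dists),], distance = min(dists)))
}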
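For the lat/lon case, one common choice is the haversine (great-circle) distance. The following is only a sketch under stated assumptions: the function name haversine_km, its argument names, and the mean Earth radius of 6371 km are illustrative and do not come from the original answer. It is vectorised over the second point, so it could stand in for the squared-difference line inside mydist.

```r
# Great-circle distance in kilometres between one point (lat1, lon1) and
# one or more points (lat2, lon2), all given in decimal degrees.
# Hypothetical helper: the name and the 6371 km radius are assumptions.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Distances from a single point to several candidate points
haversine_km(0, 0, c(0, 10, 45), c(0, 10, 90))
```

If you swap this in, the distance column is in kilometres rather than squared coordinate units, so any threshold has to be chosen on that scale.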
Once you have mydist, you just need to lapply it to each row of df1 and bind the result to your first data frame:
z <- cbind(df1, do.call(rbind, lapply(1:nrow(df1), function(x) mydist(df1[x,]))))
For your test data, the output looks like:
X Y V1 V2 X Y Dens distance
1 2 6 A G 2 5 12 1
2 4 7 B H 4 9 17 4
3 1 8 C I 2 5 12 10
21 2 9 D J 4 9 17 4
22 5 8 E K 4 9 17 2
You can then do something like this to set Dens to NA for matches that are over your threshold (distance holds the squared distance, so the threshold is in squared units):
threshold <- 5
z$Dens[z$distance > threshold] <- NA
X Y V1 V2 X Y Dens distance
1 2 6 A G 2 5 12 1
2 4 7 B H 4 9 17 4
3 1 8 C I 2 5 NA 10
21 2 9 D J 4 9 17 4
22 5 8 E K 4 9 17 2
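To end up with the layout the question asks for (the df1 columns plus a single Dens column), one option is to copy only the thresholded density back into df1 instead of binding every column of the match. This is a minimal sketch that repeats the definitions from the answer so it runs on its own; the names nearest and result are illustrative.

```r
df1 <- data.frame(X = c(2,4,1,2,5), Y = c(6,7,8,9,8),
                  V1 = c("A","B","C","D","E"), V2 = c("G","H","I","J","K"))
df2 <- data.frame(X = c(2,4,6), Y = c(5,9,7), Dens = c(12,17,10))

mydist <- function(row){
  dists <- (row[["X"]] - df2$X)^2 + (row[["Y"]] - df2$Y)^2
  cbind(df2[which.min(dists), ], distance = min(dists))
}

# One row of df2 (plus squared distance) per row of df1
nearest <- do.call(rbind, lapply(seq_len(nrow(df1)), function(i) mydist(df1[i, ])))

threshold <- 5   # compared against the squared distance
result <- df1
# Keep the density only where the nearest point is within the threshold
result$Dens <- ifelse(nearest$distance > threshold, NA, nearest$Dens)
result
```

This avoids the duplicated X and Y column names that cbind produces, which makes later selection by column name less error-prone.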
Your actual data is very large (a simulated data set of the same size takes about 10 minutes on my computer). If possible, you should merge first to pick up the exact coordinate matches, then run this only on the rows that are not exact matches (see dplyr::anti_join).
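The merge-first idea could look roughly like the sketch below. It uses base R only, with an id column and %in% standing in for dplyr::anti_join, and it assumes each coordinate pair occurs at most once in df2 and that exact matches are exactly equal as numbers. The extra df2 point at (2, 9) with Dens 20 is hypothetical, added only so the exact-match path has something to do, and the names id, exact, rest and out are illustrative.

```r
df1 <- data.frame(X = c(2,4,1,2,5), Y = c(6,7,8,9,8),
                  V1 = c("A","B","C","D","E"), V2 = c("G","H","I","J","K"))
# Hypothetical: the question's df2 plus one extra point, (2, 9), that
# coincides exactly with a row of df1.
df2 <- data.frame(X = c(2,4,6,2), Y = c(5,9,7,9), Dens = c(12,17,10,20))
threshold <- 5   # squared distance, as in the answer

df1$id <- seq_len(nrow(df1))                  # remember the original row order
exact <- merge(df1, df2, by = c("X", "Y"))    # inner join: exact matches only
rest <- df1[!(df1$id %in% exact$id), ]        # rows still needing a search

# Nearest-point lookup only for the unmatched rows
rest$Dens <- vapply(seq_len(nrow(rest)), function(i) {
  dists <- (rest$X[i] - df2$X)^2 + (rest$Y[i] - df2$Y)^2
  if (min(dists) > threshold) NA_real_ else df2$Dens[which.min(dists)]
}, numeric(1))

out <- rbind(exact, rest)
out <- out[order(out$id), ]                   # restore the original order
out
```

How much time this saves depends on how many rows match exactly; if almost none do, the nearest-point search still dominates.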