R - dividing a huge dataframe of latitude/longitude points into groups according to location
Question
I am new to R, but I hear that using for loops is a really bad idea. I have working code that uses them, but I would like to improve it because it is extremely slow on big data. I already have a few ideas for improving the algorithm, but I don't know how to vectorize it or how to do it without for loops.
I am simply grouping lat/lng points into circles, with the radius as a parameter.
Example output of the function (it only fills in the values of the circle_id column), with the radius set to 100 meters:
[1] "Locations: "
latitude longitude sensor_time sensor_time2 circle_id
48.15144 17.07569 1447149703 2015-11-10 11:01:43 1
48.15404 17.07452 1447149743 2015-11-10 11:02:23 2
48.15277 17.07514 1447149762 2015-11-10 11:02:42 3
48.15208 17.07538 1447149771 2015-11-10 11:02:51 1
48.15461 17.07560 1447149773 2015-11-10 11:02:53 4
48.15139 17.07562 1447149811 2015-11-10 11:03:31 1
48.15446 17.07517 1447149866 2015-11-10 11:04:26 2
48.15266 17.07330 1447149993 2015-11-10 11:06:33 5
So I have two for loops: loop1 goes through every row, and loop2 goes through every existing circle_id and checks whether the current location from loop1 is within the radius of one of those circles. The centre of each circle is the first location found outside the radius of all previous circles.
The code is as follows:
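library(geosphere)  # provides distHaversine()

# Note: the check `circle_id[i] < 1` below appears to rely on datfr$circle_id
# already existing and being 0 for rows that have not been assigned yet.
init_circles = function(datfr, radius) {
  cnt = 1
  # The first location becomes the centre of circle 1
  datfr$circle_id[1] = 1
  longitude = datfr$longitude[1]
  latitude = datfr$latitude[1]
  circle_id = datfr$circle_id[1]
  # datfr2 holds the centre of every circle found so far
  datfr2 <- data.frame(longitude, latitude, circle_id)
  for (i in 2:NROW(datfr)) {
    # Compare location i with each existing circle centre, one at a time
    for (j in 1:NROW(datfr2)) {
      tmp = distHaversine(c(datfr$longitude[i], datfr$latitude[i]),
                          c(datfr2$longitude[j], datfr2$latitude[j]))
      if (tmp < radius) {
        datfr$circle_id[i] = datfr2$circle_id[j]
        break
      }
    }
    # No circle matched: location i becomes the centre of a new circle
    if (datfr$circle_id[i] < 1) {
      cnt = cnt + 1
      datfr$circle_id[i] = cnt
      datfr2[nrow(datfr2) + 1, ] = c(datfr$longitude[i], datfr$latitude[i], datfr$circle_id[i])
    }
  }
  return(datfr)
}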
datfr is the input data frame without circle_ids set, and datfr2 is a temporary data frame containing the circles that already exist.
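The inner loop described above can be replaced by a single vectorized distance call per row. What follows is only a minimal sketch under assumptions, not the answer's method: `haversine_m` and `assign_circles` are hypothetical names, and the distance is computed in base R (with the 6378137 m radius that distHaversine uses by default) so that the block runs without extra packages. The outer loop stays, because each new centre depends on the rows before it.

```r
# Hypothetical helper: great-circle distance in metres between points,
# vectorized over its arguments (r = 6378137 m is an assumed Earth radius).
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Hypothetical rewrite of init_circles: same greedy rule, but the distances
# from row i to all existing centres are computed in one call.
assign_circles <- function(datfr, radius) {
  n <- nrow(datfr)
  circle_id <- integer(n)
  c_lon <- numeric(n)   # preallocated centre storage (at most n circles)
  c_lat <- numeric(n)
  n_circles <- 0L
  for (i in seq_len(n)) {
    hit <- integer(0)
    if (n_circles > 0L) {
      k <- seq_len(n_circles)
      d <- haversine_m(datfr$longitude[i], datfr$latitude[i], c_lon[k], c_lat[k])
      hit <- which(d < radius)
    }
    if (length(hit) > 0L) {
      circle_id[i] <- hit[1]          # first matching circle, like `break`
    } else {
      n_circles <- n_circles + 1L     # row i starts a new circle
      c_lon[n_circles] <- datfr$longitude[i]
      c_lat[n_circles] <- datfr$latitude[i]
      circle_id[i] <- n_circles
    }
  }
  datfr$circle_id <- circle_id
  datfr
}

# The eight sample locations from the example output
dat <- data.frame(
  latitude  = c(48.15144, 48.15404, 48.15277, 48.15208,
                48.15461, 48.15139, 48.15446, 48.15266),
  longitude = c(17.07569, 17.07452, 17.07514, 17.07538,
                17.07560, 17.07562, 17.07517, 17.07330)
)
res <- assign_circles(dat, radius = 100)
res$circle_id
```

On these eight rows with a 100 m radius, this should reproduce the circle_id column shown in the example output. Preallocating the centre vectors also avoids growing a data frame row by row, which is often a large part of the cost in loops like the original.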
Here is a visual output:
You can see what those circles are used for: the upper red circle has 21 other locations that fit within its radius (21 + 1 original = 22).
Thank you so much for helping, Alena
Recommended answer
I've assumed we have a data frame circles with the center and radius of each circle, and that the sample data posted in your question is in a data frame called dat. The code below vectorizes the distance calculation and uses lapply to calculate the distance of each point from the center of each circle and to determine whether each point is inside the radius of that circle.
library(geosphere)

# We'll check the distance of each data point from the center of each
# of these circles
circles = data.frame(ID=1:2, lon=c(17.074, 17.076), lat=c(48.1513, 48.15142),
                     radius=c(180,190))

datNew = lapply(1:nrow(circles), function(i) {
  df = dat
  df$dist = distHaversine(df[,c("longitude", "latitude")],
                          circles[rep(i,nrow(df)), c('lon','lat')])
  df$in_circle = ifelse(df$dist <= circles[i, "radius"], "Yes", "No")
  df$circle_id = circles[i, "ID"]
  df
})

datNew = do.call(rbind, datNew)
datNew
latitude longitude sensor_time sensor_time2 time3 dist in_circle circle_id
1 48.15144 17.07569 1447149703 2015-11-10 11:01:43 126.47756 Yes 1
2 48.15404 17.07452 1447149743 2015-11-10 11:02:23 307.45048 No 1
3 48.15277 17.07514 1447149762 2015-11-10 11:02:42 184.24465 No 1
4 48.15208 17.07538 1447149771 2015-11-10 11:02:51 134.32601 Yes 1
5 48.15461 17.07560 1447149773 2015-11-10 11:02:53 387.15358 No 1
6 48.15139 17.07562 1447149811 2015-11-10 11:03:31 120.73138 Yes 1
7 48.15446 17.07517 1447149866 2015-11-10 11:04:26 362.34236 No 1
8 48.15266 17.07330 1447149993 2015-11-10 11:06:33 160.07179 Yes 1
9 48.15144 17.07569 1447149703 2015-11-10 11:01:43 23.13059 Yes 2
10 48.15404 17.07452 1447149743 2015-11-10 11:02:23 311.68096 No 2
11 48.15277 17.07514 1447149762 2015-11-10 11:02:42 163.29068 Yes 2
12 48.15208 17.07538 1447149771 2015-11-10 11:02:51 86.70762 Yes 2
13 48.15461 17.07560 1447149773 2015-11-10 11:02:53 356.34955 No 2
14 48.15139 17.07562 1447149811 2015-11-10 11:03:31 28.41890 Yes 2
15 48.15446 17.07517 1447149866 2015-11-10 11:04:26 343.97933 No 2
16 48.15266 17.07330 1447149993 2015-11-10 11:06:33 243.44024 No 2
So we now have a data frame telling us whether each point is inside a given circle. The data frame is in long format, meaning that there are n rows for each point in the original data frame dat, where n is the number of rows in the circles data frame. From here, you can do further processing, such as keeping just one row for each point that's in multiple circles.
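For comparison with the long format, the same information can be held in wide form as a points-by-circles matrix. This is only a sketch under assumptions: `haversine_m` is a hypothetical base R helper standing in for distHaversine (using the same default Earth radius of 6378137 m), and `pts` is an assumed name for the latitude/longitude columns of the sample data.

```r
# Hypothetical helper: great-circle distance in metres, vectorized.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Sample points from the question and the circles defined in the answer
pts <- data.frame(
  latitude  = c(48.15144, 48.15404, 48.15277, 48.15208,
                48.15461, 48.15139, 48.15446, 48.15266),
  longitude = c(17.07569, 17.07452, 17.07514, 17.07538,
                17.07560, 17.07562, 17.07517, 17.07330)
)
circles <- data.frame(ID = 1:2, lon = c(17.074, 17.076),
                      lat = c(48.1513, 48.15142), radius = c(180, 190))

# One column of distances per circle: an nrow(pts) x nrow(circles) matrix
dist_mat <- sapply(seq_len(nrow(circles)), function(i) {
  haversine_m(pts$longitude, pts$latitude, circles$lon[i], circles$lat[i])
})
colnames(dist_mat) <- paste0("circle_", circles$ID)

# Compare each column with that circle's own radius
inside <- sweep(dist_mat, 2, circles$radius, `<=`)
inside
```

Each row of `inside` corresponds to one point and each column to one circle, so there is exactly one row per point. The matrix grows as points times circles, which may matter for very large inputs.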
As an example of such further processing, we'll return a data frame listing which circles each point is inside, or "None" if the point is not inside any circle:
library(dplyr)

datNew %>%
  group_by(latitude, longitude) %>%
  summarise(in_which_circles = if(any(in_circle=="Yes")) paste(circle_id[in_circle=="Yes"], collapse=",") else "None")
latitude longitude in_which_circles
<dbl> <dbl> <chr>
1 48.15139 17.07562 1,2
2 48.15144 17.07569 1,2
3 48.15208 17.07538 1,2
4 48.15266 17.07330 1
5 48.15277 17.07514 2
6 48.15404 17.07452 None
7 48.15446 17.07517 None
8 48.15461 17.07560 None
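Another form of the "one row per point" processing mentioned earlier is to keep only the nearest circle that contains each point. The sketch below is an illustration in base R rather than part of the original answer. It rebuilds a small long-format data frame from the dist and circle_id values printed above, and the `point` index column and the `radius` vector are assumptions added for the example.

```r
# Long-format rows copied from the printed datNew output; `point` is an
# assumed index identifying the eight original locations.
datNew <- data.frame(
  point     = rep(1:8, times = 2),
  dist      = c(126.47756, 307.45048, 184.24465, 134.32601,
                387.15358, 120.73138, 362.34236, 160.07179,
                23.13059, 311.68096, 163.29068, 86.70762,
                356.34955, 28.41890, 343.97933, 243.44024),
  circle_id = rep(1:2, each = 8)
)
radius <- c(180, 190)   # radii of circles 1 and 2, as in the answer
datNew$in_circle <- datNew$dist <= radius[datNew$circle_id]

# Keep rows where the point is inside the circle, sort by distance within
# each point, then keep the first (nearest) row per point.
inside  <- datNew[datNew$in_circle, ]
inside  <- inside[order(inside$point, inside$dist), ]
nearest <- inside[!duplicated(inside$point), c("point", "circle_id", "dist")]

# Points that fall in no circle get NA for circle_id and dist
result <- merge(data.frame(point = 1:8), nearest, all.x = TRUE)
result
```

Picking the nearest circle is one possible tie-breaking rule. The question's own algorithm instead assigns a point to the first circle that contains it.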