Approaches for spatial geodesic latitude longitude clustering in R with geodesic or great circle distances


Problem description


I would like to apply some basic clustering techniques to some latitude and longitude coordinates. Something along the lines of clustering (or some unsupervised learning) the coordinates into groups determined either by their great circle distance or their geodesic distance. NOTE: this could be a very poor approach, so please advise.

Ideally, I would like to tackle this in R.

I have done some searching, but perhaps I missed a solid approach? I have come across the flexclust package and pam (in the cluster package); however, I have not found a clear-cut example of the following:

  1. Defining my own distance function.
  2. Does either flexclust (via kcca or cclust) or pam take random restarts into account?
  3. Icing on the cake = does anyone know of approaches/packages that would allow one to specify the minimum number of elements in each cluster?

Solution

Regarding your first question: Since the data is long/lat, one approach is to use earth.dist(...) in package fossil (calculates great circle dist):

library(fossil)
d <- earth.dist(df)   # distance object (km); expects long, lat in the first two columns

Another approach uses distHaversine(...) in the geosphere package:

geo.dist = function(df) {
  require(geosphere)
  d <- function(i,z){         # z[1:2] contain long, lat
    dist <- rep(0,nrow(z))    # column i of the distance matrix
    dist[i:nrow(z)] <- distHaversine(z[i:nrow(z),1:2],z[i,1:2])  # meters, rows i..n only
    return(dist)
  }
  dm <- do.call(cbind,lapply(1:nrow(df),d,df))  # lower-triangular matrix
  return(as.dist(dm))         # as.dist() reads only the lower triangle
}

The advantage here is that you can use any of the other distance algorithms in geosphere, or you can define your own distance function and use it in place of distHaversine(...). Then apply any of the base R clustering techniques (e.g., kmeans, hclust):

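d  <- geo.dist(df)               # compute the distance object once
km <- kmeans(d, centers=3)       # k-means, 3 clusters; note that kmeans() converts d to a
                                 # matrix and treats each row as a feature vector
hc <- hclust(d)                  # hierarchical clustering, dendrogram
clust <- cutree(hc, k=3)         # cut the dendrogram to generate 3 clusters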
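
To illustrate the point about defining your own distance function, here is a minimal sketch in base R. The names my.haversine and geo.dist2, the Earth radius of 6378137 m, and the four coordinates are assumptions made up for illustration; none of them come from geosphere. The function takes longitude/latitude in degrees and returns meters, mirroring how distHaversine(...) is called above.

```r
# Hypothetical stand-in for distHaversine(): great-circle distance in meters.
# p1: matrix or data frame with long, lat (degrees) in columns 1:2
# p2: a single point, long then lat (degrees)
my.haversine <- function(p1, p2, r = 6378137) {
  p1   <- as.matrix(p1)              # n x 2
  p2   <- as.numeric(unlist(p2))     # length 2
  rad  <- pi / 180
  lon1 <- p1[, 1] * rad; lat1 <- p1[, 2] * rad
  lon2 <- p2[1] * rad;   lat2 <- p2[2] * rad
  a <- sin((lat2 - lat1) / 2)^2 +
       cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))     # pmin() guards against rounding above 1
}

# Same idea as geo.dist() above, with the distance function passed as an argument.
geo.dist2 <- function(df, FUN = my.haversine) {
  n  <- nrow(df)
  dm <- matrix(0, n, n)
  for (i in seq_len(n)) {
    dm[i:n, i] <- FUN(df[i:n, 1:2], df[i, 1:2])   # fill the lower triangle
  }
  as.dist(dm)
}

# Made-up coordinates (long, lat), for illustration only
pts <- data.frame(long = c(-118.24, -118.49, -122.42, -121.49),
                  lat  = c(  34.05,   34.02,   37.77,   38.58))
d2  <- geo.dist2(pts)
cutree(hclust(d2), k = 2)   # hierarchical clustering on the custom distances
```

Passing the distance function as an argument keeps the matrix-building code unchanged when you swap in another metric, as long as the replacement accepts the same two inputs.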

Finally, a real example:

setwd("<directory with all files...>")
cities <- read.csv("GeoLiteCity-Location.csv",header=TRUE,skip=1)
set.seed(123)
CA     <- cities[cities$country=="US" & cities$region=="CA",]
CA     <- CA[sample(1:nrow(CA),100),]   # 100 random cities in California
df     <- data.frame(long=CA$long, lat=CA$lat, city=CA$city)

d      <- geo.dist(df)   # distance matrix
hc     <- hclust(d)      # hierarchical clustering
plot(hc)                 # dendrogram suggests 4 clusters
df$clust <- cutree(hc,k=4)

library(ggplot2)
library(rgdal)           # note: rgdal was retired from CRAN in 2023; sf::st_read() is a current alternative
map.US  <- readOGR(dsn=".", layer="tl_2013_us_state")
map.CA  <- map.US[map.US$NAME=="California",]
map.df  <- fortify(map.CA)
ggplot(map.df)+
  geom_path(aes(x=long, y=lat, group=group))+
  geom_point(data=df, aes(x=long, y=lat, color=factor(clust)), size=4)+
  scale_color_discrete("Cluster")+
  coord_fixed()

The city data is from GeoLite. The US States shapefile is from the Census Bureau.

Edit in response to @Anony-Mousse's comment:

It may seem odd that "LA" is divided between two clusters; however, expanding the map shows that, for this random selection of cities, there is a gap between cluster 3 and cluster 4. Cluster 4 is basically Santa Monica and Burbank; cluster 3 is Pasadena, South LA, Long Beach, and everything south of that.

K-means clustering (4 clusters) does keep the area around LA/Santa Monica/Burbank/Long Beach in one cluster (see below). This just comes down to the different algorithms used by kmeans(...) and hclust(...).

km <- kmeans(d, centers=4)
df$clust <- km$cluster
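
If you want a partitioning method that works directly on pairwise distances, k-medoids is one option. The following is a sketch under stated assumptions, not part of the original answer: it uses pam(...) from the cluster package (a recommended package bundled with standard R installations) on made-up coordinates, with Euclidean dist(...) on degrees standing in for geo.dist(...) so that the block runs without extra packages. It also shows the nstart argument of kmeans(...), which is how k-means gets random restarts; pam(...) with its default build phase is deterministic and needs none.

```r
library(cluster)   # provides pam()

# Made-up coordinates (long, lat), for illustration only
pts <- data.frame(long = c(-118.24, -118.49, -118.15, -122.42, -122.27, -121.49),
                  lat  = c(  34.05,   34.02,   34.15,   37.77,   37.80,   38.58))

# Stand-in distance object; in the workflow above this would be geo.dist(df)
d <- dist(pts)

# k-medoids works directly on a dist object and uses actual points as centers
pm <- pam(d, k = 2)        # diss = TRUE is inferred because d is a "dist" object
pm$clustering              # cluster id per point
pm$medoids                 # labels of the medoid observations

# k-means on the raw coordinates (Euclidean on degrees), with 25 random restarts;
# kmeans() keeps the run with the lowest total within-cluster sum of squares
set.seed(1)
km <- kmeans(pts, centers = 2, nstart = 25)
km$cluster
```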

It's worth noting that these methods require every point to go into some cluster. If you just ask which points are close together, and allow some cities not to belong to any cluster, you get very different results.
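
One way to ask "which points are close together" while letting isolated cities stay unclustered is sketched below. This is an assumption about what such an approach could look like, not the method behind the results mentioned above: single-linkage hclust(...) is cut at a distance threshold, and groups smaller than a minimum size are marked NA. The coordinates, the threshold h, min.size, and the Euclidean dist(...) stand-in for geo.dist(...) are all made up for illustration.

```r
# Made-up coordinates (long, lat), for illustration only
pts <- data.frame(long = c(-118.24, -118.49, -118.15, -122.42, -122.27, -120.00),
                  lat  = c(  34.05,   34.02,   34.15,   37.77,   37.80,   36.00))

d <- dist(pts)            # stand-in; with geo.dist(df) the threshold would be in meters

h        <- 0.5           # hypothetical threshold: link points closer than this
min.size <- 2             # hypothetical minimum cluster size

hc  <- hclust(d, method = "single")   # single linkage: chains of nearby points
grp <- cutree(hc, h = h)              # groups = connected sets under the threshold

sizes <- table(grp)
small <- as.integer(names(sizes)[sizes < min.size])
grp[grp %in% small] <- NA             # points in undersized groups stay unclustered
grp
```

Raising min.size drops larger groups as well, which gives a rough handle on a minimum number of elements per cluster, at the cost of leaving more points unassigned.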
