Clustering 500,000 geospatial points in python


Problem description

I'm currently faced with the problem of finding a way to cluster around 500,000 latitude/longitude pairs in python. So far I've tried computing a distance matrix with numpy (to pass into the scikit-learn DBSCAN) but with such a large input it quickly spits out a Memory Error.

The points are stored in tuples containing the latitude, longitude, and the data value at that point.
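As a minimal sketch of getting such tuples into a form scikit-learn or scipy can consume (the sample coordinates below are made up for illustration), the list of tuples converts directly to a numpy array:

```python
import numpy as np

# Hypothetical sample of (latitude, longitude, value) tuples
points = [
    (40.7128, -74.0060, 1.5),
    (34.0522, -118.2437, 2.0),
    (41.8781, -87.6298, 0.7),
]

arr = np.array(points)   # shape (n, 3)
coords = arr[:, :2]      # just the latitude/longitude columns
values = arr[:, 2]       # the data value at each point
```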

In short, what is the most efficient way to spatially cluster a large number of latitude/longitude pairs in python? For this application, I'm willing to sacrifice some accuracy in the name of speed.

Edit:
The number of clusters for the algorithm to find is unknown ahead of time.

Recommended answer

I don't have your data, so I just generated 500k random numbers in three columns.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2, whiten

arr = np.random.randn(500000 * 3).reshape((500000, 3))
# kmeans2 returns (centroids, labels); 7 clusters picked arbitrarily
centroids, labels = kmeans2(whiten(arr), 7, iter=20)
plt.scatter(arr[:, 0], arr[:, 1], c=labels, alpha=0.33333)

[scatter plot of the 500,000 points, colored by cluster label]

I timed this and it took 1.96 seconds to run kmeans2, so I don't think it has to do with the size of your data. Put your data in a 500000 x 3 numpy array and try kmeans2.
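Since the question notes the number of clusters is unknown ahead of time, DBSCAN may still be the better fit; the Memory Error came from precomputing the full distance matrix, which scikit-learn does not require. A sketch, assuming the lat/lon pairs are in degrees (the `eps_m` radius of 500 meters here is an illustrative choice, not a recommendation):

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters

# (n, 2) array of [latitude, longitude] in degrees; tiny made-up sample
coords = np.array([[40.7128, -74.0060],
                   [40.7130, -74.0059],
                   [34.0522, -118.2437]])

eps_m = 500.0  # cluster radius in meters (illustrative)
db = DBSCAN(eps=eps_m / EARTH_RADIUS_M,  # haversine works in radians
            min_samples=2,
            metric='haversine',
            algorithm='ball_tree')       # spatial index, no n x n matrix
labels = db.fit_predict(np.radians(coords))
# noise points get the label -1
```

With `metric='haversine'` and `algorithm='ball_tree'`, distances are computed on demand through a spatial index instead of a precomputed 500,000 x 500,000 matrix, which is what exhausted memory in the original attempt.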
