DBSCAN for clustering of geographic location data
Question
I have a dataframe with latitude and longitude pairs.
Here is my dataframe.
order_lat order_long
0 19.111841 72.910729
1 19.111342 72.908387
2 19.111342 72.908387
3 19.137815 72.914085
4 19.119677 72.905081
5 19.119677 72.905081
6 19.119677 72.905081
7 19.120217 72.907121
8 19.120217 72.907121
9 19.119677 72.905081
10 19.119677 72.905081
11 19.119677 72.905081
12 19.111860 72.911346
13 19.111860 72.911346
14 19.119677 72.905081
15 19.119677 72.905081
16 19.119677 72.905081
17 19.137815 72.914085
18 19.115380 72.909144
19 19.115380 72.909144
20 19.116168 72.909573
21 19.119677 72.905081
22 19.137815 72.914085
23 19.137815 72.914085
24 19.112955 72.910102
25 19.112955 72.910102
26 19.112955 72.910102
27 19.119677 72.905081
28 19.119677 72.905081
29 19.115380 72.909144
30 19.119677 72.905081
31 19.119677 72.905081
32 19.119677 72.905081
33 19.119677 72.905081
34 19.119677 72.905081
35 19.111860 72.911346
36 19.111841 72.910729
37 19.131674 72.918510
38 19.119677 72.905081
39 19.111860 72.911346
40 19.111860 72.911346
41 19.111841 72.910729
42 19.111841 72.910729
43 19.111841 72.910729
44 19.115380 72.909144
45 19.116625 72.909185
46 19.115671 72.908985
47 19.119677 72.905081
48 19.119677 72.905081
49 19.119677 72.905081
50 19.116183 72.909646
51 19.113827 72.893833
52 19.119677 72.905081
53 19.114100 72.894985
54 19.107491 72.901760
55 19.119677 72.905081
I want to cluster the points that are nearest to each other (within 200 meters of one another). The following is my distance matrix.
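# Assumption: haversine() comes from the `haversine` package and returns kilometers
from haversine import haversine
from scipy.spatial.distance import pdist, squareform
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))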
array([[ 0. , 0.2522482 , 0.2522482 , ..., 1.67313071,
1.05925366, 1.05420922],
[ 0.2522482 , 0. , 0. , ..., 1.44111548,
0.81742536, 0.98978355],
[ 0.2522482 , 0. , 0. , ..., 1.44111548,
0.81742536, 0.98978355],
...,
[ 1.67313071, 1.44111548, 1.44111548, ..., 0. ,
1.02310118, 1.22871515],
[ 1.05925366, 0.81742536, 0.81742536, ..., 1.02310118,
0. , 1.39923529],
[ 1.05420922, 0.98978355, 0.98978355, ..., 1.22871515,
1.39923529, 0. ]])
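For reference, the values in this matrix appear to be haversine (great-circle) distances in kilometers. The following is a minimal sketch of that computation using only the standard library; the function name `haversine_km` and the radius constant are assumptions for illustration, not the code the question actually used.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# First three rows of the dataframe above.
points = [
    (19.111841, 72.910729),  # row 0
    (19.111342, 72.908387),  # row 1
    (19.111342, 72.908387),  # row 2 (duplicate of row 1)
]

# Full pairwise matrix as nested lists.
matrix = [[haversine_km(p, q) for q in points] for p in points]
print(round(matrix[0][1], 3))  # distance between row 0 and row 1, in km
```

The distance between rows 0 and 1 comes out at roughly 0.25 km, which agrees with the 0.2522482 entry shown above, so any threshold applied to this matrix is expressed in kilometers.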
Then I apply the DBSCAN clustering algorithm to the distance matrix.
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2,min_samples=5)
y_db = db.fit_predict(distance_matrix)
I don't know how to choose the eps and min_samples values. It puts points that are far too distant from each other (approximately 2 km apart) into one cluster. Is it because it calculates Euclidean distance while clustering? Please help.
Recommended answer
DBSCAN is meant to be used on the raw data, with a spatial index for acceleration. The only tool I know with acceleration for geo distances is ELKI (Java); scikit-learn unfortunately only supports this for a few distances like Euclidean distance (see sklearn.neighbors.NearestNeighbors).
But apparently you can afford to precompute pairwise distances, so this is not (yet) an issue.
However, you did not read the documentation carefully enough, and your assumption that DBSCAN treats its input as a distance matrix is wrong:
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2,min_samples=5)
db.fit_predict(distance_matrix)
uses Euclidean distance on the distance matrix rows, which obviously does not make any sense.
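To see why this is meaningless, here is a small sketch on hypothetical toy data (three points on a line, not the data from the question). It compares the true distance between two points with the Euclidean distance between their rows of the distance matrix, which is what the default metric='euclidean' would compute.

```python
import math

# Hypothetical toy data: three points on a line at positions 0, 1 and 3,
# so the true pairwise distances are known exactly.
distance_matrix = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

true_d01 = distance_matrix[0][1]                             # real distance
row_d01 = euclidean(distance_matrix[0], distance_matrix[1])  # distance between rows
print(true_d01, round(row_d01, 3))  # prints: 1.0 1.732
```

The two numbers differ, and the row-based value also grows with the number of samples, so an eps chosen in kilometers no longer means what was intended.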
See the documentation of DBSCAN (emphasis added):
class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)
metric : string, or callable
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is "precomputed", X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only "nonzero" elements may be considered neighbors for DBSCAN.
Similarly for fit_predict:
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)
A feature array, or array of distances between samples if metric='precomputed'.
In other words, you need to do
db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
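Note that with metric="precomputed", eps is expressed in the units of the distance matrix. The matrix in the question is in kilometers, so eps=2 allows neighbors up to 2 km apart, while a 200 meter radius corresponds to eps=0.2. The following end-to-end sketch illustrates this on a handful of coordinates taken from the dataframe; the helper `haversine_matrix_km`, the Earth radius constant, and the choice min_samples=3 are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)

def haversine_matrix_km(coords_deg):
    """Pairwise great-circle distances in km for an (n, 2) array of (lat, lon) degrees."""
    lat = np.radians(coords_deg[:, 0])
    lon = np.radians(coords_deg[:, 1])
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# A few rows from the dataframe: two repeated locations and one isolated point.
coords = np.array([
    [19.119677, 72.905081],
    [19.119677, 72.905081],
    [19.119677, 72.905081],
    [19.137815, 72.914085],
    [19.137815, 72.914085],
    [19.137815, 72.914085],
    [19.131674, 72.918510],
])

distance_matrix = haversine_matrix_km(coords)

# eps is in km because the matrix is in km: 0.2 km = 200 m.
db = DBSCAN(eps=0.2, min_samples=3, metric="precomputed")
labels = db.fit_predict(distance_matrix)
print(labels.tolist())  # -1 marks points that DBSCAN treats as noise
```

In this sketch the two repeated locations form separate clusters and the isolated point is labeled as noise, because it has no neighbors within 200 m.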