聚类分析 - sklearn的kmeans使用的是哪种距离度量?
本文介绍了聚类分析 - sklearn的kmeans使用的是哪种距离度量?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
问 题
sklearn的dbscan等其他算法都会有一个metric参数来指定距离度量。
为什么kmeans没有这样的参数。
看了好久源码也没弄懂它默认的是哪种度量。
sklearn的源码如下:
class KMeans(BaseEstimator, ClusterMixin, TransformerMixin):
"""K-Means clustering
Read more in the :ref:`User Guide <k_means>`.
Parameters
----------
n_clusters : int, optional, default: 8
The number of clusters to form as well as the number of
centroids to generate.
max_iter : int, default: 300
Maximum number of iterations of the k-means algorithm for a
single run.
n_init : int, default: 10
Number of time the k-means algorithm will be run with different
centroid seeds. The final results will be the best output of
n_init consecutive runs in terms of inertia.
init : {'k-means++', 'random' or an ndarray}
Method for initialization, defaults to 'k-means++':
'k-means++' : selects initial cluster centers for k-mean
clustering in a smart way to speed up convergence. See section
Notes in k_init for more details.
'random': choose k observations (rows) at random from data for
the initial centroids.
If an ndarray is passed, it should be of shape (n_clusters, n_features)
and gives the initial centers.
algorithm : "auto", "full" or "elkan", default="auto"
K-means algorithm to use. The classical EM-style algorithm is "full".
The "elkan" variation is more efficient by using the triangle
inequality, but currently doesn't support sparse data. "auto" chooses
"elkan" for dense data and "full" for sparse data.
precompute_distances : {'auto', True, False}
Precompute distances (faster but takes more memory).
'auto' : do not precompute distances if n_samples * n_clusters > 12
million. This corresponds to about 100MB overhead per job using
double precision.
True : always precompute distances
False : never precompute distances
tol : float, default: 1e-4
Relative tolerance with regards to inertia to declare convergence
n_jobs : int
The number of jobs to use for the computation. This works by computing
each of the n_init runs in parallel.
If -1 all CPUs are used. If 1 is given, no parallel computing code is
used at all, which is useful for debugging. For n_jobs below -1,
(n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one
are used.
random_state : integer or numpy.RandomState, optional
The generator used to initialize the centers. If an integer is
given, it fixes the seed. Defaults to the global numpy random
number generator.
verbose : int, default 0
Verbosity mode.
copy_x : boolean, default True
When pre-computing distances it is more numerically accurate to center
the data first. If copy_x is True, then the original data is not
modified. If False, the original data is modified, and put back before
the function returns, but small numerical differences may be introduced
by subtracting and then adding the data mean.
Attributes
----------
cluster_centers_ : array, [n_clusters, n_features]
Coordinates of cluster centers
labels_ :
Labels of each point
inertia_ : float
Sum of distances of samples to their closest cluster center.
Examples
--------
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
... [4, 2], [4, 4], [4, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> kmeans.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
>>> kmeans.cluster_centers_
array([[ 1., 2.],
[ 4., 2.]])
See also
--------
MiniBatchKMeans
Alternative online implementation that does incremental updates
of the centers positions using mini-batches.
For large scale learning (say n_samples > 10k) MiniBatchKMeans is
probably much faster than the default batch implementation.
Notes
------
The k-means problem is solved using Lloyd's algorithm.
The average complexity is given by O(k n T), were n is the number of
samples and T is the number of iteration.
The worst case complexity is given by O(n^(k+2/p)) with
n = n_samples, p = n_features. (D. Arthur and S. Vassilvitskii,
'How slow is the k-means method?' SoCG2006)
In practice, the k-means algorithm is very fast (one of the fastest
clustering algorithms available), but it falls in local minima. That's why
it can be useful to restart it several times.
"""
def __init__(self, n_clusters=8, init='k-means++', n_init=10,
max_iter=300, tol=1e-4, precompute_distances='auto',
verbose=0, random_state=None, copy_x=True,
n_jobs=1, algorithm='auto'):
self.n_clusters = n_clusters
self.init = init
self.max_iter = max_iter
self.tol = tol
self.precompute_distances = precompute_distances
self.n_init = n_init
self.verbose = verbose
self.random_state = random_state
self.copy_x = copy_x
self.n_jobs = n_jobs
self.algorithm = algorithm
def _check_fit_data(self, X):
"""Verify that the number of samples given is larger than k"""
X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
if X.shape[0] < self.n_clusters:
raise ValueError("n_samples=%d should be >= n_clusters=%d" % (
X.shape[0], self.n_clusters))
return X
def _check_test_data(self, X):
X = check_array(X, accept_sparse='csr', dtype=FLOAT_DTYPES)
n_samples, n_features = X.shape
expected_n_features = self.cluster_centers_.shape[1]
if not n_features == expected_n_features:
raise ValueError("Incorrect number of features. "
"Got %d features, expected %d" % (
n_features, expected_n_features))
return X
def fit(self, X, y=None):
"""Compute k-means clustering.
Parameters
----------
X : array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster.
"""
random_state = check_random_state(self.random_state)
X = self._check_fit_data(X)
self.cluster_centers_, self.labels_, self.inertia_, self.n_iter_ = \
k_means(
X, n_clusters=self.n_clusters, init=self.init,
n_init=self.n_init, max_iter=self.max_iter, verbose=self.verbose,
precompute_distances=self.precompute_distances,
tol=self.tol, random_state=random_state, copy_x=self.copy_x,
n_jobs=self.n_jobs, algorithm=self.algorithm,
return_n_iter=True)
return self
def fit_predict(self, X, y=None):
"""Compute cluster centers and predict cluster index for each sample.
Convenience method; equivalent to calling fit(X) followed by
predict(X).
"""
return self.fit(X).labels_
def fit_transform(self, X, y=None):
"""Compute clustering and transform X to cluster-distance space.
Equivalent to fit(X).transform(X), but more efficiently implemented.
"""
# Currently, this just skips a copy of the data if it is not in
# np.array or CSR format already.
# XXX This skips _check_test_data, which may change the dtype;
# we should refactor the input validation.
X = self._check_fit_data(X)
return self.fit(X)._transform(X)
def transform(self, X, y=None):
"""Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster
centers. Note that even if X is sparse, the array returned by
`transform` will typically be dense.
Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
Returns
-------
X_new : array, shape [n_samples, k]
X transformed in the new space.
"""
check_is_fitted(self, 'cluster_centers_')
X = self._check_test_data(X)
return self._transform(X)
def _transform(self, X):
"""guts of transform method; no input validation"""
return euclidean_distances(X, self.cluster_centers_)
def predict(self, X):
"""Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, `cluster_centers_` is called
the code book and each value returned by `predict` is the index of
the closest code in the code book.
Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to predict.
Returns
-------
labels : array, shape [n_samples,]
Index of the cluster each sample belongs to.
"""
check_is_fitted(self, 'cluster_centers_')
X = self._check_test_data(X)
x_squared_norms = row_norms(X, squared=True)
return _labels_inertia(X, x_squared_norms, self.cluster_centers_)[0]
def score(self, X, y=None):
"""Opposite of the value of X on the K-means objective.
Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
New data.
Returns
-------
score : float
Opposite of the value of X on the K-means objective.
"""
check_is_fitted(self, 'cluster_centers_')
X = self._check_test_data(X)
x_squared_norms = row_norms(X, squared=True)
return -_labels_inertia(X, x_squared_norms, self.cluster_centers_)[1]
解决方案
kmeans默认使用欧氏距离,这是算法设计之初的度量基础。
原因是算法涉及平均值的计算:
1. 非距离度量不是严格意义上的度量,无法计算平均值,
2. 其余的距离度量的平均值没有实际意义?(不确定)
详见sklearn的k_means()函数源码:
里面有一句:
# precompute squared norms of data points
`x_squared_norms = row_norms(X, squared=True)`
由此可以看出,sklearn的kmeans内部是在计算平方欧氏距离。
def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
n_init=10, max_iter=300, verbose=False,
tol=1e-4, random_state=None, copy_x=True, n_jobs=1,
algorithm="auto", return_n_iter=False):
"""K-means clustering algorithm.
Read more in the :ref:`User Guide <k_means>`.
Parameters
----------
X : array-like or sparse matrix, shape (n_samples, n_features)
The observations to cluster.
n_clusters : int
The number of clusters to form as well as the number of
centroids to generate.
max_iter : int, optional, default 300
Maximum number of iterations of the k-means algorithm to run.
n_init : int, optional, default: 10
Number of time the k-means algorithm will be run with different
centroid seeds. The final results will be the best output of
n_init consecutive runs in terms of inertia.
init : {'k-means++', 'random', or ndarray, or a callable}, optional
Method for initialization, default to 'k-means++':
'k-means++' : selects initial cluster centers for k-mean
clustering in a smart way to speed up convergence. See section
Notes in k_init for more details.
'random': generate k centroids from a Gaussian with mean and
variance estimated from the data.
If an ndarray is passed, it should be of shape (n_clusters, n_features)
and gives the initial centers.
If a callable is passed, it should take arguments X, k and
and a random state and return an initialization.
algorithm : "auto", "full" or "elkan", default="auto"
K-means algorithm to use. The classical EM-style algorithm is "full".
The "elkan" variation is more efficient by using the triangle
inequality, but currently doesn't support sparse data. "auto" chooses
"elkan" for dense data and "full" for sparse data.
precompute_distances : {'auto', True, False}
Precompute distances (faster but takes more memory).
'auto' : do not precompute distances if n_samples * n_clusters > 12
million. This corresponds to about 100MB overhead per job using
double precision.
True : always precompute distances
False : never precompute distances
tol : float, optional
The relative increment in the results before declaring convergence.
verbose : boolean, optional
Verbosity mode.
random_state : integer or numpy.RandomState, optional
The generator used to initialize the centers. If an integer is
given, it fixes the seed. Defaults to the global numpy random
number generator.
copy_x : boolean, optional
When pre-computing distances it is more numerically accurate to center
the data first. If copy_x is True, then the original data is not
modified. If False, the original data is modified, and put back before
the function returns, but small numerical differences may be introduced
by subtracting and then adding the data mean.
n_jobs : int
The number of jobs to use for the computation. This works by computing
each of the n_init runs in parallel.
If -1 all CPUs are used. If 1 is given, no parallel computing code is
used at all, which is useful for debugging. For n_jobs below -1,
(n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one
are used.
return_n_iter : bool, optional
Whether or not to return the number of iterations.
Returns
-------
centroid : float ndarray with shape (k, n_features)
Centroids found at the last iteration of k-means.
label : integer ndarray with shape (n_samples,)
label[i] is the code or index of the centroid the
i'th observation is closest to.
inertia : float
The final value of the inertia criterion (sum of squared distances to
the closest centroid for all observations in the training set).
best_n_iter: int
Number of iterations corresponding to the best results.
Returned only if `return_n_iter` is set to True.
"""
if n_init <= 0:
raise ValueError("Invalid number of initializations."
" n_init=%d must be bigger than zero." % n_init)
random_state = check_random_state(random_state)
if max_iter <= 0:
raise ValueError('Number of iterations should be a positive number,'
' got %d instead' % max_iter)
best_inertia = np.infty
X = as_float_array(X, copy=copy_x)
tol = _tolerance(X, tol)
# If the distances are precomputed every job will create a matrix of shape
# (n_clusters, n_samples). To stop KMeans from eating up memory we only
# activate this if the created matrix is guaranteed to be under 100MB. 12
# million entries consume a little under 100MB if they are of type double.
if precompute_distances == 'auto':
n_samples = X.shape[0]
precompute_distances = (n_clusters * n_samples) < 12e6
elif isinstance(precompute_distances, bool):
pass
else:
raise ValueError("precompute_distances should be 'auto' or True/False"
", but a value of %r was passed" %
precompute_distances)
# subtract of mean of x for more accurate distance computations
if not sp.issparse(X) or hasattr(init, '__array__'):
X_mean = X.mean(axis=0)
if not sp.issparse(X):
# The copy was already done above
X -= X_mean
if hasattr(init, '__array__'):
init = check_array(init, dtype=X.dtype.type, copy=True)
_validate_center_shape(X, n_clusters, init)
init -= X_mean
if n_init != 1:
warnings.warn(
'Explicit initial center position passed: '
'performing only one init in k-means instead of n_init=%d'
% n_init, RuntimeWarning, stacklevel=2)
n_init = 1
# precompute squared norms of data points
x_squared_norms = row_norms(X, squared=True)
best_labels, best_inertia, best_centers = None, None, None
if n_clusters == 1:
# elkan doesn't make sense for a single cluster, full will produce
# the right result.
algorithm = "full"
if algorithm == "auto":
algorithm = "full" if sp.issparse(X) else 'elkan'
if algorithm == "full":
kmeans_single = _kmeans_single_lloyd
elif algorithm == "elkan":
kmeans_single = _kmeans_single_elkan
else:
raise ValueError("Algorithm must be 'auto', 'full' or 'elkan', got"
" %s" % str(algorithm))
if n_jobs == 1:
# For a single thread, less memory is needed if we just store one set
# of the best results (as opposed to one set per run per thread).
for it in range(n_init):
# run a k-means once
labels, inertia, centers, n_iter_ = kmeans_single(
X, n_clusters, max_iter=max_iter, init=init, verbose=verbose,
precompute_distances=precompute_distances, tol=tol,
x_squared_norms=x_squared_norms, random_state=random_state)
# determine if these results are the best so far
if best_inertia is None or inertia < best_inertia:
best_labels = labels.copy()
best_centers = centers.copy()
best_inertia = inertia
best_n_iter = n_iter_
else:
# parallelisation of k-means runs
seeds = random_state.randint(np.iinfo(np.int32).max, size=n_init)
results = Parallel(n_jobs=n_jobs, verbose=0)(
delayed(kmeans_single)(X, n_clusters, max_iter=max_iter, init=init,
verbose=verbose, tol=tol,
precompute_distances=precompute_distances,
x_squared_norms=x_squared_norms,
# Change seed to ensure variety
random_state=seed)
for seed in seeds)
# Get results with the lowest inertia
labels, inertia, centers, n_iters = zip(*results)
best = np.argmin(inertia)
best_labels = labels[best]
best_inertia = inertia[best]
best_centers = centers[best]
best_n_iter = n_iters[best]
if not sp.issparse(X):
if not copy_x:
X += X_mean
best_centers += X_mean
if return_n_iter:
return best_centers, best_labels, best_inertia, best_n_iter
else:
return best_centers, best_labels, best_inertia
这篇关于聚类分析 - sklearn的kmeans使用的是哪种距离度量?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文