How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?


Question

I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering, so I'm going to stick to that name. The current method used by the system I'm on is K-means, but that seems like overkill.

Is there a better way of performing this task?

Answers to some other posts mention KDE (Kernel Density Estimation), but that is a density estimation method; how would that work?

I see how KDE returns a density, but how do I tell it to split the data into bins?

How do I have a fixed number of bins independent of the data (that's one of my requirements)?

More specifically, how would one pull this off using scikit learn?

My input file looks like:

 str ID     sls
 1           10
 2           11 
 3            9
 4           23
 5           21
 6           11  
 7           45
 8           20
 9           11
 10          12

I want to group the sls number into clusters or bins, such that:

Cluster 1: [10 11 9 11 11 12] 
Cluster 2: [23 21 20] 
Cluster 3: [45] 

And my output file will look like:

 str ID     sls    Cluster ID  Cluster centroid
 1          10     1           10.66
 2          11     1           10.66
 3           9     1           10.66
 4          23     2           21.33
 5          21     2           21.33
 6          11     1           10.66
 7          45     3           45
 8          20     2           21.33
 9          11     1           10.66
 10         12     1           10.66

Answer

Write code yourself. Then it fits your problem best!

Boilerplate: never assume code you downloaded from the net to be correct or optimal... make sure you fully understand it before using it.

%matplotlib inline

import numpy as np
from numpy import array, linspace
from sklearn.neighbors import KernelDensity  # sklearn.neighbors.kde is deprecated; import from sklearn.neighbors
from matplotlib.pyplot import plot

a = array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
s = linspace(0, 50)
e = kde.score_samples(s.reshape(-1, 1))  # log-density at each grid point
plot(s, e)

from scipy.signal import argrelextrema
# indices of the local minima (cut points) and local maxima (cluster centers)
mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]
print("Minima:", s[mi])
print("Maxima:", s[ma])
> Minima: [ 17.34693878  33.67346939]
> Maxima: [ 10.20408163  21.42857143  44.89795918]

So your clusters are

# mi holds grid indices into s; s[mi] converts them to cut points on the value axis
print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a <= s[mi][1])], a[a >= s[mi][1]])
> [10 11  9 11 11 12] [23 21 20] [45]
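To produce the output table the question asks for (a cluster ID and centroid per row), one could extend this approach: `np.digitize` against the density minima assigns a label to each value, and a per-cluster mean gives the centroid. A sketch, not part of the original answer, with the bandwidth of 3 carried over from above:

```python
import numpy as np
from scipy.signal import argrelextrema
from sklearn.neighbors import KernelDensity

sls = np.array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12])

# Fit the KDE and locate the density minima on the value axis
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(sls.reshape(-1, 1))
s = np.linspace(0, 50)
e = kde.score_samples(s.reshape(-1, 1))
cuts = s[argrelextrema(e, np.less)[0]]  # cut points between clusters

# Assign 1-based cluster labels by which side of each cut a value falls on
cluster_id = np.digitize(sls, cuts) + 1
centroids = {c: sls[cluster_id == c].mean() for c in np.unique(cluster_id)}

for str_id, (v, c) in enumerate(zip(sls, cluster_id), start=1):
    print(str_id, v, c, round(centroids[c], 2))
```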

and visually, we did this split:

plot(s[:mi[0]+1], e[:mi[0]+1], 'r',
     s[mi[0]:mi[1]+1], e[mi[0]:mi[1]+1], 'g',
     s[mi[1]:], e[mi[1]:], 'b',
     s[ma], e[ma], 'go',
     s[mi], e[mi], 'ro')

We cut at the red markers. The green markers are our best estimates for the cluster centers.
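The question also asks for a fixed number of bins independent of the data, which the answer above does not address directly: the number of KDE modes depends on the bandwidth. One possible way to meet that requirement is to scan bandwidths from wide to narrow and stop at the first density with exactly k modes (k-1 minima), since larger bandwidths merge modes. A sketch, with a hypothetical helper `kde_fixed_clusters` and an arbitrary bandwidth grid:

```python
import numpy as np
from scipy.signal import argrelextrema
from sklearn.neighbors import KernelDensity

def kde_fixed_clusters(a, k, bandwidths=np.linspace(0.5, 10, 96)):
    """Return 1-based cluster labels from the smoothest KDE with exactly k modes.

    Scans bandwidths from wide to narrow; wider bandwidths merge modes,
    so the first hit is the coarsest density that still yields k clusters.
    """
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    s = np.linspace(a.min() - 1, a.max() + 1, 200)
    for bw in sorted(bandwidths, reverse=True):
        e = KernelDensity(kernel='gaussian', bandwidth=bw).fit(a).score_samples(s.reshape(-1, 1))
        cuts = s[argrelextrema(e, np.less)[0]]  # k clusters need k-1 minima
        if len(cuts) == k - 1:
            return np.digitize(a.ravel(), cuts) + 1
    raise ValueError("no bandwidth in the grid produced %d clusters" % k)

labels = kde_fixed_clusters([10, 11, 9, 23, 21, 11, 45, 20, 11, 12], k=3)
```

If no bandwidth in the grid gives exactly k modes, the function raises rather than guessing, so the grid may need widening for other data.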
