How would one use Kernel Density Estimation as a 1D clustering method in scikit-learn?


Question

I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering, so I'm going to stick to that name. The current method used by the system I'm on is K-means, but that seems like overkill.

Is there a better way of performing this task?

Answers to some other posts are mentioning KDE (Kernel Density Estimation), but that is a density estimation method, how would that work?

I see how KDE returns a density, but how do I tell it to split the data into bins?

How do I have a fixed number of bins independent of the data (that's one of my requirements)?

More specifically, how would one pull this off using scikit-learn?

My input file looks like this:

 str ID     sls
 1           10
 2           11 
 3            9
 4           23
 5           21
 6           11  
 7           45
 8           20
 9           11
 10          12

I want to group the sls number into clusters or bins, such that:

Cluster 1: [10 11 9 11 11 12] 
Cluster 2: [23 21 20] 
Cluster 3: [45] 

And my output file will look like:

 str ID     sls    Cluster ID  Cluster centroid
    1        10       1               10.66
    2        11       1               10.66
    3         9       1               10.66 
    4        23       2               21.33   
    5        21       2               21.33
    6        11       1               10.66
    7        45       3               45
    8        20       2               21.33
    9        11       1               10.66 
    10       12       1               10.66

Answer

Write code yourself. Then it fits your problem best!

Caveat: never assume code downloaded from the net is correct or optimal; make sure you fully understand it before using it.

%matplotlib inline

import numpy as np
from numpy import array, linspace
from sklearn.neighbors import KernelDensity  # sklearn.neighbors.kde is a deprecated import path
from matplotlib.pyplot import plot

a = array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
s = linspace(0, 50)                      # 50 evaluation points between 0 and 50
e = kde.score_samples(s.reshape(-1, 1))  # log-density at each evaluation point
plot(s, e)
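The bandwidth=3 above is hand-picked. One common way to choose it (an assumption of mine, not part of the original answer) is to maximize the cross-validated log-likelihood with scikit-learn's GridSearchCV, since KernelDensity.score returns the total log-likelihood of held-out data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

a = np.array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12]).reshape(-1, 1)

# try 10 candidate bandwidths; score each by held-out log-likelihood
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.linspace(0.5, 5, 10)},
                    cv=5)
grid.fit(a)
print(grid.best_params_['bandwidth'])
```

With only 10 data points the folds are tiny, so treat the result as a rough guide rather than a definitive choice.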

from scipy.signal import argrelextrema

# indices of local minima (cut points) and local maxima (cluster centers)
mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]
print("Minima:", s[mi])
print("Maxima:", s[ma])
> Minima: [ 17.34693878  33.67346939]
> Maxima: [ 10.20408163  21.42857143  44.89795918]

So your clusters are:

# split at the density minima; mi holds indices into s, so compare against the
# values s[mi], not the raw indices
print(a[a < s[mi][0]], a[(a >= s[mi][0]) & (a <= s[mi][1])], a[a >= s[mi][1]])
> [10 11  9 11 11 12] [23 21 20] [45]
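The three slices above are hard-coded for exactly two minima. A sketch (my assumption, using np.digitize with the minima values copied from the output above) that generalizes to any number of cut points and also yields the cluster centroids:

```python
import numpy as np

# cut points: the KDE density minima found above
cuts = np.array([17.35, 33.67])
a = np.array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12])

# digitize assigns each value the index of the interval it falls into
labels = np.digitize(a, cuts)                              # -> [0 0 0 1 1 0 2 1 0 0]
centroids = [a[labels == k].mean() for k in np.unique(labels)]
print(labels)
print(centroids)   # approximately [10.67, 21.33, 45.0]
```

This keeps the number of clusters tied to the number of minima the KDE finds, which depends on the bandwidth rather than being fixed in advance.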

and visually, we did this split:

plot(s[:mi[0]+1], e[:mi[0]+1], 'r',
     s[mi[0]:mi[1]+1], e[mi[0]:mi[1]+1], 'g',
     s[mi[1]:], e[mi[1]:], 'b',
     s[ma], e[ma], 'go',
     s[mi], e[mi], 'ro')

We cut at the red markers. The green markers are our best estimates for the cluster centers.
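To produce the output table the question asks for, one possible sketch using pandas (the column names and the two-decimal centroids are assumptions based on the example output, which appears to truncate 10.666... to 10.66 where rounding gives 10.67):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'str ID': range(1, 11),
                   'sls': [10, 11, 9, 23, 21, 11, 45, 20, 11, 12]})

cuts = [17.35, 33.67]  # KDE density minima from above
df['Cluster ID'] = np.digitize(df['sls'], cuts) + 1   # 1-based cluster ids

# per-cluster mean, broadcast back onto every row of that cluster
df['Cluster centroid'] = df.groupby('Cluster ID')['sls'].transform('mean').round(2)
print(df)
```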
