MeanShift in scikit-learn (Python) doesn't understand datatype

Question

I have a dataset with 7265 samples and 132 features. I want to use the mean shift algorithm from scikit-learn, but I ran into this error:

Traceback (most recent call last):
  File "C:\Users\OJ\Dropbox\Dt\Code\visual\facetest\facetracker_video.py", line 130, in <module>
    labels, centers = getClusters(data,clusters)
  File "C:\Users\OJ\Dropbox\Dt\Code\visual\facetest\facetracker_video.py", line 34, in getClusters
    ms.fit(np.array(dataarray))
  File "C:\python2.7\lib\site-packages\sklearn\cluster\mean_shift_.py", line 280, in fit
    cluster_all=self.cluster_all)
  File "C:\python2.7\lib\site-packages\sklearn\cluster\mean_shift_.py", line 137, in mean_shift
    nbrs = NearestNeighbors(radius=bandwidth).fit(sorted_centers)
  File "C:\python2.7\lib\site-packages\sklearn\neighbors\base.py", line 642, in fit
    return self._fit(X)
  File "C:\python2.7\lib\site-packages\sklearn\neighbors\base.py", line 180, in _fit
    raise ValueError("data type not understood")
ValueError: data type not understood

My code:

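import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

dataarray = np.array(data)
# Estimate the kernel bandwidth using every sample
bandwidth = estimate_bandwidth(dataarray, quantile=0.2, n_samples=len(dataarray))
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(dataarray)
labels = ms.labels_
cluster_centers = ms.cluster_centers_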

If I check the data type of the data variable, I see:

print isinstance( dataarray, np.ndarray )
>>> True

The bandwidth is 0.925538333061 and dataarray.dtype is float64.

I'm using scikit-learn 0.14.1.

I can cluster with other algorithms in scikit-learn (I tried k-means and DBSCAN). What am I doing wrong?

The data can be found here (pickle format): http://ojtwist.be/datatocluster.p and: http://ojtwist.be/datatocluster.npz

Recommended answer

That's a bug in the scikit-learn project. It is documented here.

There is a float -> int cast during the fitting process that can crash in some cases (it places the seed points at the corners of the bins instead of at their centers). The link contains some code that fixes the problem.
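To make the corner-versus-center remark concrete, here is a small illustrative sketch. It is not scikit-learn's actual code; `bin_size` and `points` are made-up values. It only shows how a truncating float -> int cast snaps a point to the lower corner of its bin, while rounding first snaps it to the nearest grid point.

```python
import numpy as np

# Hypothetical values for illustration only (not taken from scikit-learn).
bin_size = 1.0
points = np.array([[0.9, 1.6],
                   [1.1, 1.4]])

# Truncating cast: every coordinate is cut down to the bin's lower corner.
corner_seeds = (points / bin_size).astype(int) * bin_size

# Rounding before scaling back: every coordinate moves to the nearest grid point.
center_seeds = np.round(points / bin_size) * bin_size

print(corner_seeds)
print(center_seeds)
```

With truncation the two nearby points 0.9 and 1.1 end up in different bins along the first axis, whereas rounding groups them together.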

If you don't want to modify the scikit-learn code (and want to keep your code compatible across machines), I suggest you normalize your data before passing it to MeanShift.

Try this:

>>> from sklearn import preprocessing
>>> data2 = preprocessing.scale(dataarray)

Then use data2 in your code. It worked for me.

If you don't want to use either solution, this is a great opportunity to contribute to the project by submitting a pull request with the fix :)

You probably want to retain the information needed to "descale" the results of mean shift. So use a StandardScaler object instead of the scale function.
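A minimal sketch of the StandardScaler variant might look like the following. The synthetic `dataarray` is only a stand-in for the asker's data, and the names `scaler`, `centers_scaled` and `centers` are illustrative assumptions; the point is that the fitted scaler keeps the means and standard deviations, so `inverse_transform` can map the cluster centers back to the original units.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): two blobs, 5 features.
rng = np.random.RandomState(0)
dataarray = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 5)),
    rng.normal(loc=10.0, scale=1.0, size=(100, 5)),
])

# Fit the scaler and standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
data2 = scaler.fit_transform(dataarray)

# Cluster the standardized data, as in the question's code.
bandwidth = estimate_bandwidth(data2, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(data2)

labels = ms.labels_
centers_scaled = ms.cluster_centers_

# "Descale": map the cluster centers back to the original feature units.
centers = scaler.inverse_transform(centers_scaled)
```

The labels apply to the original rows unchanged, since scaling does not reorder samples; only the centers need to be transformed back.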

Good luck!
