MeanShift in scikit-learn (Python) doesn't understand datatype
Question
I have a dataset with 7265 samples and 132 features. I want to use the mean shift algorithm from scikit-learn, but I ran into this error:
Traceback (most recent call last):
  File "C:\Users\OJ\Dropbox\Dt\Code\visual\facetest\facetracker_video.py", line 130, in <module>
    labels, centers = getClusters(data,clusters)
  File "C:\Users\OJ\Dropbox\Dt\Code\visual\facetest\facetracker_video.py", line 34, in getClusters
    ms.fit(np.array(dataarray))
  File "C:\python2.7\lib\site-packages\sklearn\cluster\mean_shift_.py", line 280, in fit
    cluster_all=self.cluster_all)
  File "C:\python2.7\lib\site-packages\sklearn\cluster\mean_shift_.py", line 137, in mean_shift
    nbrs = NearestNeighbors(radius=bandwidth).fit(sorted_centers)
  File "C:\python2.7\lib\site-packages\sklearn\neighbors\base.py", line 642, in fit
    return self._fit(X)
  File "C:\python2.7\lib\site-packages\sklearn\neighbors\base.py", line 180, in _fit
    raise ValueError("data type not understood")
ValueError: data type not understood
My code:
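import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# `data` is the loaded dataset (7265 samples x 132 features)
dataarray = np.array(data)
# Estimate the kernel bandwidth from all samples
bandwidth = estimate_bandwidth(dataarray, quantile=0.2, n_samples=len(dataarray))
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(dataarray)  # raises the ValueError shown above
labels = ms.labels_
cluster_centers = ms.cluster_centers_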
If I check the data type of the data variable, I see:
print isinstance( dataarray, np.ndarray )
>>> True
The bandwidth is 0.925538333061 and dataarray.dtype is float64.
I'm using scikit-learn 0.14.1.
I can cluster with other algorithms in scikit-learn (I tried k-means and DBSCAN). What am I doing wrong?
The data can be found here (pickle format): http://ojtwist.be/datatocluster.p and here: http://ojtwist.be/datatocluster.npz
Recommended answer
That's a bug in the scikit-learn project. It is documented here.
There is a float -> int cast during the fitting process that can crash in some cases (it causes the seed points to be placed at the corners of the bins instead of at their centers). There is some code in the link to fix the problem.
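For context, bin seeding roughly means snapping the points to a grid whose cell size is the bandwidth and using the occupied cells as initial seeds. The sketch below only illustrates that general idea; it is not scikit-learn's implementation, and the function name grid_bin_seeds and the toy data are made up for this example. It does show the stage where grid indices are derived from float coordinates, which is where a float -> int conversion would come into play.

```python
import numpy as np
from collections import defaultdict


def grid_bin_seeds(X, bin_size, min_bin_freq=1):
    """Illustrative only: snap points to a grid and return occupied cells as seeds."""
    counts = defaultdict(int)
    for point in X:
        # Round each coordinate to the nearest grid index (still floats here).
        binned = np.round(point / bin_size)
        counts[tuple(binned)] += 1
    # Keep cells that contain enough points, then map indices back to coordinates.
    kept = [cell for cell, freq in counts.items() if freq >= min_bin_freq]
    return np.array(kept, dtype=float) * bin_size


X = np.array([[0.1, 0.2], [0.15, 0.1], [2.9, 3.1], [3.2, 2.8]])
seeds = grid_bin_seeds(X, bin_size=1.0)
# Each pair of nearby points collapses to one seed: (0, 0) and (3, 3).
print(seeds)
```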
If you don't want to modify the scikit-learn code yourself (and want to keep your code compatible with other machines), I suggest you normalize your data before passing it to MeanShift.
Try this:
>>> from sklearn import preprocessing
>>> data2 = preprocessing.scale(dataarray)
Then use data2 in your code. It worked for me.
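Put together with the code from the question, the workaround might look like the sketch below. The synthetic two-blob data is only a stand-in for the real dataset (an assumption for the sake of a runnable example), and it targets a current scikit-learn release, not 0.14.1.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.cluster import MeanShift, estimate_bandwidth

# Synthetic stand-in for the real data: two well-separated blobs, 5 features.
rng = np.random.RandomState(0)
dataarray = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 5)),
    rng.normal(loc=10.0, scale=1.0, size=(100, 5)),
])

# Standardize every feature to zero mean and unit variance before clustering.
data2 = preprocessing.scale(dataarray)

bandwidth = estimate_bandwidth(data2, quantile=0.2, n_samples=len(data2))
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(data2)

labels = ms.labels_
cluster_centers = ms.cluster_centers_  # note: these are in the scaled space
print(len(np.unique(labels)), cluster_centers.shape)
```

Note that the bandwidth is estimated on data2, not on the original array, since scaling changes the distances between points.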
If you don't want to use either workaround, this is a great opportunity to contribute to the project by making a pull request with the fix :)
You probably want to retain the information needed to "descale" the results of mean shift. So use a StandardScaler object instead of the scale function.
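A minimal sketch of that idea, again on made-up data: the fitted scaler remembers the per-feature mean and standard deviation, so inverse_transform can map the cluster centers back to the original units. The variable name centers_original is just for illustration.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.preprocessing import StandardScaler

# Made-up data with different offsets, standing in for the real dataset.
rng = np.random.RandomState(0)
dataarray = np.vstack([
    rng.normal(loc=50.0, scale=2.0, size=(100, 3)),
    rng.normal(loc=80.0, scale=2.0, size=(100, 3)),
])

# The scaler object stores the mean and scale it learned from the data.
scaler = StandardScaler()
data2 = scaler.fit_transform(dataarray)

bandwidth = estimate_bandwidth(data2, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(data2)

# Map the cluster centers from the scaled space back to the original units.
centers_original = scaler.inverse_transform(ms.cluster_centers_)
print(centers_original)
```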
Good luck!