Scikit-learn, GMM: Issue with return from .means_ attribute


Question

Apparently the means_ attribute returns results that differ from the means I calculated for each cluster (or I have misunderstood what it returns).

The following is the code I wrote to check how a GMM fits the time series data I have.

import numpy as np
import pandas as pd
import seaborn as sns
import time
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.mixture import BayesianGaussianMixture
from sklearn.mixture import GaussianMixture


toc = time.time()

The input file contains (# of meters/samples) x (# of features):

read = pd.read_csv('input', sep='\t', index_col= 0, header =0, \
               names =['meter', '6:30', '9:00', '15:30', '22:30', 'std_year', 'week_score', 'season_score'], \
               encoding= 'utf-8')
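read.drop('meter', axis=1, inplace=True)  # pass axis by keyword; the positional form is rejected by recent pandas
read['std_year'] = read['std_year'].divide(4).round(2)

# DataFrame.as_matrix() was removed from pandas; select the columns and call to_numpy() instead
input = read[['6:30', '9:00', '15:30', '22:30']].to_numpy()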

Fit the data to a GMM with 10 clusters. (According to the BIC plot, 5 was the optimal number, with the lowest score, but that score was about -7,000. After a discussion with my advisor, this isn't impossible, but it is still odd.)

gmm = GaussianMixture(n_components=10, covariance_type ='full', \
                  init_params = 'random', max_iter = 100, random_state=0)
gmm.fit(input)
print(gmm.means_.round(2))
cluster = gmm.predict(input)
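The BIC comparison mentioned above could be sketched roughly as follows. This is only an illustration on made-up data, not the asker's script: the synthetic array `X`, the range of component counts, and the names `bics` and `best_k` are all assumptions. It also shows that a negative BIC is not a problem in itself, because the log-likelihood of a continuous density can be positive when the data are tightly concentrated.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in data: three tight clusters with 4 features each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=center, scale=0.1, size=(200, 4))
    for center in (0.5, 1.0, 1.5)
])

# Fit one model per candidate component count and record its BIC (lower is better).
bics = {}
for k in range(1, 7):
    model = GaussianMixture(n_components=k, covariance_type='full',
                            random_state=0).fit(X)
    bics[k] = model.bic(X)

best_k = min(bics, key=bics.get)
print('BIC per component count:', {k: round(v, 1) for k, v in bics.items()})
print('lowest BIC at n_components =', best_k)
```

Only the ordering of the BIC values matters for model selection, so the sign of the lowest score says little on its own.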

What I do next is manually calculate the centroid/center of each cluster (if those are the right terms for the mean vectors), using the labels returned by .predict.

To be specific, cluster contains a value from 0 to 9 for each sample, indicating its cluster. I reshape it into a column and concatenate it with the input matrix of (# of samples) x (# of attributes). I want to take advantage of how easily pandas handles data of this size, so I turn the result into a DataFrame.

cluster = np.array(cluster).reshape(-1,1) #(3488, 1)
ret = np.concatenate((cluster, input), axis=1) #(3488, 5)
ret_pd = pd.DataFrame(ret, columns=['label','6:30', '9:00', '15:30', '22:30'])
ret_pd['label'] = ret_pd['label'].astype(int)

Each meter's cluster assignment is stored in the 'label' column alongside its features. The following code selects the rows for each label and then takes the mean of each column.

cluster_mean = []
for label in range(10):
    # take the column-wise mean for each cluster
    segment= ret_pd[ret_pd['label']== label]
    print(segment)
    turn = np.array(segment)[:, 1:]
    print(turn.shape)
    mean_ = np.mean(turn, axis=0).round(2)  # 1-D ndarray of column means
    print(mean_)
    plt.plot(np.array(mean_), label='cluster %s' %label) 

    cluster_mean.append(list(mean_))

print(cluster_mean)

xvalue = ['6:30', '9:00', '15:30', '22:30']
plt.ylabel('Energy Use [kWh]')
plt.xlabel('time of day')
plt.xticks(range(4), xvalue)
plt.legend(loc = 'upper center', bbox_to_anchor = (0.5, 1.05),\
       ncol =2, fancybox =True, shadow= True)
plt.savefig('cluster_gmm_100.png')

tic = time.time()
print('time ', tic-toc)
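The per-label loop above can also be written as a pandas groupby. The following is a minimal sketch, not the asker's code: the small `ret_pd` frame here is made-up stand-in data that only mimics the column layout used in the question.

```python
import pandas as pd

# Hypothetical stand-in for ret_pd: a 'label' column plus the four time-of-day features.
ret_pd = pd.DataFrame({
    'label': [0, 0, 1, 1, 1],
    '6:30':  [0.4, 0.6, 1.0, 1.2, 1.4],
    '9:00':  [0.8, 1.0, 0.5, 0.7, 0.9],
    '15:30': [1.0, 1.2, 0.9, 1.1, 1.3],
    '22:30': [1.4, 1.6, 1.5, 1.7, 1.9],
})

# One row per cluster label, one column per feature.
cluster_mean = ret_pd.groupby('label').mean().round(2)
print(cluster_mean)
```

This gives the same hard-assignment means as the loop, so it would still not be expected to match gmm.means_.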

What is interesting is that the library's .means_ attribute returns values that differ from the ones I calculate in this code.

Scikit-learn's .means_:

[[ 0.46  1.42  1.12  1.35]
 [ 0.49  0.78  1.19  1.49]
 [ 0.49  0.82  1.01  1.63]
 [ 0.6   0.77  0.99  1.55]
 [ 0.78  0.75  0.92  1.42]
 [ 0.58  0.68  1.03  1.57]
 [ 0.4   0.96  1.25  1.47]
 [ 0.69  0.83  0.98  1.43]
 [ 0.55  0.96  1.03  1.5 ]
 [ 0.58  1.01  1.01  1.47]]

My results:

[[0.45000000000000001, 1.6599999999999999, 1.1100000000000001, 1.29],
 [0.46000000000000002, 0.73999999999999999, 1.26, 1.48],
 [0.45000000000000001, 0.80000000000000004, 0.92000000000000004, 1.78],
 [0.68000000000000005, 0.72999999999999998, 0.85999999999999999, 1.5900000000000001],
 [0.91000000000000003, 0.68000000000000005, 0.84999999999999998, 1.3600000000000001],
 [0.58999999999999997, 0.65000000000000002, 1.02, 1.5900000000000001],
 [0.35999999999999999, 1.03, 1.28, 1.46],
 [0.77000000000000002, 0.88, 0.94999999999999996, 1.3500000000000001],
 [0.53000000000000003, 1.0700000000000001, 0.97999999999999998, 1.53],
 [0.66000000000000003, 1.21, 0.95999999999999996, 1.3600000000000001]]

As an aside, I'm not sure why my results are not properly rounded to 2 decimal places.
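On this side note, the values most likely are rounded: a number such as 0.45 has no exact binary floating-point representation, and printing it with 17 significant digits exposes the nearest representable value. The sketch below uses made-up numbers (the array `mean_` is a hypothetical stand-in) to show the effect and two ways of getting a tidy display.

```python
import numpy as np

# Hypothetical column means, rounded the same way as in the question.
mean_ = np.array([0.4512, 1.6587, 1.1149, 1.2903]).round(2)

# 17 significant digits reveal the nearest double to 0.45.
print(format(mean_[0], '.17g'))  # 0.45000000000000001

# Converting to plain Python floats prints the shortest repr on Python 3.
print(mean_.tolist())

# Formatting explicitly controls the display regardless of the repr in use.
print(['%.2f' % v for v in mean_])
```

The stored numbers are the same in every case; only the way they are displayed changes.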

Recommended answer

Though I'm not completely sure what your code is doing, I'm fairly sure what the problem is here.

The parameters returned by means_ are the means of the parametric (Gaussian) distributions that make up the model. When you calculate the means yourself, you average all the data points that .predict assigned to each component, and this will almost always give different (though similar) results. The reason is that the model's means are estimated with every sample contributing to every component in proportion to its posterior probability, whereas .predict assigns each sample wholly to its single most probable component. To get a better understanding of why these might differ, I would suggest reading a bit more about the expectation-maximization (EM) algorithm that scikit-learn uses to fit GMMs.
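The difference can be illustrated with a small sketch on synthetic data. Everything in it is an assumption made for illustration (the two overlapping clusters in `X`, the tight `tol`, and the names `hard_means`, `soft_means`, `hard_gap` and `soft_gap`); it is not the asker's data or a definitive account of scikit-learn's internals. It compares means_ with the plain per-label average and with an average weighted by the posterior probabilities from predict_proba.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data: two overlapping 2-D Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(1000, 2)),
    rng.normal(loc=[2.5, 0.0], scale=1.0, size=(1000, 2)),
])

# A tight tolerance so that EM is run close to convergence.
gmm = GaussianMixture(n_components=2, covariance_type='full',
                      tol=1e-8, max_iter=2000, random_state=0).fit(X)

# Hard assignment: average the samples that predict() places in each component.
labels = gmm.predict(X)
hard_means = np.array([X[labels == k].mean(axis=0)
                       for k in range(gmm.n_components)])

# Soft assignment: weight every sample by its posterior probability per component.
resp = gmm.predict_proba(X)  # shape (n_samples, n_components)
soft_means = (resp.T @ X) / resp.sum(axis=0)[:, np.newaxis]

hard_gap = np.abs(hard_means - gmm.means_).max()
soft_gap = np.abs(soft_means - gmm.means_).max()
print('means_:\n', gmm.means_.round(3))
print('largest gap, hard-assignment means:', hard_gap)
print('largest gap, responsibility-weighted means:', soft_gap)
```

On overlapping clusters like these, the responsibility-weighted means should track means_ closely, while the per-label averages drift away from it because samples near the boundary are counted for one component only.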
