Scikit-learn, GMM: Issue with return from .means_ attribute

Published: 2021/6/10 19:37:57. Tags: python, numpy, scikit-learn, gmm


Problem description

The means_ attribute apparently returns different values from the means I calculate for each cluster myself (or I misunderstand what it returns).

Below is the code I wrote to check how a GMM fits my time series data.

import numpy as np
import pandas as pd
import seaborn as sns
import time
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.mixture import BayesianGaussianMixture
from sklearn.mixture import GaussianMixture


toc = time.time()

The input file contains (# of meters/samples) x (# of features):

read = pd.read_csv('input', sep='\t', index_col= 0, header =0, \
               names =['meter', '6:30', '9:00', '15:30', '22:30', 'std_year', 'week_score', 'season_score'], \
               encoding= 'utf-8')
read.drop('meter', 1, inplace=True)
read['std_year'] = read['std_year'].divide(4).round(2)

input = read[['6:30', '9:00', '15:30', '22:30']].to_numpy()  # as_matrix() was removed in pandas 1.0

I fit a GMM to it with 10 clusters. (On the BIC plot, 5 was the optimal number with the lowest score, but that score was about -7,000. After discussing it with my advisor, this isn't impossible, but it is still odd.)

gmm = GaussianMixture(n_components=10, covariance_type ='full', \
                  init_params = 'random', max_iter = 100, random_state=0)
gmm.fit(input)
print(gmm.means_.round(2))
cluster = gmm.predict(input)

Next, I manually calculate the centroid/center of each cluster (if those are the right terms for the mean vectors), using the labels returned by .predict.

To be specific, cluster holds a value from 0 to 9 for each sample, indicating its cluster. I reshape it into a column and concatenate it with the input matrix of shape (# of samples) x (# of attributes). Since pandas makes data of this size easy to handle, I then turn the result into a DataFrame.

cluster = np.array(cluster).reshape(-1,1) #(3488, 1)
ret = np.concatenate((cluster, input), axis=1) #(3488, 5)
ret_pd = pd.DataFrame(ret, columns=['label','6:30', '9:00', '15:30', '22:30'])
ret_pd['label'] = ret_pd['label'].astype(int)

Each meter's cluster is stored in the 'label' column. The following code selects the rows for each label and takes the column-wise mean.

cluster_mean = []
for label in range(10):
    # take the column-wise mean for each cluster
    segment= ret_pd[ret_pd['label']== label]
    print(segment)
    turn = np.array(segment)[:, 1:]
    print(turn.shape)
    mean_ = np.mean(turn, axis =0).round(2) #series
    print(mean_)
    plt.plot(np.array(mean_), label='cluster %s' %label) 

    cluster_mean.append(list(mean_))

print(cluster_mean)

xvalue = ['6:30', '9:00', '15:30', '22:30']
plt.ylabel('Energy Use [kWh]')
plt.xlabel('time of day')
plt.xticks(range(4), xvalue)
plt.legend(loc = 'upper center', bbox_to_anchor = (0.5, 1.05),\
       ncol =2, fancybox =True, shadow= True)
plt.savefig('cluster_gmm_100.png')

tic = time.time()
print('time ', tic-toc)
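The per-label mean taken in the loop above can also be sketched with a pandas groupby. This is only an illustration: the small ret_pd frame below is a hypothetical stand-in with made-up numbers, not the real meter data.

```python
import pandas as pd

# Hypothetical stand-in for ret_pd: a 'label' column plus the four features.
ret_pd = pd.DataFrame({
    'label': [0, 0, 1, 1, 1],
    '6:30':  [0.4, 0.6, 1.0, 1.2, 1.4],
    '9:00':  [1.0, 1.2, 0.5, 0.7, 0.9],
    '15:30': [1.1, 1.3, 0.8, 1.0, 1.2],
    '22:30': [1.5, 1.7, 1.3, 1.5, 1.7],
})

# One row per label, one column per feature: the same hard-assignment
# means that the explicit loop computes.
cluster_mean = ret_pd.groupby('label').mean().round(2)
print(cluster_mean)
```

The result is indexed by label, so cluster_mean.loc[k] gives the mean vector of cluster k.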

Interestingly, the .means_ attribute of the fitted model holds different values from the ones I calculate in this code.

Scikit-learn's .means_:

[[ 0.46  1.42  1.12  1.35]
 [ 0.49  0.78  1.19  1.49]
 [ 0.49  0.82  1.01  1.63]
 [ 0.6   0.77  0.99  1.55]
 [ 0.78  0.75  0.92  1.42]
 [ 0.58  0.68  1.03  1.57]
 [ 0.4   0.96  1.25  1.47]
 [ 0.69  0.83  0.98  1.43]
 [ 0.55  0.96  1.03  1.5 ]
 [ 0.58  1.01  1.01  1.47]]

My results:

[[0.45000000000000001, 1.6599999999999999, 1.1100000000000001, 1.29],
 [0.46000000000000002, 0.73999999999999999, 1.26, 1.48],
 [0.45000000000000001, 0.80000000000000004, 0.92000000000000004, 1.78],
 [0.68000000000000005, 0.72999999999999998, 0.85999999999999999, 1.5900000000000001],
 [0.91000000000000003, 0.68000000000000005, 0.84999999999999998, 1.3600000000000001],
 [0.58999999999999997, 0.65000000000000002, 1.02, 1.5900000000000001],
 [0.35999999999999999, 1.03, 1.28, 1.46],
 [0.77000000000000002, 0.88, 0.94999999999999996, 1.3500000000000001],
 [0.53000000000000003, 1.0700000000000001, 0.97999999999999998, 1.53],
 [0.66000000000000003, 1.21, 0.95999999999999996, 1.3600000000000001]]
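One possible reading of the gap between the two tables, offered as a sketch rather than a confirmed diagnosis: a Gaussian mixture assigns every sample to every component with a posterior probability, and the fitted means are averages weighted by those probabilities, while the means computed from .predict labels count each sample fully toward a single cluster. The example below compares both on synthetic data; the two blobs, the component count, and the max_iter and tol values are assumptions chosen for illustration, not the meter data or the settings above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (assumption): two overlapping 2-D blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)),
               rng.normal(2.0, 1.0, size=(300, 2))])

gmm = GaussianMixture(n_components=2, covariance_type='full',
                      max_iter=500, tol=1e-8, random_state=0).fit(X)

# Hard assignment: each sample counts fully toward its most likely component.
labels = gmm.predict(X)
hard_means = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# Soft assignment: each sample contributes to every component, weighted by
# its posterior probability (responsibility) from predict_proba.
resp = gmm.predict_proba(X)                          # shape (n_samples, 2)
soft_means = resp.T @ X / resp.sum(axis=0)[:, None]  # shape (2, n_features)

print('means_    :', gmm.means_.round(2))
print('soft means:', soft_means.round(2))
print('hard means:', hard_means.round(2))
```

When the components overlap, the soft means should track means_ closely, while the hard means can drift away from it, because samples near a boundary are counted entirely for one cluster.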

As an aside, I'm not sure why my results are not rounded to 2 decimal places properly.
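The long digit strings look like a display effect rather than a rounding failure: a value such as 0.45 has no exact binary representation, so printing it with 17 significant digits exposes the nearest double. The sketch below uses made-up input values (an assumption, not the real cluster means) to show the effect and one way to format the numbers for display.

```python
import numpy as np

# Hypothetical unrounded means, used only for illustration.
mean_ = np.array([0.4512, 1.6583, 1.1149, 1.2901]).round(2)

# The value is rounded; 17 significant digits just reveal the nearest double.
print(format(float(mean_[0]), '.17g'))   # 0.45000000000000001

# Format explicitly instead of relying on the default repr of a list.
formatted = ['%.2f' % v for v in mean_]
print(', '.join(formatted))              # 0.45, 1.66, 1.11, 1.29
```

Keeping the values in a NumPy array and printing the array, as the code does for gmm.means_, avoids the long representation as well.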
