Inconsistent results with KMeans between Apache Spark and scikit_learn


Problem description

I am performing clustering on a dataset using PySpark. To find the number of clusters, I performed clustering over a range of values (2, 20) and found the WSSE (within-cluster sum of squares) value for each value of k. This is where I found something unusual. According to my understanding, when you increase the number of clusters, the WSSE decreases monotonically. But the results I got say otherwise. I'm displaying the WSSE for the first few clusters only.

Results from Spark

For k = 002 WSSE is 255318.793358
For k = 003 WSSE is 209788.479560
For k = 004 WSSE is 208498.351074
For k = 005 WSSE is 142573.272672
For k = 006 WSSE is 154419.027612
For k = 007 WSSE is 115092.404604
For k = 008 WSSE is 104753.205635
For k = 009 WSSE is 98000.985547
For k = 010 WSSE is 95134.137071

If you look at the WSSE values for k=5 and k=6, you'll see that the WSSE has increased. I turned to sklearn to see if I would get similar results. The code I used for Spark and sklearn is in the appendix section towards the end of the post. I have tried to use the same values for the parameters in the Spark and sklearn KMeans models. The following are the results from sklearn, and they are as I expected them to be - monotonically decreasing.

Results from sklearn

For k = 002 WSSE is 245090.224247
For k = 003 WSSE is 201329.888159
For k = 004 WSSE is 166889.044195
For k = 005 WSSE is 142576.895154
For k = 006 WSSE is 123882.070776
For k = 007 WSSE is 112496.692455
For k = 008 WSSE is 102806.001664
For k = 009 WSSE is 95279.837212
For k = 010 WSSE is 89303.574467

I am not sure why the WSSE values increase in Spark. I tried using different datasets and found similar behavior there as well. Is there somewhere I am going wrong? Any clues would be great.


APPENDIX

The dataset is available here.

Reading the data and setting up the variables

# get data
import pandas as pd
url = "https://raw.githubusercontent.com/vectosaurus/bb_lite/master/3.0%20data/adult_comp_cont.csv"

df_pandas = pd.read_csv(url)
df_spark = sqlContext.createDataFrame(df_pandas)
target_col = 'high_income'
numeric_cols = [i for i in df_pandas.columns if i !=target_col]

k_min = 2   # 2 is inclusive
k_max = 21  # 21 is exclusive; will fit up to k = 20

max_iter = 1000
seed = 42    
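
The snippet above assumes a sqlContext is already available (as it is in a notebook or the pyspark shell). If it is not, something along these lines would set one up (a minimal sketch; the app name is arbitrary):

from pyspark.sql import SparkSession, SQLContext

# build (or reuse) a local SparkSession and wrap it in an SQLContext,
# so that sqlContext.createDataFrame(df_pandas) above works
spark = SparkSession.builder.appName('kmeans_wsse').getOrCreate()
sqlContext = SQLContext(spark.sparkContext)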

This is the code I used to get the sklearn results:

from sklearn.cluster import KMeans as KMeans_SKL
from sklearn.preprocessing import StandardScaler as StandardScaler_SKL

ss = StandardScaler_SKL(with_std=True, with_mean=True)
ss.fit(df_pandas.loc[:, numeric_cols])
df_pandas_scaled = pd.DataFrame(ss.transform(df_pandas.loc[:, numeric_cols]))

wsse_collect = []

for i in range(k_min, k_max):
    km = KMeans_SKL(random_state=seed, max_iter=max_iter, n_clusters=i)
    _ = km.fit(df_pandas_scaled)
    wsse = km.inertia_
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse))
    wsse_collect.append(wsse)
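
Since the point of collecting these values is to pick the number of clusters, an elbow plot of wsse_collect can help; a small optional sketch using matplotlib (not part of the original code):

import matplotlib.pyplot as plt

# plot WSSE against k and look for an "elbow" where the curve flattens out
plt.plot(range(k_min, k_max), wsse_collect, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('WSSE (inertia)')
plt.show()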

This is the code I used to get the Spark results:

from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.clustering import KMeans

standard_scaler_inpt_features = 'ss_features'
kmeans_input_features = 'features'
kmeans_prediction_features = 'prediction'


assembler = VectorAssembler(inputCols=numeric_cols, outputCol=standard_scaler_inpt_features)
assembled_df = assembler.transform(df_spark)

scaler = StandardScaler(inputCol=standard_scaler_inpt_features, outputCol=kmeans_input_features, withStd=True, withMean=True)
scaler_model = scaler.fit(assembled_df)
scaled_data = scaler_model.transform(assembled_df)

wsse_collect_spark = []

for i in range(k_min, k_max):
    km = KMeans(featuresCol=kmeans_input_features, predictionCol=kmeans_prediction_features,
                k=i, maxIter=max_iter, seed=seed)
    km_fit = km.fit(scaled_data)
    wsse_spark = km_fit.computeCost(scaled_data)
    wsse_collect_spark.append(wsse_spark)
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse_spark))
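
To make the comparison in the question easier to see, the two WSSE series can be lined up per k once both loops have run; a small sketch, assuming wsse_collect and wsse_collect_spark are both populated:

import pandas as pd

# tabulate sklearn vs Spark WSSE per k to spot where the Spark curve goes up
comparison = pd.DataFrame({
    'k': list(range(k_min, k_max)),
    'wsse_sklearn': wsse_collect,
    'wsse_spark': wsse_collect_spark,
})
print(comparison)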

UPDATE

Following @Michail N's answer, I changed the tol and maxIter values for the Spark KMeans model. I re-ran the code but saw the same behavior repeating. But since Michail mentioned

Spark MLlib, in fact, implements K-means||

I increased the number of initSteps by a factor of 50 and re-ran the process, which gave the following results.

For k = 002 WSSE is 255318.718684
For k = 003 WSSE is 212364.906298
For k = 004 WSSE is 185999.709027
For k = 005 WSSE is 168616.028321                                               
For k = 006 WSSE is 123879.449228                                               
For k = 007 WSSE is 113646.930680                                               
For k = 008 WSSE is 102803.889178                                               
For k = 009 WSSE is 97819.497501                                                
For k = 010 WSSE is 99973.198132                                                
For k = 011 WSSE is 89103.510831                                                
For k = 012 WSSE is 84462.110744                                                
For k = 013 WSSE is 78803.619605                                                
For k = 014 WSSE is 82174.640611                                                
For k = 015 WSSE is 79157.287447                                                
For k = 016 WSSE is 75007.269644                                                
For k = 017 WSSE is 71610.292172                                                
For k = 018 WSSE is 68706.739299                                                
For k = 019 WSSE is 65440.906151                                                
For k = 020 WSSE is 66396.106118

The increase in WSSE from k=5 to k=6 disappears. The behavior still persists if you look at k=13 and k=14 and elsewhere, but at least I now know where it was coming from.

Recommended answer

There is nothing wrong with WSSE not decreasing monotonically. In theory, WSSE must decrease monotonically only if the clustering is optimal, meaning that out of all possible clusterings with k centers it is the one with the best WSSE.
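
For reference, the WSSE used above is simply the sum of squared distances from each point to the center of the cluster it was assigned to; a minimal NumPy sketch of that definition (X, centers and labels are placeholder arrays, not variables from the code above):

import numpy as np

def wsse(X, centers, labels):
    # sum of squared Euclidean distances from each point to its assigned center
    diffs = X - centers[labels]
    return float(np.sum(diffs ** 2))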

The problem is that K-means is not necessarily able to find the optimal clustering for a given k. Its iterative process can converge from a random starting point to a local minimum, which may be good but is not optimal.
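
One common way to reduce the impact of an unlucky random start is to fit the same k several times with different seeds and keep the run with the lowest WSSE (this is essentially what scikit-learn's n_init parameter does). A rough sketch against the Spark model above; the number of restarts and k=6 are arbitrary illustration values:

# refit the same k with several different seeds and keep the best run
best_cost, best_model = float('inf'), None
for s in range(5):
    km = KMeans(featuresCol=kmeans_input_features,
                predictionCol=kmeans_prediction_features,
                k=6, maxIter=max_iter, seed=seed + s)
    model = km.fit(scaled_data)
    cost = model.computeCost(scaled_data)   # same WSSE measure used above
    if cost < best_cost:
        best_cost, best_model = cost, model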

There are methods like K-means++ and K-means|| whose initialization schemes are more likely to choose diverse, well-separated centroids and so lead more reliably to a good clustering, and Spark MLlib, in fact, implements K-means||. However, all of them still have an element of randomness in the selection and cannot guarantee an optimal clustering.
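
In the Spark model this initialization is exposed through the initMode and initSteps parameters of KMeans, which is what the update above tweaks; a brief sketch with illustrative values only:

# k-means|| is already the default initMode; raising initSteps makes the
# initialization pass do more work in the hope of picking better seeds
km = KMeans(featuresCol=kmeans_input_features,
            predictionCol=kmeans_prediction_features,
            k=6, maxIter=max_iter, seed=seed,
            initMode='k-means||', initSteps=10)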

The random starting set of clusters chosen for k=6 perhaps led to a particularly suboptimal clustering, or it may have stopped early before it reached its local optimum.

You can improve it by changing the parameters of KMeans manually. The algorithm has a threshold, tol, that controls the minimum amount of cluster centroid movement considered significant; lower values mean the K-means algorithm will let the centroids continue to move longer.

Increasing the maximum number of iterations with maxIter also prevents it from potentially stopping too early at the cost of possibly more computation.

So my advice is to use something like the following:

 ...
 # increase maxIter from the default of 20
 max_iter = 40
 # decrease tol from the default of 0.0001
 tol = 0.00001
 km = KMeans(featuresCol=kmeans_input_features, predictionCol=kmeans_prediction_features,
             k=i, maxIter=max_iter, seed=seed, tol=tol)
 ...
