PySpark ML: Get KMeans cluster statistics


Problem description

I have built a KMeansModel. My results are stored in a PySpark DataFrame called transformed.

(a) How do I interpret the contents of transformed?

(b) How do I create one or more Pandas DataFrame from transformed that would show summary statistics for each of the 13 features for each of the 14 clusters?

from pyspark.ml.clustering import KMeans
# Trains a k-means model.
kmeans = KMeans().setK(14).setSeed(1)
model = kmeans.fit(X_spark_scaled) # Fits a model to the input dataset with optional parameters.

transformed = model.transform(X_spark_scaled).select("features", "prediction") # X_spark_scaled is my PySpark DataFrame consisting of 13 features
transformed.show(5, truncate = False)
+------------------------------------------------------------------------------------------------------------------------------------+----------+
|features                                                                                                                            |prediction|
+------------------------------------------------------------------------------------------------------------------------------------+----------+
|(14,[4,5,7,8,9,13],[1.0,1.0,485014.0,0.25,2.0,1.0])                                                                                 |12        |
|(14,[2,7,8,9,12,13],[1.0,2401233.0,1.0,1.0,1.0,1.0])                                                                                |2         |
|(14,[2,4,5,7,8,9,13],[0.3333333333333333,0.6666666666666666,0.6666666666666666,2429111.0,0.9166666666666666,1.3333333333333333,3.0])|2         |
|(14,[4,5,7,8,9,12,13],[1.0,1.0,2054748.0,0.15384615384615385,11.0,1.0,1.0])                                                         |11        |
|(14,[2,7,8,9,13],[1.0,43921.0,1.0,1.0,1.0])                                                                                         |1         |
+------------------------------------------------------------------------------------------------------------------------------------+----------+
only showing top 5 rows

As an aside, I found from another SO post that I can map the features to their names like below. It would be nice to have summary statistics (mean, median, std, min, max) for each feature of each cluster in one or more Pandas dataframes.

from itertools import chain

attr_list = [attr for attr in chain(*transformed.schema['features'].metadata['ml_attr']['attrs'].values())]
attr_list
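
For what it's worth, each entry of attr_list should be a small dict with 'idx' and 'name' keys (that is how VectorAssembler records its metadata), so an index-ordered list of feature names could be pulled out along these lines:

# assumes entries of the form {'idx': 0, 'name': 'device_type_robot_pct', ...}
feature_names = [a['name'] for a in sorted(attr_list, key=lambda a: a['idx'])]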

Per request in the comments, here is a snapshot consisting of 2 records of the data (don't want to provide too many records -- proprietary information here)

+---------------------+------------------------+-----------------------+----------------------+----------------------+------------------------------+---------------------------------+------------+-------------------+--------------------+------------------------------------+--------------------------+-------------------------------+-----------------+--------------------+--------------------+
|device_type_robot_pct|device_type_smart_tv_pct|device_type_desktop_pct|device_type_tablet_pct|device_type_mobile_pct|device_type_mobile_persist_pct|visitors_seen_with_anonymiser_pct|ip_time_span|          ip_weight|mean_ips_per_visitor|visitors_seen_with_multi_country_pct|international_visitors_pct|visitors_seen_with_multi_ua_pct|count_tuids_on_ip|            features|      scaledFeatures|
+---------------------+------------------------+-----------------------+----------------------+----------------------+------------------------------+---------------------------------+------------+-------------------+--------------------+------------------------------------+--------------------------+-------------------------------+-----------------+--------------------+--------------------+
|                  0.0|                     0.0|                    0.0|                   0.0|                   1.0|                           1.0|                              0.0|    485014.0|               0.25|                 2.0|                                 0.0|                       0.0|                            0.0|              1.0|(14,[4,5,7,8,9,13...|(14,[4,5,7,8,9,13...|
|                  0.0|                     0.0|                    1.0|                   0.0|                   0.0|                           0.0|                              0.0|   2401233.0|                1.0|                 1.0|                                 0.0|                       0.0|                            1.0|              1.0|(14,[2,7,8,9,12,1...|(14,[2,7,8,9,12,1...|

Answer

As Anony-Mousse has commented, (Py)Spark ML is indeed much more limited than scikit-learn or other similar packages, and such functionality is not trivial; nevertheless, here is a way to get what you want (cluster statistics):

spark.version
# u'2.2.0'

from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

# toy data - 5-d features including sparse vectors
df = spark.createDataFrame(
 [(Vectors.sparse(5,[(0, 164.0),(1,520.0)]), 1.0),
  (Vectors.dense([519.0,2723.0,0.0,3.0,4.0]), 1.0),
  (Vectors.sparse(5,[(0, 2868.0), (1, 928.0)]), 1.0),
  (Vectors.sparse(5,[(0, 57.0), (1, 2715.0)]), 0.0),
  (Vectors.dense([1241.0,2104.0,0.0,0.0,2.0]), 1.0)],
 ["features", "target"])

df.show()
# +--------------------+------+ 
# |            features|target| 
# +--------------------+------+ 
# |(5,[0,1],[164.0,5...|   1.0|
# |[519.0,2723.0,0.0...|   1.0| 
# |(5,[0,1],[2868.0,...|   1.0|
# |(5,[0,1],[57.0,27...|   0.0| 
# |[1241.0,2104.0,0....|   1.0|
# +--------------------+------+

kmeans = KMeans(k=3, seed=1)
model = kmeans.fit(df.select('features'))

transformed = model.transform(df).select("features", "prediction")
transformed.show()
# +--------------------+----------+
# |            features|prediction|
# +--------------------+----------+
# |(5,[0,1],[164.0,5...|         1| 
# |[519.0,2723.0,0.0...|         2|
# |(5,[0,1],[2868.0,...|         0|
# |(5,[0,1],[57.0,27...|         2|
# |[1241.0,2104.0,0....|         2|
# +--------------------+----------+

Up to here, and regarding your first question:

How do I interpret the contents of transformed?

The features column is just a replication of the same column in your original data.

The prediction column is the cluster to which the respective data record belongs; in my example, with 5 data records and k=3 clusters, I end up with 1 record in cluster #0, 1 record in cluster #1, and 3 records in cluster #2.
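
If you want to check these cluster sizes directly, a quick way (not part of the original answer, just plain DataFrame API) is a simple group-by on the prediction column:

transformed.groupBy('prediction').count().show()
# expected counts for the toy data above: cluster #0 -> 1 record, #1 -> 1, #2 -> 3 (row order may vary)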

Regarding your second question:

How do I create one or more Pandas DataFrame from transformed that would show summary statistics for each of the 13 features for each of the 14 clusters?

(Note: seems you have 14 features and not 13...)

This is a good example of a seemingly simple task for which, unfortunately, PySpark does not provide ready functionality - not least because all features are grouped in a single vector column, features; to do that, we must first "disassemble" features, effectively coming up with the inverse operation of VectorAssembler.

The only way I can presently think of is to revert temporarily to an RDD and perform a map operation; here is an example with my cluster #2 above, which contains both dense and sparse vectors:

# keep only cluster #2:
cl_2 = transformed.filter(transformed.prediction==2)
cl_2.show() 
# +--------------------+----------+ 
# |            features|prediction|
# +--------------------+----------+
# |[519.0,2723.0,0.0...|         2|
# |(5,[0,1],[57.0,27...|         2|
# |[1241.0,2104.0,0....|         2| 
# +--------------------+----------+

# set the data dimensionality as a parameter:
dimensionality = 5

cluster_2 = (cl_2.drop('prediction')
                 .rdd
                 .map(lambda x: [float(x[0][i]) for i in range(dimensionality)])
                 .toDF(schema=['x' + str(i) for i in range(dimensionality)]))
cluster_2.show()
# +------+------+---+---+---+ 
# |    x0|    x1| x2| x3| x4|
# +------+------+---+---+---+
# | 519.0|2723.0|0.0|3.0|4.0|
# |  57.0|2715.0|0.0|0.0|0.0| 
# |1241.0|2104.0|0.0|0.0|2.0|
# +------+------+---+---+---+

(If you have your initial data in a Spark dataframe initial_data, you can change the last part to toDF(schema=initial_data.columns), in order to keep the original feature names.)

From this point, you could either convert the cluster_2 dataframe to a pandas one (if it fits in your memory), or use the describe() function of Spark dataframes to get your summary statistics:

cluster_2.describe().show()
# result:
+-------+-----------------+-----------------+---+------------------+---+ 
|summary|               x0|               x1| x2|                x3| x4|
+-------+-----------------+-----------------+---+------------------+---+ 
|  count|                3|                3|  3|                 3|  3|
|   mean|605.6666666666666|           2514.0|0.0|               1.0|2.0|
| stddev|596.7389155512932|355.0929455790413|0.0|1.7320508075688772|2.0|
|    min|             57.0|           2104.0|0.0|               0.0|0.0|
|    max|           1241.0|           2723.0|0.0|               3.0|4.0|
+-------+-----------------+-----------------+---+------------------+---+

Using the above code with dimensionality=14 in your case should do the job...
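
If you want all the clusters at once (your question (b)), a minimal sketch along the same lines is the loop below; the names k and cluster_stats are my own choices, not Spark API, and empty clusters (should any occur) are simply skipped:

# minimal sketch: one pandas summary per cluster, collected in a dict
k = 14                 # number of clusters, as in KMeans().setK(14)
dimensionality = 14    # number of features packed in the 'features' vector
cluster_stats = {}
for c in range(k):
    cl = transformed.filter(transformed.prediction == c).drop('prediction')
    if cl.rdd.isEmpty():   # guard against a possibly empty cluster
        continue
    cl_df = cl.rdd.map(lambda x: [float(x[0][i]) for i in range(dimensionality)]) \
                  .toDF(schema=['x' + str(i) for i in range(dimensionality)])
    cluster_stats[c] = cl_df.describe().toPandas()   # count, mean, stddev, min, max per feature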

Annoyed by all these (arguably useless) significant digits in mean and stddev? As a bonus, here is a small utility function I came up with some time ago for a pretty summary:

def prettySummary(df):
    """ Neat summary statistics of a Spark dataframe
    Args:
        pyspark.sql.dataframe.DataFrame (df): input dataframe
    Returns:
        pandas.core.frame.DataFrame: a pandas dataframe with the summary statistics of df
    """
    import pandas as pd
    temp = df.describe().toPandas()
    # convert_objects() was removed in newer pandas versions; pd.to_numeric is the modern equivalent
    temp.iloc[1:3,1:] = temp.iloc[1:3,1:].apply(pd.to_numeric, errors='coerce')
    pd.options.display.float_format = '{:,.2f}'.format
    return temp

stats_df = prettySummary(cluster_2)
stats_df
# result:
    summary     x0       x1   x2   x3   x4
 0  count        3        3    3    3    3 
 1   mean   605.67 2,514.00 0.00 1.00 2.00 
 2 stddev   596.74   355.09 0.00 1.73 2.00 
 3    min     57.0   2104.0  0.0  0.0  0.0 
 4    max   1241.0   2723.0  0.0  3.0  4.0


UPDATE: Thinking about it again, and looking at your sample data, I came up with a more straightforward solution, without the need to invoke an intermediate RDD (an operation one would arguably prefer to avoid, if possible)...

The key observation is the complete contents of transformed, i.e. without the select statements; keeping the same toy dataset as above, we get:

transformed = model.transform(df)  # no 'select' statements
transformed.show()
# +--------------------+------+----------+
# |            features|target|prediction| 
# +--------------------+------+----------+
# |(5,[0,1],[164.0,5...|   1.0|         1|
# |[519.0,2723.0,0.0...|   1.0|         2|
# |(5,[0,1],[2868.0,...|   1.0|         0|
# |(5,[0,1],[57.0,27...|   0.0|         2|
# |[1241.0,2104.0,0....|   1.0|         2|
# +--------------------+------+----------+

As you can see, whatever other columns are present in the dataframe df to be transformed (just one in my case, target) simply "pass through" the transformation procedure and end up in the final outcome...

Hopefully you start getting the idea: if df contains your initial 14 features, each one in a separate column, plus a 15th column named features (roughly as shown in your sample data, but without the last column), then the following code:

kmeans = KMeans().setK(14)
model = kmeans.fit(df.select('features'))
transformed = model.transform(df).drop('features')

will leave you with a Spark dataframe transformed containing 15 columns, i.e. your initial 14 features plus a prediction column with the corresponding cluster number.

From this point, you can proceed as I have shown above to filter specific clusters from transformed and get your summary statistics, but you'll have avoided the (costly...) conversion to intermediate temporary RDDs, thus keeping all your operations in the more efficient context of Spark dataframes...
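
For instance, a minimal sketch for cluster #2, reusing the prettySummary helper defined above:

cl_2 = transformed.filter(transformed.prediction == 2).drop('prediction')
cl_2.describe().show()            # per-feature summary statistics for cluster #2, no RDD round-trip
# stats_df = prettySummary(cl_2)  # or the prettier pandas version from above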
