PySpark ML: Get KMeans cluster statistics
Question
I have built a KMeansModel. My results are stored in a PySpark DataFrame called transformed.
(a) How do I interpret the contents of transformed?
(b) How do I create one or more Pandas DataFrames from transformed that would show summary statistics for each of the 13 features for each of the 14 clusters?
from pyspark.ml.clustering import KMeans
# Trains a k-means model.
kmeans = KMeans().setK(14).setSeed(1)
model = kmeans.fit(X_spark_scaled) # Fits a model to the input dataset with optional parameters.
transformed = model.transform(X_spark_scaled).select("features", "prediction") # X_spark_scaled is my PySpark DataFrame consisting of 13 features
transformed.show(5, truncate = False)
+------------------------------------------------------------------------------------------------------------------------------------+----------+
|features |prediction|
+------------------------------------------------------------------------------------------------------------------------------------+----------+
|(14,[4,5,7,8,9,13],[1.0,1.0,485014.0,0.25,2.0,1.0]) |12 |
|(14,[2,7,8,9,12,13],[1.0,2401233.0,1.0,1.0,1.0,1.0]) |2 |
|(14,[2,4,5,7,8,9,13],[0.3333333333333333,0.6666666666666666,0.6666666666666666,2429111.0,0.9166666666666666,1.3333333333333333,3.0])|2 |
|(14,[4,5,7,8,9,12,13],[1.0,1.0,2054748.0,0.15384615384615385,11.0,1.0,1.0]) |11 |
|(14,[2,7,8,9,13],[1.0,43921.0,1.0,1.0,1.0]) |1 |
+------------------------------------------------------------------------------------------------------------------------------------+----------+
only showing top 5 rows
As an aside, I found from another SO post that I can map the features to their names as shown below. It would be nice to have summary statistics (mean, median, std, min, max) for each feature of each cluster in one or more Pandas dataframes.
from itertools import chain  # needed for chain() below
attr_list = [attr for attr in chain(*transformed.schema['features'].metadata['ml_attr']['attrs'].values())]
attr_list
Per request in the comments, here is a snapshot consisting of 2 records of the data (I don't want to provide too many records, as this is proprietary information):
+---------------------+------------------------+-----------------------+----------------------+----------------------+------------------------------+---------------------------------+------------+-------------------+--------------------+------------------------------------+--------------------------+-------------------------------+-----------------+--------------------+--------------------+
|device_type_robot_pct|device_type_smart_tv_pct|device_type_desktop_pct|device_type_tablet_pct|device_type_mobile_pct|device_type_mobile_persist_pct|visitors_seen_with_anonymiser_pct|ip_time_span| ip_weight|mean_ips_per_visitor|visitors_seen_with_multi_country_pct|international_visitors_pct|visitors_seen_with_multi_ua_pct|count_tuids_on_ip| features| scaledFeatures|
+---------------------+------------------------+-----------------------+----------------------+----------------------+------------------------------+---------------------------------+------------+-------------------+--------------------+------------------------------------+--------------------------+-------------------------------+-----------------+--------------------+--------------------+
| 0.0| 0.0| 0.0| 0.0| 1.0| 1.0| 0.0| 485014.0| 0.25| 2.0| 0.0| 0.0| 0.0| 1.0|(14,[4,5,7,8,9,13...|(14,[4,5,7,8,9,13...|
| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0| 0.0| 2401233.0| 1.0| 1.0| 0.0| 0.0| 1.0| 1.0|(14,[2,7,8,9,12,1...|(14,[2,7,8,9,12,1...|
Recommended answer
As Anony-Mousse has commented, (Py)Spark ML is indeed much more limited than scikit-learn or other similar packages, and such functionality is not trivial; nevertheless, here is a way to get what you want (cluster statistics):
spark.version
# u'2.2.0'
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors
# toy data - 5-d features including sparse vectors
df = spark.createDataFrame(
[(Vectors.sparse(5,[(0, 164.0),(1,520.0)]), 1.0),
(Vectors.dense([519.0,2723.0,0.0,3.0,4.0]), 1.0),
(Vectors.sparse(5,[(0, 2868.0), (1, 928.0)]), 1.0),
(Vectors.sparse(5,[(0, 57.0), (1, 2715.0)]), 0.0),
(Vectors.dense([1241.0,2104.0,0.0,0.0,2.0]), 1.0)],
["features", "target"])
df.show()
# +--------------------+------+
# | features|target|
# +--------------------+------+
# |(5,[0,1],[164.0,5...| 1.0|
# |[519.0,2723.0,0.0...| 1.0|
# |(5,[0,1],[2868.0,...| 1.0|
# |(5,[0,1],[57.0,27...| 0.0|
# |[1241.0,2104.0,0....| 1.0|
# +--------------------+------+
kmeans = KMeans(k=3, seed=1)
model = kmeans.fit(df.select('features'))
transformed = model.transform(df).select("features", "prediction")
transformed.show()
# +--------------------+----------+
# | features|prediction|
# +--------------------+----------+
# |(5,[0,1],[164.0,5...| 1|
# |[519.0,2723.0,0.0...| 2|
# |(5,[0,1],[2868.0,...| 0|
# |(5,[0,1],[57.0,27...| 2|
# |[1241.0,2104.0,0....| 2|
# +--------------------+----------+
Up to here, and regarding your first question:
How do I interpret the contents of transformed?
The features column is just a replication of the same column in your original data.
The prediction column is the cluster to which the respective data record belongs; in my example, with 5 data records and k=3 clusters, I end up with 1 record in cluster #0, 1 record in cluster #1, and 3 records in cluster #2.
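The features values in the question's output are printed in Spark's sparse-vector notation, (size, [indices], [values]): only the listed positions are non-zero. As a minimal pure-Python sketch of how the first row of that output expands (the helper name sparse_to_dense is hypothetical; on an actual PySpark vector, toArray() gives the same expansion):

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a plain list."""
    dense = [0.0] * size              # every position defaults to zero
    for i, v in zip(indices, values):
        dense[i] = v                  # fill in only the listed positions
    return dense

# First row of the question's transformed.show() output
row = sparse_to_dense(14, [4, 5, 7, 8, 9, 13],
                      [1.0, 1.0, 485014.0, 0.25, 2.0, 1.0])
print(row)
```

Read against the sample data in the question, position 7 lines up with ip_time_span and position 13 with count_tuids_on_ip.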
Regarding your second question:
How do I create one or more Pandas DataFrames from transformed that would show summary statistics for each of the 13 features for each of the 14 clusters?
(Note: it seems you have 14 features, not 13...)
This is a good example of a seemingly simple task for which, unfortunately, PySpark does not provide ready-made functionality, not least because all features are grouped in a single vector column, features; to do that, we must first "disassemble" features, effectively coming up with the inverse operation of VectorAssembler.
The only way I can presently think of is to revert temporarily to an RDD and perform a map operation; here is an example with my cluster #2 above, which contains both dense and sparse vectors:
# keep only cluster #2:
cl_2 = transformed.filter(transformed.prediction==2)
cl_2.show()
# +--------------------+----------+
# | features|prediction|
# +--------------------+----------+
# |[519.0,2723.0,0.0...| 2|
# |(5,[0,1],[57.0,27...| 2|
# |[1241.0,2104.0,0....| 2|
# +--------------------+----------+
# set the data dimensionality as a parameter:
dimensionality = 5
cluster_2 = cl_2.drop('prediction').rdd.map(lambda x: [float(x[0][i]) for i in range(dimensionality)]).toDF(schema=['x'+str(i) for i in range(dimensionality)])
cluster_2.show()
# +------+------+---+---+---+
# | x0| x1| x2| x3| x4|
# +------+------+---+---+---+
# | 519.0|2723.0|0.0|3.0|4.0|
# | 57.0|2715.0|0.0|0.0|0.0|
# |1241.0|2104.0|0.0|0.0|2.0|
# +------+------+---+---+---+
(If you have your initial data in a Spark dataframe initial_data, you can change the last part to toDF(schema=initial_data.columns), in order to keep the original feature names.)
From this point, you could either convert the cluster_2 dataframe to a pandas one (if it fits in your memory), or use the describe() function of Spark dataframes to get your summary statistics:
cluster_2.describe().show()
# result:
+-------+-----------------+-----------------+---+------------------+---+
|summary| x0| x1| x2| x3| x4|
+-------+-----------------+-----------------+---+------------------+---+
| count| 3| 3| 3| 3| 3|
| mean|605.6666666666666| 2514.0|0.0| 1.0|2.0|
| stddev|596.7389155512932|355.0929455790413|0.0|1.7320508075688772|2.0|
| min| 57.0| 2104.0|0.0| 0.0|0.0|
| max| 1241.0| 2723.0|0.0| 3.0|4.0|
+-------+-----------------+-----------------+---+------------------+---+
Using the above code with dimensionality=14 in your case should do the job...
Annoyed with all these (arguably useless) significant digits in mean and stddev? As a bonus, here is a small utility function I came up with some time ago for a pretty summary:
def prettySummary(df):
    """ Neat summary statistics of a Spark dataframe
    Args:
        pyspark.sql.dataframe.DataFrame (df): input dataframe
    Returns:
        pandas.core.frame.DataFrame: a pandas dataframe with the summary statistics of df
    """
    import pandas as pd
    temp = df.describe().toPandas()
    # rows 1-2 are mean and stddev; describe() returns strings, so convert them to numbers
    # (pd.to_numeric replaces convert_objects, which was removed in pandas 1.0)
    temp.iloc[1:3,1:] = temp.iloc[1:3,1:].apply(pd.to_numeric)
    pd.options.display.float_format = '{:,.2f}'.format
    return temp
stats_df = prettySummary(cluster_2)
stats_df
# result:
summary x0 x1 x2 x3 x4
0 count 3 3 3 3 3
1 mean 605.67 2,514.00 0.00 1.00 2.00
2 stddev 596.74 355.09 0.00 1.73 2.00
3 min 57.0 2104.0 0.0 0.0 0.0
4 max 1241.0 2723.0 0.0 3.0 4.0
UPDATE: Thinking about it again, and seeing your sample data, I came up with a more straightforward solution, without the need to invoke an intermediate RDD (an operation that one would arguably prefer to avoid, if possible)...
The key observation is the complete contents of transformed, i.e. without the select statements; keeping the same toy dataset as above, we get:
transformed = model.transform(df) # no 'select' statements
transformed.show()
# +--------------------+------+----------+
# | features|target|prediction|
# +--------------------+------+----------+
# |(5,[0,1],[164.0,5...| 1.0| 1|
# |[519.0,2723.0,0.0...| 1.0| 2|
# |(5,[0,1],[2868.0,...| 1.0| 0|
# |(5,[0,1],[57.0,27...| 0.0| 2|
# |[1241.0,2104.0,0....| 1.0| 2|
# +--------------------+------+----------+
As you can see, whatever other columns are present in the dataframe df to be transformed (just one in my case, target) simply pass through the transformation procedure and end up in the final outcome...
Hopefully you start getting the idea: if df contains your initial 14 features, each one in a separate column, plus a 15th column named features (roughly as shown in your sample data, but without the last column), then the following code:
kmeans = KMeans().setK(14)
model = kmeans.fit(df.select('features'))
transformed = model.transform(df).drop('features')
will leave you with a Spark dataframe transformed containing 15 columns, i.e. your initial 14 features plus a prediction column with the corresponding cluster number.
From this point, you can proceed as I have shown above to filter specific clusters from transformed and get your summary statistics, but you'll have avoided the (costly...) conversion to intermediate temporary RDDs, thus keeping all your operations in the more efficient context of Spark dataframes...
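Once transformed has one column per feature plus prediction, a per-cluster summary for all clusters at once can be sketched in pandas, including the median the question asks for (Spark's describe() does not report it). The toy frame below stands in for transformed.toPandas(); the x0..x4 column names and the stats and cluster_2_stats variables are illustrative assumptions, and this route only applies if the data fits in driver memory:

```python
import pandas as pd

# Toy rows mirroring the answer's example: five features plus the cluster label.
# With real data this frame would come from something like transformed.toPandas().
pdf = pd.DataFrame(
    [[164.0, 520.0, 0.0, 0.0, 0.0, 1],
     [519.0, 2723.0, 0.0, 3.0, 4.0, 2],
     [2868.0, 928.0, 0.0, 0.0, 0.0, 0],
     [57.0, 2715.0, 0.0, 0.0, 0.0, 2],
     [1241.0, 2104.0, 0.0, 0.0, 2.0, 2]],
    columns=["x0", "x1", "x2", "x3", "x4", "prediction"],
)

# One row per cluster; columns form a (feature, statistic) MultiIndex
stats = pdf.groupby("prediction").agg(["mean", "median", "std", "min", "max"])

# Reshape a single cluster into a features-by-statistics table
cluster_2_stats = stats.loc[2].unstack()
print(cluster_2_stats)
```

The std reported here is the sample standard deviation, which is also what the stddev row of describe() shows above; clusters with a single record get NaN for it.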