Create feature vector programmatically in Spark ML / pyspark


Problem description

I'm wondering if there is a concise way to run ML (e.g. KMeans) on a DataFrame in pyspark when the features are spread across multiple numeric columns.

For example, in the Iris dataset:

(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)

I'd like to use KMeans without recreating the dataset with the feature vector added manually as a new column and the original columns hardcoded repeatedly in the code.

The solution I'd like to improve:

# On Spark 2.0+, import Vectors from pyspark.ml.linalg instead
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import Row
from pyspark.ml.clustering import KMeans, KMeansModel

iris = sqlContext.read.parquet("/opt/data/iris.parquet")
iris.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)

# .rdd is required on Spark 2.0+, where DataFrame no longer has .map
df = iris.rdd.map(lambda r: Row(
                    id = r.id,
                    a1 = r.a1,
                    a2 = r.a2,
                    a3 = r.a3,
                    a4 = r.a4,
                    label = r.label,
                    binomial_label=r.binomial_label,
                    features = Vectors.dense(r.a1, r.a2, r.a3, r.a4))
                    ).toDF()


# Parentheses let the chained setter calls span several lines
kmeans_estimator = (KMeans()
    .setFeaturesCol("features")
    .setPredictionCol("prediction"))
kmeans_transformer = kmeans_estimator.fit(df)

predicted_df = kmeans_transformer.transform(df).drop("features")
predicted_df.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, binomial_label=1, id=u'id_1', label=u'Iris-setosa', prediction=1)

I'm looking for a solution along these lines:

feature_cols = ["a1", "a2", "a3", "a4"]
prediction_col_name = "prediction"
<dataframe independent code for KMeans>
<New dataframe is created, extended with the `prediction` column.>

Recommended answer

You can use VectorAssembler:

from pyspark.ml.feature import VectorAssembler

# df is the raw DataFrame (`iris` in the question), with no hand-built `features` column
ignore = ['id', 'label', 'binomial_label']
assembler = VectorAssembler(
    inputCols=[x for x in df.columns if x not in ignore],
    outputCol='features')

assembler.transform(df)
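
The list comprehension above picks the inputs by excluding known non-feature columns. A possible variation, sketched here as an illustration rather than as part of the original answer, is to select columns by type from the `(name, type)` pairs that `df.dtypes` returns. The helper name `numeric_feature_columns`, the `NUMERIC_TYPES` set, and the sample type strings are assumptions made for this sketch. Numeric label columns such as `binomial_label` still have to be excluded explicitly.

```python
# Spark SQL type names treated as numeric in this sketch (an assumption;
# extend the set if your schema uses other numeric types).
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_feature_columns(dtypes, ignore=()):
    """Return names of numeric columns from (name, type) pairs, skipping `ignore`.

    `dtypes` is expected to look like the list returned by `df.dtypes`.
    """
    return [name for name, dtype in dtypes
            if dtype in NUMERIC_TYPES and name not in ignore]


# Hypothetical pairs shaped like `iris.dtypes` for the Iris example.
iris_dtypes = [
    ("a1", "double"), ("a2", "double"), ("a3", "double"), ("a4", "double"),
    ("id", "string"), ("label", "string"), ("binomial_label", "bigint"),
]

feature_cols = numeric_feature_columns(iris_dtypes, ignore=["binomial_label"])
print(feature_cols)  # ['a1', 'a2', 'a3', 'a4']
```

The resulting list can be passed as `inputCols` to `VectorAssembler` in place of the exclusion-based comprehension.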

It can be combined with k-means using ML Pipeline:

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[assembler, kmeans_estimator])
model = pipeline.fit(df)
