Create feature vector programmatically in Spark ML / pyspark

Problem Description

I'm wondering if there is a concise way to run ML (e.g. KMeans) on a DataFrame in pyspark when the features live in multiple numeric columns.

E.g., as in the Iris dataset:

(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)

I'd like to use KMeans without recreating the dataset with the feature vector manually added as a new column, and without hardcoding the original column names repeatedly in the code.

The solution I'd like to improve:

from pyspark.mllib.linalg import Vectors
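# (on Spark 2.0+, pyspark.ml expects vectors from pyspark.ml.linalg instead)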
from pyspark.sql.types import Row
from pyspark.ml.clustering import KMeans, KMeansModel

iris = sqlContext.read.parquet("/opt/data/iris.parquet")
iris.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)

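# DataFrame.map works on Spark 1.x; on Spark 2.0+, use iris.rdd.map(...) instead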
df = iris.map(lambda r: Row(
                    id = r.id,
                    a1 = r.a1,
                    a2 = r.a2,
                    a3 = r.a3,
                    a4 = r.a4,
                    label = r.label,
                    binomial_label=r.binomial_label,
                    features = Vectors.dense(r.a1, r.a2, r.a3, r.a4))
                    ).toDF()


kmeans_estimator = KMeans()\
    .setFeaturesCol("features")\
    .setPredictionCol("prediction")
kmeans_transformer = kmeans_estimator.fit(df)

predicted_df = kmeans_transformer.transform(df).drop("features")
predicted_df.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, binomial_label=1, id=u'id_1', label=u'Iris-setosa', prediction=1)

I'm looking for a solution, which is something like:

feature_cols = ["a1", "a2", "a3", "a4"]
prediction_col_name = "prediction"
<dataframe independent code for KMeans>
<New dataframe is created, extended with the `prediction` column.>

Solution

You can use VectorAssembler (http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler):

from pyspark.ml.feature import VectorAssembler

ignore = ['id', 'label', 'binomial_label']  # non-feature columns; assumes df has no pre-built 'features' column
assembler = VectorAssembler(
    inputCols=[x for x in df.columns if x not in ignore],
    outputCol='features')

assembler.transform(df)
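
For reference, the transformed DataFrame gains a features vector column built from the remaining numeric columns. A minimal check (a sketch, assuming the Iris schema above; the exact Row layout may vary by Spark version):

assembled = assembler.transform(df)
assembled.select("id", "features").first()
# e.g. Row(id=u'id_1', features=DenseVector([5.1, 3.5, 1.4, 0.2]))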

It can be combined with k-means using an ML Pipeline:

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[assembler, kmeans_estimator])
model = pipeline.fit(df)
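
Putting the two pieces together gives the DataFrame-independent pattern asked for in the question. The helper below is a sketch under the assumptions above; run_kmeans and its parameters are illustrative names, not part of the Spark API:

from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

def run_kmeans(df, feature_cols, prediction_col="prediction", k=3):
    # Assemble the given numeric columns into one vector column,
    # then cluster; nothing else about the DataFrame is hardcoded.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    kmeans = KMeans(k=k, featuresCol="features", predictionCol=prediction_col)
    model = Pipeline(stages=[assembler, kmeans]).fit(df)
    # Drop the intermediate vector column so only the prediction column is added.
    return model.transform(df).drop("features")

predicted_df = run_kmeans(iris, ["a1", "a2", "a3", "a4"])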
