Create feature vector programmatically in Spark ML / pyspark
Problem description
I'm wondering if there is a concise way to run ML (e.g. KMeans) on a DataFrame in pyspark if I have the features in multiple numeric columns.
I.e., as in the Iris dataset:
(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)
I'd like to use KMeans without recreating the DataFrame with the feature vector added manually as a new column, and without the original columns hard-coded repeatedly in the code.
The solution I'd like to improve:
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import Row
from pyspark.ml.clustering import KMeans, KMeansModel
iris = sqlContext.read.parquet("/opt/data/iris.parquet")
iris.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)
df = iris.map(lambda r: Row(
    id=r.id,
    a1=r.a1,
    a2=r.a2,
    a3=r.a3,
    a4=r.a4,
    label=r.label,
    binomial_label=r.binomial_label,
    features=Vectors.dense(r.a1, r.a2, r.a3, r.a4))
).toDF()
kmeans_estimator = KMeans()\
    .setFeaturesCol("features")\
    .setPredictionCol("prediction")
kmeans_transformer = kmeans_estimator.fit(df)
predicted_df = kmeans_transformer.transform(df).drop("features")
predicted_df.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, binomial_label=1, id=u'id_1', label=u'Iris-setosa', prediction=1)
I'm looking for a solution, which is something like:
feature_cols = ["a1", "a2", "a3", "a4"]
prediction_col_name = "prediction"
<dataframe independent code for KMeans>
<New dataframe is created, extended with the `prediction` column.>
You can use VectorAssembler:
from pyspark.ml.feature import VectorAssembler
ignore = ['id', 'label', 'binomial_label']
assembler = VectorAssembler(
    inputCols=[x for x in df.columns if x not in ignore],
    outputCol='features')
assembler.transform(df)
It can be combined with k-means using an ML Pipeline:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[assembler, kmeans_estimator])
model = pipeline.fit(df)