Join two Spark mllib pipelines together


Problem description

I have two separate DataFrames which each have several differing processing stages which I use mllib transformers in a pipeline to handle.

I now want to join these two pipelines together, keeping the features (columns) from each DataFrame.

Scikit-learn has the FeatureUnion class for handling this, but I can't seem to find an equivalent in mllib.

I could add a custom transformer stage at the end of one pipeline that takes the DataFrame produced by the other pipeline as an attribute and joins it in the transform method, but that seems messy.

Recommended answer

Both Pipeline and PipelineModel are valid PipelineStages, and as such can be combined in a single Pipeline. For example, with:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

df = spark.createDataFrame([
    (1.0, 0, 1, 1, 0),
    (0.0, 1, 0, 0, 1)
], ("label", "x1", "x2", "x3", "x4"))

pipeline1 = Pipeline(stages=[
    VectorAssembler(inputCols=["x1", "x2"], outputCol="features1")
])

pipeline2 = Pipeline(stages=[
    VectorAssembler(inputCols=["x3", "x4"], outputCol="features2")
])

you can combine the Pipelines:

Pipeline(stages=[
    pipeline1, pipeline2, 
    VectorAssembler(inputCols=["features1", "features2"], outputCol="features")
]).fit(df).transform(df)

+-----+---+---+---+---+---------+---------+-----------------+
|label|x1 |x2 |x3 |x4 |features1|features2|features         |
+-----+---+---+---+---+---------+---------+-----------------+
|1.0  |0  |1  |1  |0  |[0.0,1.0]|[1.0,0.0]|[0.0,1.0,1.0,0.0]|
|0.0  |1  |0  |0  |1  |[1.0,0.0]|[0.0,1.0]|[1.0,0.0,0.0,1.0]|
+-----+---+---+---+---+---------+---------+-----------------+

or pre-fitted PipelineModels:

model1 = pipeline1.fit(df)
model2 = pipeline2.fit(df)

Pipeline(stages=[
    model1, model2, 
    VectorAssembler(inputCols=["features1", "features2"], outputCol="features")
]).fit(df).transform(df)

+-----+---+---+---+---+---------+---------+-----------------+
|label| x1| x2| x3| x4|features1|features2|         features|
+-----+---+---+---+---+---------+---------+-----------------+
|  1.0|  0|  1|  1|  0|[0.0,1.0]|[1.0,0.0]|[0.0,1.0,1.0,0.0]|
|  0.0|  1|  0|  0|  1|[1.0,0.0]|[0.0,1.0]|[1.0,0.0,0.0,1.0]|
+-----+---+---+---+---+---------+---------+-----------------+

So the approach I would recommend is to join the data beforehand, and fit and transform the whole DataFrame.


