How to train a SparkML gradient boosting classifier given an RDD


Problem description

Given the following rdd:

from pyspark.sql.functions import col

training_rdd = rdd.select(
    # Categorical features
    col('device_os'), # 'ios', 'android'

    # Numeric features
    col('30day_click_count'),
    col('30day_impression_count'),
    # Column objects support "/" directly, so NumPy is not needed here
    (col('30day_click_count') / col('30day_impression_count')).alias('30day_click_through_rate'),

    # label
    col('did_click').alias('label')
)

I am confused about the syntax for training a gradient boosting classifier.

I am following this tutorial: https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier

However, I am unsure how to get my 4 feature columns into a vector, because VectorIndexer assumes that all the features are already in one column.

Recommended answer

You can use VectorAssembler to generate the feature vector. Note that you will have to convert your rdd to a DataFrame first.

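from pyspark.ml.feature import VectorAssembler

# VectorAssembler accepts only numeric, boolean and vector input columns,
# so a string column such as device_os has to be converted to a numeric
# one (for example with StringIndexer) before it can be assembled.
vectorizer = VectorAssembler()

vectorizer.setInputCols(["device_os",
                         "30day_click_count",
                         "30day_impression_count",
                         "30day_click_through_rate"])

vectorizer.setOutputCol("features")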

Consequently, you will need to put vectorizer as the first stage in the Pipeline:

from pyspark.ml import Pipeline

# Pipeline takes its stages as a keyword argument
pipeline = Pipeline(stages=[vectorizer, ...])
