How to use XGBoost in a PySpark Pipeline

Question

I want to update my PySpark code. In PySpark, the base model has to be placed in a pipeline, and the official pipeline demo uses LogisticRegression as the base model. However, it does not seem possible to use an XGBoost model in the Pipeline API. How can I use PySpark like this:

from pyspark.ml import Pipeline
from xgboost import XGBClassifier
...
model = XGBClassifier()
model.fit(X_train, y_train)
pipeline = Pipeline(stages=[..., model, ...])
...

The Pipeline API is convenient to use, so can anybody offer some advice? Thanks.

Recommended answer

There is a maintained distributed XGBoost library, used in production by several companies, as mentioned above (https://github.com/dmlc/xgboost). However, using it from PySpark is a bit tricky. Someone made a working PySpark wrapper for version 0.72 of the library, with 0.8 support in progress.

See https://medium.com/@bogdan.cojocar/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb and https://github.com/dmlc/xgboost/issues/1698 for the full discussion.

Make sure the XGBoost jars are on your PySpark jar path.
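One common way to put extra jars on the PySpark jar path from a script or notebook is to set the PYSPARK_SUBMIT_ARGS environment variable before the first Spark session is created. The following is only a minimal sketch of that idea: the helper name build_submit_args and the jar paths are hypothetical placeholders, not names taken from the answer or from the wrapper, so substitute the jars you actually built or downloaded.

```python
import os


def build_submit_args(jar_paths):
    """Return a PYSPARK_SUBMIT_ARGS value that passes the given jars via --jars.

    Hypothetical helper for illustration; it only builds the string.
    """
    if not jar_paths:
        raise ValueError("at least one jar path is required")
    # --jars takes a comma-separated list; "pyspark-shell" must come last.
    return "--jars " + ",".join(jar_paths) + " pyspark-shell"


# Assumed jar locations (placeholders); replace with your real paths.
XGBOOST_JARS = [
    "/path/to/xgboost4j-0.72.jar",
    "/path/to/xgboost4j-spark-0.72.jar",
]

# Must be set before the first SparkSession/SparkContext is created
# in this Python process, otherwise the JVM has already been launched.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(XGBOOST_JARS)
```

After this, a Spark session created in the same process should pick up the jars. When launching with spark-submit instead, the same list can be passed directly with its --jars option.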
