How to use XGBoost in a PySpark Pipeline
Question
I want to update my PySpark code. In PySpark, the base model must be put into a Pipeline; the official Pipeline demo uses LogisticRegression as the base model. However, it does not seem possible to use an XGBoost model in the Pipeline API. How can I do something like this in PySpark?
from xgboost import XGBClassifier
...
model = XGBClassifier()
model.fit(X_train, y_train)
pipeline = Pipeline(stages=[..., model, ...])
...
Using the Pipeline API is convenient, so can anybody give some advice? Thanks.
Answer
There is a maintained distributed XGBoost library (used in production by several companies), as mentioned above: https://github.com/dmlc/xgboost. However, using it from PySpark is a bit tricky. Someone made a working PySpark wrapper for version 0.72 of the library, with 0.8 support in progress.
See https://medium.com/@bogdan.cojocar/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb, and https://github.com/dmlc/xgboost/issues/1698 for the full discussion.
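Assuming you are using the 0.72-era wrapper described in the article above (a `sparkxgb` module exposing an `XGBoostEstimator` class — the module and class names are taken from that write-up and may differ in the version you install), the estimator slots into a Pipeline like any built-in Spark ML stage. A minimal sketch, with column names borrowed from the Titanic dataset used in the article:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

# Hypothetical import: module name depends on the wrapper version you install.
from sparkxgb import XGBoostEstimator

# Assemble raw columns into the single features vector that Spark ML expects.
assembler = VectorAssembler(
    inputCols=["Age", "Fare", "SexIndex"],  # illustrative Titanic columns
    outputCol="features",
)

# Configure the XGBoost stage with the standard Spark ML column parameters.
xgboost = XGBoostEstimator(
    featuresCol="features",
    labelCol="Survived",
    predictionCol="prediction",
)

# The estimator goes into the Pipeline exactly like LogisticRegression would.
pipeline = Pipeline(stages=[assembler, xgboost])

model = pipeline.fit(train_df)        # train_df: an existing Spark DataFrame
predictions = model.transform(test_df)
```

Note that this uses the Spark ML estimator pattern (`fit` on a DataFrame), not the scikit-learn `XGBClassifier().fit(X, y)` API from the question — the two are not interchangeable inside a Spark Pipeline.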
Make sure the XGBoost jars are on your PySpark jar path.
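One way to do that is to pass the jars (and the Python wrapper) when launching PySpark. The file names below are illustrative placeholders; use the actual artifacts for the XGBoost and wrapper versions you downloaded:

```shell
# Launch PySpark with the XGBoost jars and the Python wrapper on the path.
pyspark \
  --jars xgboost4j-0.72.jar,xgboost4j-spark-0.72.jar \
  --py-files sparkxgb.zip
```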