XGBoost Spark One Model Per Worker Integration

Question

I am trying to work through this notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1526931011080774/3624187670661048/6320440561800420/latest.html

I am using Spark 2.4.3 and xgboost 0.90.

I keep getting the error ValueError: bad input shape () when trying to execute the following:

features = inputTrainingDF.select("features").collect()
lables = inputTrainingDF.select("label").collect()

X = np.asarray(map(lambda v: v[0].toArray(), features))
Y = np.asarray(map(lambda v: v[0], lables))

xgbClassifier = xgb.XGBClassifier(max_depth=3, seed=18238, objective='binary:logistic')

model = xgbClassifier.fit(X, Y)
ValueError: bad input shape () 

def trainXGbModel(partitionKey, labelAndFeatures):
  X = np.asarray(map(lambda v: v[1].toArray(), labelAndFeatures))
  Y = np.asarray(map(lambda v: v[0], labelAndFeatures))
  xgbClassifier = xgb.XGBClassifier(max_depth=3, seed=18238, objective='binary:logistic' )
  model =  xgbClassifier.fit(X, Y)
  return [partitionKey, model]

xgbModels = inputTrainingDF\
.select("education", "label", "features")\
.rdd\
.map(lambda row: [row[0], [row[1], row[2]]])\
.groupByKey()\
.map(lambda v: trainXGbModel(v[0], list(v[1])))

xgbModels.take(1)
ValueError: bad input shape ()

You can see in the notebook that it works for whoever posted it. My guess is that the problem lies in the np.asarray() mapping for X and Y: the logic just maps the labels and features into the function, but the resulting shapes are empty. I got it working with this code:

pandasDF = inputTrainingDF.toPandas()
series = pandasDF['features'].apply(lambda x : np.array(x.toArray())).as_matrix().reshape(-1,1)
features = np.apply_along_axis(lambda x : x[0], 1, series)
target = pandasDF['label'].values
xgbClassifier = xgb.XGBClassifier(max_depth=3, seed=18238, objective='binary:logistic' )
model = xgbClassifier.fit(features, target)

However, I want to integrate this into the original function call and understand why the original notebook does not work. An extra set of eyes to troubleshoot this would be much appreciated!

Recommended Answer

You are probably using Python 3. In Python 3, map returns an iterator object rather than a list, so np.asarray wraps the iterator itself in a zero-dimensional object array, which is where the empty shape () comes from. To fix this example, change map(...) to list(map(...)):

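def trainXGbModel(partitionKey, labelAndFeatures):
  X = np.asarray(list(map(lambda v: v[1].toArray(), labelAndFeatures)))
  Y = np.asarray(list(map(lambda v: v[0], labelAndFeatures)))
  xgbClassifier = xgb.XGBClassifier(max_depth=3, seed=18238, objective='binary:logistic')
  model = xgbClassifier.fit(X, Y)
  return [partitionKey, model]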
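To see the difference in isolation, here is a minimal sketch that needs neither Spark nor xgboost. The label_and_features list is a hypothetical stand-in for the (label, features) pairs built in the question, with plain Python lists in place of Spark vectors.

```python
import numpy as np

# Hypothetical stand-in rows shaped like the (label, features) pairs in the question
label_and_features = [(0, [1.0, 2.0]), (1, [3.0, 4.0]), (0, [5.0, 6.0])]

# Python 3: map() returns a lazy iterator, so NumPy wraps the iterator object itself
lazy = np.asarray(map(lambda v: v[0], label_and_features))
print(lazy.shape, lazy.dtype)   # () object

# Materializing the iterator first gives NumPy a real sequence to convert
materialized = np.asarray(list(map(lambda v: v[0], label_and_features)))
print(materialized.shape)       # (3,)
```

The zero-dimensional array in the first case is what the classifier's input validation rejects as a bad input shape.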

Alternatively, you can use np.fromiter to convert an iterable to a NumPy array (it requires an explicit dtype).
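As a rough sketch of that alternative: np.fromiter fits the labels naturally because they are scalars, while the 2-D feature matrix is built here with np.stack instead, which is my substitution and not part of the answer. The sample data and the int64 dtype are assumptions for illustration; in the question each feature vector would come from v[1].toArray().

```python
import numpy as np

# Hypothetical stand-in for one group's (label, features) pairs
label_and_features = [
    (0, np.array([1.0, 2.0])),
    (1, np.array([3.0, 4.0])),
    (1, np.array([5.0, 6.0])),
]

# np.fromiter consumes the iterator directly; the dtype argument is required
Y = np.fromiter(map(lambda v: v[0], label_and_features), dtype=np.int64)

# Stack the per-row feature arrays into a single 2-D matrix
X = np.stack([v[1] for v in label_and_features])

print(Y.shape, X.shape)  # (3,) (3, 2)
```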
