XGBoost Spark One Model Per Worker Integration


Problem description


I am trying to work through this notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1526931011080774/3624187670661048/6320440561800420/latest.html

I am using Spark 2.4.3 and XGBoost 0.90.

I keep getting the error ValueError: bad input shape () when trying to execute:

features = inputTrainingDF.select("features").collect()
labels = inputTrainingDF.select("label").collect()

X = np.asarray(map(lambda v: v[0].toArray(), features))
Y = np.asarray(map(lambda v: v[0], labels))

xgbClassifier = xgb.XGBClassifier(max_depth=3, seed=18238, objective='binary:logistic')

model = xgbClassifier.fit(X, Y)
ValueError: bad input shape () 

and

def trainXGbModel(partitionKey, labelAndFeatures):
  X = np.asarray(map(lambda v: v[1].toArray(), labelAndFeatures))
  Y = np.asarray(map(lambda v: v[0], labelAndFeatures))
  xgbClassifier = xgb.XGBClassifier(max_depth=3, seed=18238, objective='binary:logistic' )
  model =  xgbClassifier.fit(X, Y)
  return [partitionKey, model]

xgbModels = inputTrainingDF\
.select("education", "label", "features")\
.rdd\
.map(lambda row: [row[0], [row[1], row[2]]])\
.groupByKey()\
.map(lambda v: trainXGbModel(v[0], list(v[1])))

xgbModels.take(1)
ValueError: bad input shape ()

The notebook shows this code working for whoever posted it. My guess is that the problem lies in the np.asarray() mapping for X and Y: the logic only maps the labels and features into arrays for the function, yet the resulting shapes are empty. I got it working with this code:

pandasDF = inputTrainingDF.toPandas()
series = pandasDF['features'].apply(lambda x : np.array(x.toArray())).as_matrix().reshape(-1,1)
features = np.apply_along_axis(lambda x : x[0], 1, series)
target = pandasDF['label'].values
xgbClassifier = xgb.XGBClassifier(max_depth=3, seed=18238, objective='binary:logistic' )
model = xgbClassifier.fit(features, target)

However, I want to integrate this into the original function call and understand why the original notebook code does not work. An extra set of eyes to troubleshoot this would be much appreciated!

Solution

You are probably using Python 3. In Python 3, map returns a lazy iterator rather than a list, so np.asarray wraps the iterator itself in a zero-dimensional object array, which is where the shape () in the error comes from. All you have to do to fix this example is change map(...) to list(map(...)):

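def trainXGbModel(partitionKey, labelAndFeatures):
  X = np.asarray(list(map(lambda v: v[1].toArray(), labelAndFeatures)))
  Y = np.asarray(list(map(lambda v: v[0], labelAndFeatures)))
  xgbClassifier = xgb.XGBClassifier(max_depth=3, seed=18238, objective='binary:logistic')
  model = xgbClassifier.fit(X, Y)
  return [partitionKey, model]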
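
To see the difference without Spark or XGBoost, here is a minimal sketch that assumes only NumPy on Python 3. The rows data is made up for illustration, with plain lists standing in for the Spark vectors (so there is no toArray() call):

```python
import numpy as np

# Hypothetical stand-in for the collected (label, features) pairs
rows = [(0, [1.0, 2.0]), (1, [3.0, 4.0]), (0, [5.0, 6.0])]

# Python 3: map() is a lazy iterator, so NumPy wraps it as one opaque object
wrapped = np.asarray(map(lambda v: v[1], rows))

# Materializing the iterator first gives NumPy real sequences to stack
X = np.asarray(list(map(lambda v: v[1], rows)))
Y = np.asarray([v[0] for v in rows])  # a list comprehension works the same way

print(wrapped.shape, wrapped.dtype)  # () object
print(X.shape, Y.shape)              # (3, 2) (3,)
```

The first array has shape (), which matches the shape reported in the ValueError.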

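Alternatively, you can use np.fromiter to convert an iterable into a NumPy array. It builds a one-dimensional array and requires an explicit dtype, so it fits the scalar labels in Y more naturally than the feature vectors in X.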
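
As a rough sketch of the np.fromiter route, assuming scalar integer labels (the rows data and the int64 dtype are illustrative choices, not taken from the notebook):

```python
import numpy as np

# Hypothetical stand-in for the collected (label, features) pairs
rows = [(0, [1.0, 2.0]), (1, [3.0, 4.0]), (0, [5.0, 6.0])]

# np.fromiter consumes the map() iterator directly; dtype is mandatory
Y = np.fromiter(map(lambda v: v[0], rows), dtype=np.int64)

# count is optional; when the length is known it lets NumPy preallocate
Y_counted = np.fromiter((v[0] for v in rows), dtype=np.int64, count=len(rows))

print(Y.shape, Y.tolist())  # (3,) [0, 1, 0]
```

For the two-dimensional feature matrix, the list(map(...)) form is the simpler choice.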
