Spark ML gradient boosted trees not using all nodes


Question

I'm using the Spark ML GBTClassifier in pyspark to train a binary classification model on a dataframe with ~400k rows and ~9k columns on an AWS EMR cluster. I'm comparing this against my current solution, which is running XGBoost on a huge EC2 that can fit the whole dataframe in memory.

My hope was that I could train (and score new observations) much faster in Spark because it would be distributed/parallel. However, when watching my cluster (through Ganglia) I see that only 3-4 nodes have active CPU while the rest just sit idle. In fact, from the looks of it, it may be using only one node for the actual training.

I can't seem to find anything in the documentation about a node limit or partitions or anything that seems relevant to why this is happening. Maybe I'm just misunderstanding the implementation of the algorithm, but I assumed it was implemented in such a way that training could be parallelized to take advantage of the EMR/cluster aspect of Spark. If not, is there any advantage to doing it this way vs. just doing it in memory on a single EC2? I guess you don't have to load the data into memory, but that's not really much of an advantage.

Here is some boilerplate of my code. Thanks for any ideas!

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Start Spark context:
sc = pyspark.SparkContext()
sqlContext = SparkSession.builder.enableHiveSupport().getOrCreate()

# load data
df = sqlContext.sql('SELECT label, features FROM full_table WHERE train = 1')
df.cache()
print("training data loaded: {} rows".format(df.count()))

test_df = sqlContext.sql('SELECT label, features FROM full_table WHERE train = 0')
test_df.cache()
print("test data loaded: {} rows".format(test_df.count()))


#Create evaluator
evaluator = BinaryClassificationEvaluator()
evaluator.setRawPredictionCol('prob')
evaluator.setLabelCol('label')

# train model
gbt = GBTClassifier(maxIter=100, 
                    maxDepth=3, 
                    stepSize=0.1,
                    labelCol="label", 
                    seed=42)

model = gbt.fit(df)


# get predictions
gbt_preds = model.transform(test_df)
gbt_preds.show(10)


# evaluate predictions
getprob = udf(lambda v: float(v[1]), DoubleType())
preds = gbt_preds.withColumn('prob', getprob('probability')) \
        .drop('features', 'rawPrediction', 'probability', 'prediction')
preds.show(10)

auc = evaluator.evaluate(preds)
print(auc)

Sidenote: the tables I am using are already vectorized. The model does run with this code; it's just slow (~10-15 minutes to train) and uses only 3-4 cores (or maybe just one).

Answer

Thanks for the clarifying comments.

Spark's implementation isn't necessarily faster than XGBoost. In fact, I would expect what you're seeing.

The biggest factor is that XGBoost was designed and written specifically with gradient boosted trees in mind. Spark, on the other hand, is far more general-purpose and most likely doesn't have the same kind of optimizations XGBoost has. See here for a comparison between XGBoost's and scikit-learn's implementations of the classifier algorithm. If you really want to get into the details, you can read the paper and even the code behind XGBoost's and Spark's implementations.

Remember, XGBoost is also parallel/distributed. It just uses multiple threads on the same machine. Spark helps you run the algorithm when the data doesn't fit on a single machine.

A couple of other minor points I can think of: a) Spark has a non-trivial startup time, and communication across machines also adds up; b) XGBoost is written in C++, which is generally great for numerical computation.

As for why Spark uses only 3-4 cores: that depends on your dataset size, how it is distributed across nodes, how many executors Spark launches, which stage takes the most time, memory configuration, etc. You can use the Spark UI to try to figure out what's going on. It's hard to say why it's happening for your dataset without looking at it.
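For reference, the executor layout is fixed at submit time. The flag names below are real `spark-submit` options, but the numbers are hypothetical sizing for an imaginary 10-node cluster, not a recommendation for this workload:

```shell
# Hypothetical sizing for illustration only -- tune to your own cluster.
spark-submit \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.default.parallelism=80 \
  train_gbt.py
```

If EMR's defaults gave you only a few executors, most nodes would sit idle exactly as you describe, regardless of the algorithm.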

Hope that helps.

I just found a great answer comparing execution times between a simple Spark application and a standalone Java application - https://stackoverflow.com/a/49241051/5509005. The same principles apply here, in fact even more so, since XGBoost is highly optimized.
