Spark ML gradient boosted trees not using all nodes


Question

I'm using the Spark ML GBTClassifier in pyspark to train a binary classification model on a dataframe with ~400k rows and ~9k columns on an AWS EMR cluster. I'm comparing this against my current solution, which is running XGBoost on a huge EC2 that can fit the whole dataframe in memory.

My hope was that I could train (and score new observations) much faster in Spark because it would be distributed/parallel. However, when watching my cluster (through Ganglia) I see that only 3-4 nodes have active CPU while the rest just sit idle. In fact, from the looks of it, it may be using only one node for the actual training.

I can't seem to find anything in the documentation about a node limit or partitions or anything that seems relevant to why this is happening. Maybe I'm just misunderstanding the implementation of the algorithm, but I assumed it was implemented in such a way that training could be parallelized to take advantage of the EMR/cluster aspect of Spark. If not, is there any advantage to doing it this way vs. just doing it in memory on a single EC2? I guess you don't have to load the data into memory, but that's not really much of an advantage.

Here is some boilerplate of my code. Thanks for any ideas!

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Start Spark context:
sc = pyspark.SparkContext()
sqlContext = SparkSession.builder.enableHiveSupport().getOrCreate()

# load data
df = sqlContext.sql('SELECT label, features FROM full_table WHERE train = 1')
df.cache()
print("training data loaded: {} rows".format(df.count()))

test_df = sqlContext.sql('SELECT label, features FROM full_table WHERE train = 0')
test_df.cache()
print("test data loaded: {} rows".format(test_df.count()))


#Create evaluator
evaluator = BinaryClassificationEvaluator()
evaluator.setRawPredictionCol('prob')
evaluator.setLabelCol('label')

# train model
gbt = GBTClassifier(maxIter=100, 
                    maxDepth=3, 
                    stepSize=0.1,
                    labelCol="label", 
                    seed=42)

model = gbt.fit(df)


# get predictions
gbt_preds = model.transform(test_df)
gbt_preds.show(10)


# evaluate predictions
getprob = udf(lambda v: float(v[1]), DoubleType())
preds = gbt_preds.withColumn('prob', getprob('probability')) \
        .drop('features', 'rawPrediction', 'probability', 'prediction')
preds.show(10)

auc = evaluator.evaluate(preds)
print(auc)

Sidenote: the tables I am using are already vectorized. The model does run with this code; it's just slow (~10-15 minutes to train) and uses only 3-4 cores (or maybe just one).

Answer

Thanks for the clarifying comments.

Spark's implementation isn't necessarily faster than XGBoost. In fact, I would expect what you're seeing.

The biggest factor is that XGBoost was designed and written specifically with gradient boosted trees in mind. Spark, on the other hand, is far more general-purpose and most likely doesn't have the same kind of optimizations XGBoost has. See here for a comparison between XGBoost's and scikit-learn's implementations of the classifier algorithm. If you really want to get into the details, you can read the paper and even the code behind XGBoost's and Spark's implementations.

Remember, XGBoost is also parallel/distributed. It just uses multiple threads on the same machine. Spark helps you run the algorithm when the data doesn't fit on a single machine.

A couple of other minor points I can think of: a) Spark has a non-trivial startup time, and communication across machines also adds up; b) XGBoost is written in C++, which is generally great for numerical computation.

As for why Spark uses only 3-4 cores: that depends on your dataset size, how it is distributed across nodes, how many executors Spark launches, which stage takes the most time, memory configuration, etc. You can use the Spark UI to try to figure out what's going on. It's hard to say why it's happening for your dataset without looking at it.
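For reference, the executor layout is fixed at submit time. The flag names below are real `spark-submit` options, but the numbers are hypothetical sizing for an imaginary 10-node cluster, not a recommendation for this workload:

```shell
# Hypothetical sizing for illustration only -- tune to your own cluster.
spark-submit \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.default.parallelism=80 \
  train_gbt.py
```

If EMR's defaults gave you only a few executors, most nodes would sit idle exactly as you describe, regardless of the algorithm.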

Hope that helps.

I just found a great answer comparing execution times between a simple Spark application and a standalone Java application - https://stackoverflow.com/a/49241051/5509005. The same principles apply here, in fact even more so, since XGBoost is highly optimized.
