Run ML algorithm inside a map function in Spark
Question
So I have been trying for some days now to run ML algorithms inside a map function in Spark. I posted a more specific question, but referencing Spark's ML algorithms gives me the following error:
AttributeError: Cannot load _jvm from SparkContext. Is SparkContext initialized?
Obviously I cannot reference SparkContext inside the apply_classifier function. My code is similar to what was suggested in the previous question I asked, but I still haven't found a solution:
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier

def apply_classifier(clf):
    dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3)
    if clf == 0:
        clf = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3)
    elif clf == 1:
        clf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=5)

classifiers = [0, 1]

sc.parallelize(classifiers).map(lambda x: apply_classifier(x)).collect()
I have tried using flatMap instead of map, but I get NoneType object is not iterable.
I would also like to pass a broadcasted dataset (which is a DataFrame) as a parameter to the apply_classifier function.

Finally, is it possible to do what I am trying to do? What are the alternatives?
Answer
is it possible to do what I am trying to do?
It is not. Apache Spark doesn't support any form of nesting, and distributed operations can be initialized only by the driver. This includes access to distributed data structures such as a Spark DataFrame.
What are the alternatives?
This depends on many factors, such as the size of the data, the amount of available resources, and the choice of algorithm. In general, you have three options:
- Use Spark only as a task-management tool to train local, non-distributed models. It seems you have already explored this path to some extent; for a more advanced implementation of this approach, you can check spark-sklearn.
  In general, this approach is particularly useful when the data is relatively small. Its advantage is that there is no competition between multiple jobs.
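A minimal sketch of this first pattern, using a plain Python stand-in for the training step (the `train_local` function and the parameter grid are hypothetical) so the scheduling logic is visible without a live cluster:

```python
# Sketch: use Spark only to schedule independent *local* model fits.
# With pyspark, the last line would instead be:
#   results = sc.parallelize(param_grid).map(train_local).collect()

def train_local(params):
    # Stand-in for fitting e.g. a local scikit-learn model on a worker;
    # real code would read the data from broadcast or shared storage.
    name, depth = params
    return (name, depth, "trained")

# Hypothetical hyperparameter grid: (model name, depth) pairs.
param_grid = [("dt", 3), ("dt", 5), ("rf", 10)]

# Local stand-in for parallelize(...).map(...).collect():
results = [train_local(p) for p in param_grid]
print(results)
```

Each grid entry is independent, which is exactly why Spark can treat every fit as a separate task.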
- Use standard multithreading tools to submit multiple independent jobs from a single context. You can use, for example, threading or joblib.
  While this approach is possible, I wouldn't recommend it in practice. Not all Spark components are thread-safe, so you have to be pretty careful to avoid unexpected behavior. It also gives you very little control over resource allocation.
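The second option can be sketched with a thread pool; `run_job` here is a hypothetical stand-in for a Spark action, since each submitted callable would normally trigger something like `df.filter(...).count()` against the shared context:

```python
# Sketch: submit several independent "jobs" from one driver thread pool.
# With a real SparkContext, each call would launch its own Spark action.
from concurrent.futures import ThreadPoolExecutor

def run_job(clf_id):
    # Stand-in for building classifier clf_id and running fit/collect.
    return f"job-{clf_id}: done"

with ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map preserves input order, so results line up with [0, 1].
    results = list(pool.map(run_job, [0, 1]))

print(results)  # -> ['job-0: done', 'job-1: done']
```

Note that the jobs then compete for the same executors, which is one reason the answer warns about the lack of control over resource allocation.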
- Parametrize your Spark application and use an external pipeline manager (Apache Airflow, Luigi, Toil) to submit your jobs.
  While this approach has some drawbacks (it requires saving data to persistent storage), it is also the most universal and robust, and it gives you a lot of control over resource allocation.
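The third option usually means the application accepts its settings on the command line, so a scheduler can launch one spark-submit per configuration. A minimal sketch of the parametrization side (the flag names are hypothetical):

```python
# Sketch: a parametrized entry point that an external pipeline manager
# (Airflow, Luigi, Toil) could invoke via spark-submit with different
# arguments, one run per classifier/parameter combination.
import argparse

def parse_args(argv=None):
    p = argparse.ArgumentParser(description="Train one classifier per run")
    p.add_argument("--classifier", choices=["dt", "rf"], required=True)
    p.add_argument("--max-depth", type=int, default=3)
    return p.parse_args(argv)

# Simulate one scheduled invocation:
args = parse_args(["--classifier", "dt", "--max-depth", "5"])
print(args.classifier, args.max_depth)  # dt 5
```

Each run would then write its fitted model or metrics to persistent storage, which is the drawback the answer mentions but also what makes the runs fully independent.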