Run ML algorithm inside map function in Spark


Question

So I have been trying for some days now to run ML algorithms inside a map function in Spark. I posted a more specific question but referencing Spark's ML algorithms gives me the following error:

AttributeError: Cannot load _jvm from SparkContext. Is SparkContext initialized?

Obviously I cannot reference SparkContext inside the apply_classifier function. My code is similar to what was suggested in the previous question I asked, but I still haven't found the solution I am looking for:

from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier

def apply_classifier(clf):
    # Fails on the executors: constructing a Spark ML estimator needs the
    # driver-side SparkContext / JVM gateway, which is not available here.
    if clf == 0:
        clf = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3)
    elif clf == 1:
        clf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=5)

classifiers = [0, 1]

sc.parallelize(classifiers).map(lambda x: apply_classifier(x)).collect()

I have tried using flatMap instead of map, but I get NoneType object is not iterable, since apply_classifier has no return statement: each call yields None, which flatMap then tries to iterate over.

I would also like to pass a broadcasted dataset (which is a DataFrame) as a parameter to the apply_classifier function. Finally, is it possible to do what I am trying to do? What are the alternatives?

Answer

is it possible to do what I am trying to do?

It is not. Apache Spark doesn't support any form of nesting, and distributed operations can be initialized only by the driver. This includes access to distributed data structures, like a Spark DataFrame.

What are the alternatives?

This depends on many factors like the size of the data, amount of available resources, and choice of algorithms. In general you have three options:

• Use Spark only as a task management tool to train local, non-distributed models. It seems that you have already explored this path to some extent; for a more advanced implementation of this approach, you can check spark-sklearn.

In general, this approach is particularly useful when the data is relatively small. Its advantage is that there is no competition between multiple jobs.
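
A minimal sketch of this first approach, assuming scikit-learn is installed on the executors and the training data is small enough to broadcast. The names data_bc and train_local, and the toy dataset, are illustrative, not from the original post:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Small dataset collected on the driver and broadcast to the workers
# (placeholder values, assumed for illustration).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
data_bc = sc.broadcast((X, y))

def train_local(clf_id):
    # Runs on an executor, but uses only local scikit-learn estimators,
    # so no SparkContext or JVM access is needed here.
    X, y = data_bc.value
    if clf_id == 0:
        clf = DecisionTreeClassifier(max_depth=3)
    else:
        clf = RandomForestClassifier(n_estimators=5)
    clf.fit(X, y)
    # Return something picklable, e.g. the id and the training accuracy.
    return (clf_id, clf.score(X, y))

results = sc.parallelize([0, 1], numSlices=2).map(train_local).collect()

This mirrors the apply_classifier function from the question, except that the estimators are plain scikit-learn objects, which is why it works inside map.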

• Use standard multithreading tools to submit multiple independent jobs from a single context. You can use, for example, threading or joblib.

While this approach is possible, I wouldn't recommend it in practice. Not all Spark components are thread-safe, and you have to be pretty careful to avoid unexpected behaviors. It also gives you very little control over resource allocation.
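
For completeness, a sketch of the multithreading option, with the caveats above. It assumes a prepared DataFrame train_df with indexedLabel and indexedFeatures columns already exists:

from concurrent.futures import ThreadPoolExecutor
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier

classifiers = [
    DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3),
    RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=5),
]

def fit_one(estimator):
    # Each fit() is invoked on the driver; Spark schedules the two
    # resulting training jobs concurrently on the cluster.
    return estimator.fit(train_df)

with ThreadPoolExecutor(max_workers=2) as pool:
    models = list(pool.map(fit_one, classifiers))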

• Parametrize your Spark application and use an external pipeline manager (Apache Airflow, Luigi, Toil) to submit your jobs.

While this approach has some drawbacks (it will require saving data to persistent storage), it is also the most universal and robust option, and it gives you a lot of control over resource allocation.
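
For the third option, the application can be parametrized on the command line and launched once per configuration by the scheduler. A sketch using argparse; the flag names, paths, and column names are placeholders:

# train_job.py -- parametrized Spark application submitted by an
# external scheduler (Airflow, Luigi, Toil) via spark-submit.
import argparse
from pyspark.sql import SparkSession
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier

parser = argparse.ArgumentParser()
parser.add_argument("--classifier", choices=["dt", "rf"], required=True)
parser.add_argument("--input", required=True)   # persisted training data
parser.add_argument("--output", required=True)  # where to save the model
args = parser.parse_args()

spark = SparkSession.builder.appName("train-" + args.classifier).getOrCreate()
train_df = spark.read.parquet(args.input)

if args.classifier == "dt":
    clf = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3)
else:
    clf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=5)

clf.fit(train_df).save(args.output)
spark.stop()

The scheduler would then invoke it as, for example, spark-submit train_job.py --classifier dt --input <data path> --output <model path>, once per configuration.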
