For "iterative algorithms," what is the advantage of converting to an RDD then back to a Dataframe


Question

I am reading High Performance Spark and the author makes the following claim:

While the Catalyst optimizer is quite powerful, one of the cases where it currently runs into challenges is with very large query plans. These query plans tend to be the result of iterative algorithms, like graph algorithms or machine learning algorithms. One simple workaround for this is converting the data to an RDD and back to DataFrame/Dataset at the end of each iteration, as shown in Example 3-58.

Example 3-58 is labeled "Round trip through RDD to cut query plan" and is reproduced below:

val rdd = df.rdd
rdd.cache()
sqlCtx.createDataFrame(rdd, df.schema)

Does anyone know what is the underlying reason that makes this workaround necessary?

For reference, a bug report has been filed for this issue and is available at the following link: https://issues.apache.org/jira/browse/SPARK-13346

There does not appear to be a fix, but the maintainers have closed the issue and do not seem to believe they need to address it.

Answer

From my understanding, the lineage keeps on growing in iterative algorithms, i.e.

step 1: read DF1, DF2

step 2: update DF1 based on DF2 value

step 3: read DF3

step 4: update DF1 based on DF3 value

... and so on ...

In this scenario, DF1's lineage keeps on growing, and unless it is truncated using DF1.rdd, it will crash the driver after 20 or so iterations.
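The pattern described above can be sketched as follows. This is a minimal sketch, not the book's code: it assumes a SQLContext `sqlCtx` is in scope, that `df1`, `df2`, and `df3` are existing DataFrames, and that `update` is a hypothetical per-iteration transformation.

import org.apache.spark.sql.DataFrame

// Round trip through an RDD to cut the accumulated query plan.
def truncatePlan(df: DataFrame): DataFrame = {
  val rdd = df.rdd
  rdd.cache() // keep the materialized rows so the cut-off lineage is not recomputed
  sqlCtx.createDataFrame(rdd, df.schema) // fresh DataFrame with an empty logical plan
}

// Hypothetical iterative loop: each update grows DF1's Catalyst plan,
// so the plan is truncated at the end of every iteration.
var current: DataFrame = df1
for (other <- Seq(df2, df3)) {
  current = update(current, other) // plan grows here
  current = truncatePlan(current)  // plan is reset here
}

Without the `truncatePlan` call, each iteration's transformation would be appended to one ever-growing logical plan, and the driver would eventually spend unbounded time and memory analyzing it.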
