How to force Spark to evaluate DataFrame operations inline

Question

According to the Spark RDD docs:

All transformations in Spark are lazy, in that they do not compute their results right away...This design enables Spark to run more efficiently.

There are times when I need to run certain operations on my DataFrames right here and now. But because DataFrame operations are "lazily evaluated" (per above), writing them in the code gives very little guarantee that Spark will actually execute them inline with the rest of the code. For example:

val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'

// Now we need to do a union RIGHT HERE AND NOW, because
// the next few lines of code require the union to have
// already taken place!
val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)

// Now do some stuff with 'unionDataFrame'...

So my workaround for this (so far) has been to run .show() or .count() immediately following my time-sensitive dataframe op, like so:

val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'

val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)
unionDataFrame.count()  // Forces the union to execute/compute

// Now do some stuff with 'unionDataFrame'...

...which forces Spark to execute the DataFrame operation right then and there, inline.

This feels awfully hacky/kludgy to me. So I ask: is there a more generally-accepted and/or efficient way to force dataframe ops to happen on-demand (and not be lazily evaluated)?

Recommended answer

No.

You have to call an action to force Spark to do actual work. Transformations won't trigger that effect, and that's one of the reasons to love Spark.
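The difference between a transformation and an action can be sketched as follows. This is only an illustration: the local `SparkSession`, the sample data and the `LazyUnion` object are assumptions and not part of the original question. On Spark 2.0 and later, `union` is equivalent to the `unionAll` used in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LazyUnion {
  // Builds the union lazily, then forces it with count(); returns the row count.
  def run(spark: SparkSession): Long = {
    import spark.implicits._

    // Hypothetical stand-ins for getSomehow() and getSomehowAlso().
    val someDataFrame: DataFrame = Seq(1, 2, 3).toDF("n")
    val someOtherDataFrame: DataFrame = Seq(4, 5).toDF("n")

    // Transformation: only records the union in the query plan, no job runs here.
    val unionDataFrame: DataFrame = someDataFrame.union(someOtherDataFrame)

    // Action: Spark now runs a job that actually evaluates the union.
    unionDataFrame.count()
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val spark = SparkSession.builder().master("local[*]").appName("lazy-union").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

Note that `count()` forces the computation but does not keep the result around; a later action on `unionDataFrame` evaluates the plan again.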

By the way, I am pretty sure that Spark knows very well when something must be done "right here and now", so you are probably focusing on the wrong point.

Can you just confirm that count() and show() are considered "actions"?

You can see some of Spark's action functions in the documentation, where count() is listed. show() is not, and I haven't used it before, but it feels like an action: how can you show the result without doing actual work? :)
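One way to check whether a call such as show() behaves like an action is to count how many rows a transformation has processed before and after the call. The sketch below uses a `LongAccumulator` for that. The `ShowIsAnAction` object, the accumulator name and the sample data are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ShowIsAnAction {
  // Returns (rows processed before show(), rows processed after show()).
  def run(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // Counts how often the map function below is really executed.
    val seen = spark.sparkContext.longAccumulator("rowsSeen")

    // Transformation: the function is recorded but not executed yet.
    val ds = Seq(1, 2, 3).toDS().map { n => seen.add(1L); n }
    val before = seen.sum

    // If show() is an action, it has to run the map to print the rows.
    ds.show()
    val after = seen.sum

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val spark = SparkSession.builder().master("local[*]").appName("show-is-an-action").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

The first value should be zero, because nothing has run when the transformation is defined, while the second should be greater than zero, which is consistent with show() being an action.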

Are you insinuating that Spark would automatically pick up on that, and do the union (just in time)?

Yes! :)

Spark remembers the transformations you have called, and when an action appears, it will perform them just in (the right) time!
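What Spark has "remembered" can be inspected with explain(), which prints the recorded plan without executing the query. The following is a minimal sketch; the `RememberedPlan` object, the column names and the sample data are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object RememberedPlan {
  // Chains several transformations, prints the plan, then runs one action.
  def run(spark: SparkSession): Long = {
    import spark.implicits._

    val a = Seq(1, 2, 3).toDF("n")
    val b = Seq(4, 5, 6).toDF("n")

    // Three chained transformations; none of them starts a job.
    val pipeline = a.union(b)
      .filter(col("n") % 2 === 0)
      .withColumn("doubled", col("n") * 2)

    // Prints the plans Spark has recorded so far, still without executing the query.
    pipeline.explain(true)

    // A single action executes the whole recorded chain in order.
    pipeline.count()
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val spark = SparkSession.builder().master("local[*]").appName("remembered-plan").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

Because the union comes before the filter in the recorded plan, the action sees the unioned data, with no extra call needed between the steps.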

Something to remember: Because of this policy, of doing actual work only when an action appears, you will not see a logical error you have in your transformation(s), until the action takes place!
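The deferred-error behaviour described above can be reproduced with a deliberately buggy transformation. In this sketch the `DeferredError` object, the `inverse` UDF and the sample data are hypothetical; the UDF divides by zero for one row.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
import scala.util.Try

object DeferredError {
  // Returns (defining the transformation failed, running the action failed).
  def run(spark: SparkSession): (Boolean, Boolean) = {
    import spark.implicits._

    val df = Seq(10, 0, 5).toDF("n")

    // Hypothetical buggy logic: integer division by zero for the row n = 0.
    val inverse = udf((n: Int) => 100 / n)

    // Defining the transformation succeeds although it is wrong for n = 0.
    val defined = Try(df.withColumn("inv", inverse(col("n"))))

    // The division-by-zero failure only surfaces once an action runs the UDF.
    val executed = defined.flatMap(t => Try(t.collect()))

    (defined.isFailure, executed.isFailure)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val spark = SparkSession.builder().master("local[*]").appName("deferred-error").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

In practice this means a stack trace may point at the line containing the action rather than at the line where the faulty transformation was written.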
