How to force Spark to evaluate DataFrame operations inline


Question

According to the Spark RDD docs:

All transformations in Spark are lazy, in that they do not compute their results right away... This design enables Spark to run more efficiently.

There are times when I need to perform certain operations on my DataFrames right here and now. But because DataFrame operations are "lazily evaluated" (per above), when I write these operations in the code, there is very little guarantee that Spark will actually execute them inline with the rest of the code. For example:

val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'

// Now we need to do a union RIGHT HERE AND NOW, because
// the next few lines of code require the union to have
// already taken place!
val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)

// Now do some stuff with 'unionDataFrame'...

So my workaround for this (so far) has been to run .show() or .count() immediately after my time-sensitive DataFrame operation, like so:

val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'

val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)
unionDataFrame.count()  // Forces the union to execute/compute

// Now do some stuff with 'unionDataFrame'...

...which forces Spark to execute the DataFrame operation right then and there, inline.

This feels awfully hacky/kludgy to me. So I ask: is there a more generally accepted and/or more efficient way to force DataFrame operations to happen on demand (and not be lazily evaluated)?

Recommended answer

You have to call an action to force Spark to do actual work. Transformations won't trigger that effect, and that's one of the reasons to love Spark.
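One caveat about the count() workaround: an action on its own runs a job and then discards the result, so later actions on the same DataFrame recompute the union. A commonly used idiom, not something the answer itself prescribes, is to pair the action with cache() so the computed rows are kept. The sketch below assumes Spark 2.1 or later running in local mode; the helper name `materialize` and the sample data are hypothetical, and `union` is the newer name (Spark 2.0+) for the `unionAll` call in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ForceEvaluation {
  // Hypothetical helper: marks `df` for caching, then runs an action
  // so that the cache is actually populated at this point in the code.
  def materialize(df: DataFrame): DataFrame = {
    val cached = df.cache() // lazy: only marks the plan as cacheable
    cached.count()          // action: runs the job and fills the cache
    cached
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("force-evaluation-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-ins for getSomehow() and getSomehowAlso() from the question.
    val someDataFrame = Seq(1, 2, 3).toDF("n")
    val someOtherDataFrame = Seq(4, 5).toDF("n")

    // The union has been computed and cached once this line returns.
    val unionDataFrame = materialize(someDataFrame.union(someOtherDataFrame))

    // Later actions read from the cache instead of recomputing the union.
    unionDataFrame.show()

    spark.stop()
  }
}
```

Whether this is worth doing depends on the job: caching costs memory, and if the union is only consumed once, letting Spark run it lazily as part of the final action is usually enough.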

By the way, I am pretty sure that Spark knows very well when something must be done "right here and now", so you are probably focusing on the wrong point.

Can you just confirm that count() and show() are considered "actions"?

You can see some of Spark's action functions in the documentation, where count() is listed. show() is not, and I haven't used it before, but it feels like an action: how can you show the result without doing actual work? :)

Are you insinuating that Spark would automatically pick up on that, and do the union (just in time)?

Yes! :)

Spark remembers the transformations you have called, and when an action appears, it will run them at just the right time!
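A small experiment can make this visible. The sketch below is an illustration under assumptions, not part of the original answer: it counts how many rows a map transformation has processed, using a Spark accumulator, and reads that counter once before and once after an action. The object name `LazyUntilAction` and the accumulator name are hypothetical, and the example assumes local mode with no task retries (retries can update an accumulator more than once).

```scala
import org.apache.spark.sql.SparkSession

object LazyUntilAction {
  // Returns (rows processed before the action, rows processed after it).
  def run(spark: SparkSession): (Long, Long) = {
    import spark.implicits._
    val seen = spark.sparkContext.longAccumulator("rowsSeen")

    // Transformation: recorded in the plan, not executed yet.
    val doubled = Seq(1, 2, 3).toDS().map { n => seen.add(1L); n * 2 }
    val before = seen.value.longValue // no job has run at this point

    doubled.collect() // action: the map function runs now
    val after = seen.value.longValue
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lazy-until-action-sketch")
      .getOrCreate()
    val (before, after) = run(spark)
    println(s"rows processed before action: $before, after action: $after")
    spark.stop()
  }
}
```

The counter stays at zero until collect() is called, which is the "remembering" described above: the map is only a recorded step in the plan until some action needs its output.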

Something to remember: because of this policy of doing actual work only when an action appears, you will not see a logical error in your transformation(s) until the action takes place!
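A runtime failure is the easiest kind of mistake to demonstrate here. In the following sketch (an illustration with hypothetical names and sample data, assuming local mode), a map transformation tries to parse a string that is not a number. Defining the transformation succeeds; the exception only appears when an action forces the parse to run.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

object ErrorsSurfaceLate {
  // Returns (did defining the transformation succeed?, did the action fail?).
  def run(spark: SparkSession): (Boolean, Boolean) = {
    import spark.implicits._

    // Defining the faulty transformation: nothing is parsed yet.
    val defined = Try(Seq("1", "2", "oops").toDS().map(_.toInt))

    // The action runs the parse, and only now can "oops" raise an error.
    val executed = defined.flatMap(ds => Try(ds.collect()))

    (defined.isSuccess, executed.isFailure)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("errors-surface-late-sketch")
      .getOrCreate()
    val (definedOk, actionFailed) = run(spark)
    println(s"transformation defined without error: $definedOk")
    println(s"action failed: $actionFailed")
    spark.stop()
  }
}
```

In practice this means the stack trace points at the action, which can be many lines away from the transformation that contains the mistake, so it helps to run a cheap action early while developing.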
