Is it efficient to cache a DataFrame in a single-action Spark application in which that DataFrame is referenced more than once?

Problem description

I am a little confused by Spark's caching mechanism.

Let's say I have a Spark application with only one action at the end of multiple transformations. Suppose I have a dataframe A and apply 2-3 transformations to it, creating multiple dataframes that eventually feed into a final dataframe, which is saved to disk.

Example:

val A = spark.read()  // large size
val B = A.map()
val C = A.map()
.
.
.
val D = B.join(C)
D.save()

So do I need to cache dataframe A for performance enhancement?

Thanks in advance.

Solution

Yes, you are correct.

You should cache A, as it is used as input for both B and C. The DAG visualization in the Spark UI would show the extent of reuse or of going back to the source (as in this case). If you have a noisy cluster, some spilling to disk could occur.
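As a minimal sketch of that advice with the DataFrame API: the object name, the column names, and the use of `spark.range` as a stand-in for the large input are all assumptions made for illustration, and `count()` stands in for the `save()` in the question. It persists A before deriving B and C, runs the single action, then releases the cache.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

object CacheSharedInput {
  // Hypothetical helper: builds A, derives B and C from it, joins them,
  // and runs one action. `rows` is an illustrative stand-in for the input size.
  def joinWithCachedInput(spark: SparkSession, rows: Long): Long = {
    // A: persisted because two branches of the lineage read it.
    // MEMORY_AND_DISK lets partitions spill to disk if memory is tight.
    val a = spark.range(rows).toDF("id").persist(StorageLevel.MEMORY_AND_DISK)

    val b = a.withColumn("doubled", col("id") * 2)          // first use of A
    val c = a.withColumn("squared", col("id") * col("id"))  // second use of A
    val d = b.join(c, "id")

    // With A persisted, the physical plan is expected to scan the cached
    // relation on both sides of the join instead of the original source.
    d.explain()

    val n = d.count() // the single action; the cache is filled during this job
    a.unpersist()     // release the cached blocks once the job is done
    n
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-shared-input")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()
    println(joinWithCachedInput(spark, 100000L))
    spark.stop()
  }
}
```

Whether this pays off depends on how expensive it is to recompute A compared with the cost of storing it, so comparing the plan and the run time with and without `persist` on real data is worthwhile.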

See also the top answer to "(Why) do we need to call cache or persist on a RDD".

However, I was initially looking for skipped stages in the UI, silly me; the difference shows up elsewhere, as described below.

The following code is akin to your own:

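// Uncomment .cache on the next line to compare runs with and without caching
val aa = spark.sparkContext.textFile("/FileStore/tables/filter_words.txt") //.cache
val a = aa.flatMap(x => x.split(" ")).map(_.trim)
val b = a.map(x => (x, 1))
val c = a.map(x => (x, 2))
val d = b.join(c)
d.count  // the single action that triggers the whole lineage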

Looking at the Spark UI with .cache: [screenshot not preserved]

and without .cache: [screenshot not preserved]

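QED: so .cache has a benefit here. It would not make sense otherwise. Also, without caching the source is read twice, and the two reads could lead to different results in some cases (for example, if the underlying data changes between them).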
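A rough way to observe the same effect without the UI is to count how often the shared parent RDD is actually computed. The sketch below is not the code from the answer: the object name, the helper `sourceEvaluations`, and the in-memory input from `parallelize` are assumptions for illustration. It uses a `LongAccumulator` inside the shared `map`, so the counter grows each time an element of the parent is recomputed.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object CountSourceEvaluations {
  // Hypothetical helper: returns how many elements of the shared parent RDD
  // were computed while running a single join-and-count job.
  def sourceEvaluations(sc: SparkContext, useCache: Boolean): Long = {
    val evaluated = sc.longAccumulator("evaluated")

    // Shared parent: every element that gets computed bumps the accumulator.
    val source = sc.parallelize(1 to 100, 4).map { x => evaluated.add(1L); x }
    val a = if (useCache) source.cache() else source

    val b = a.map(x => (x, 1)) // first branch reading a
    val c = a.map(x => (x, 2)) // second branch reading a
    b.join(c).count()          // the single action

    a.unpersist()
    // Note: accumulator updates made inside transformations can be
    // over-counted when tasks are retried, so treat this as indicative only.
    evaluated.value
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("count-source-evaluations")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext
    println(s"without cache: ${sourceEvaluations(sc, useCache = false)}")
    println(s"with cache:    ${sourceEvaluations(sc, useCache = true)}")
    spark.stop()
  }
}
```

Without caching, each branch of the join recomputes the parent, so the counter should come out higher than in the cached run. On a real cluster the cached figure can still exceed the input size if the same partition is computed on more than one executor before it is stored.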
