Persist slower than non-persist calls

Problem Description

My settings are: Spark 2.1 on a 3-node YARN cluster with 160 GB of memory and 48 vcores. Dynamic allocation is turned on. spark.executor.memory=6G, spark.executor.cores=6.

First, I am reading Hive tables: orders (329 MB) and lineitems (1.43 GB) and doing a left outer join. Next, I apply 7 different filter conditions based on the joined dataset (something like var line1 = joinedDf.filter("l_linenumber=1"), var line2 = joinedDf.filter("l_linenumber=2"), etc.). Because I'm filtering on the joined dataset multiple times, I thought doing a persist (MEMORY_ONLY) would help here, as the joined dataset fits fully in memory.
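
For reference, here is a minimal sketch of the pipeline being described. The use of spark.table and the join keys (orderkey/l_orderkey) are assumptions for illustration; the question only shows the filter calls:

```scala
import org.apache.spark.storage.StorageLevel

// Hive tables mentioned in the question.
val orders    = spark.table("orders")      // ~329 MB
val lineitems = spark.table("lineitems")   // ~1.43 GB

// Left outer join; the join condition here is a placeholder.
val joinedDf = orders.join(lineitems,
  orders("orderkey") === lineitems("l_orderkey"), "left_outer")

joinedDf.persist(StorageLevel.MEMORY_ONLY)

// Seven separate filters over the same joined dataset.
var line1 = joinedDf.filter("l_linenumber = 1")
var line2 = joinedDf.filter("l_linenumber = 2")
// ... and so on up to l_linenumber = 7
```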

  1. I noticed that with persist, the Spark application takes longer to run than without persist (3.5 minutes vs. 3.3 minutes). With persist, the DAG shows that a single stage was created for the persist, and other downstream jobs are waiting for it to complete. Does that mean persist is a blocking call? Or do stages in other jobs start processing as soon as persisted blocks become available?

  2. In the non-persist case, different jobs create different stages to read the same data. The data is read multiple times in different stages, but this still turns out to be faster than the persist case.

  3. With larger datasets, persist actually causes executors to run out of memory (Java heap space). Without persist, the Spark jobs complete just fine. I looked at some other suggestions here: Spark java.lang.OutOfMemoryError: Java heap space. I tried increasing/decreasing executor cores, persisting with disk only, increasing partitions, and modifying the storage ratio (roughly as sketched below), but nothing seems to help with the executor memory issues.
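
For concreteness, the knobs mentioned in point 3 map roughly to the following settings and calls (a sketch with placeholder values, not tuning recommendations; each was presumably a separate attempt, not all applied together):

```scala
import org.apache.spark.storage.StorageLevel

// "Persisting with disk only": keep cached blocks off the Java heap.
joinedDf.persist(StorageLevel.DISK_ONLY)

// "Increasing partitions": more, smaller partitions mean smaller cached blocks.
val finerDf = joinedDf.repartition(400)

// "Modifying the storage ratio" and executor cores are startup settings, e.g.:
//   spark-submit --conf spark.memory.storageFraction=0.6 \
//                --conf spark.executor.cores=4 ...
```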

I would appreciate it if someone could explain how persist works, in what cases it is faster than not persisting, and, more importantly, how to go about troubleshooting the out-of-memory issues.

Recommended Answer

I'd recommend reading up on the difference between transformations and actions in Spark. I must admit that I've been bitten by this myself on multiple occasions.

Data in Spark is evaluated lazily, which essentially means nothing happens until an "action" is performed. The .filter() function is a transformation, so nothing actually happens when your code reaches that point, except to add a step to the transformation pipeline. A call to .persist() behaves in the same way.
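
A small, self-contained illustration of this laziness (the data here is hypothetical; in spark-shell, spark is the predefined SparkSession):

```scala
import org.apache.spark.storage.StorageLevel

val df = spark.range(1000000L).toDF("id")

val filtered = df.filter("id % 7 = 0")       // transformation: nothing runs yet
filtered.persist(StorageLevel.MEMORY_ONLY)   // also lazy: just marks the plan

filtered.count()                             // action: only now is the data
                                             // computed and the cache filled
```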

If your code downstream of the .persist() call has multiple actions that can be triggered simultaneously, then it's quite likely that you are actually "persisting" the data for each action separately and eating up memory (the "Storage" tab in the Spark UI will tell you the % cached of the dataset; if it's more than 100% cached, then you are seeing what I describe here). Worse, you may never actually be using the cached data.

Generally, if you have a point in your code where the dataset forks into two separate transformation pipelines (each of the separate .filter()s in your example), a .persist() is a good idea to prevent multiple reads of your data source, and/or to save the result of an expensive transformation pipeline before the fork.

Many times it's a good idea to trigger a single action right after the .persist() call (before the data forks) to ensure that later actions (which may run simultaneously) read from the persisted cache, rather than evaluate (and uselessly cache) the data independently.

TL;DR:

Do a joinedDF.count() after your .persist(), but before your .filter()s.
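
Applied to the question's pipeline, that looks roughly like this (same assumed names as in the sketch above):

```scala
import org.apache.spark.storage.StorageLevel

joinedDF.persist(StorageLevel.MEMORY_ONLY)

// Materialize the cache ONCE, before the pipeline forks into the seven filters.
joinedDF.count()

// These now read from the cached data instead of re-running the join each time.
val line1 = joinedDF.filter("l_linenumber = 1")
val line2 = joinedDF.filter("l_linenumber = 2")
// ... and so on for the remaining filters
```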
