Persist slower than non-persist calls

Views: 91 Published: 2021/4/8 20:32:30 Tags: apache-spark-sql spark-dataframe


Problem description

My setup: Spark 2.1 on a 3-node YARN cluster with 160 GB and 48 vcores. Dynamic allocation is turned on, with spark.executor.memory=6G and spark.executor.cores=6.

First, I read two Hive tables, orders (329 MB) and lineitems (1.43 GB), and do a left outer join. Next, I apply 7 different filter conditions to the joined dataset (something like var line1 = joinedDf.filter("l_linenumber = 1"), var line2 = joinedDf.filter("l_linenumber = 2"), etc.). Because I filter the joined dataset multiple times, I thought a persist (MEMORY_ONLY) would help here, as the joined dataset fits fully in memory.
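For reference, the pipeline described above can be sketched roughly as follows. This is only an illustration under assumptions: the tiny in-memory DataFrames stand in for the Hive tables, the column names (o_orderkey, o_orderstatus, l_orderkey, l_linenumber) and the local master are hypothetical, and only two of the seven filters are shown.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Joins the two tables, persists the result, and counts two filtered views.
  def run(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // Hypothetical stand-ins for the Hive tables `orders` and `lineitems`.
    val orders = Seq((1L, "O"), (2L, "F"), (3L, "P"))
      .toDF("o_orderkey", "o_orderstatus")
    val lineitems = Seq((1L, 1), (1L, 2), (2L, 1))
      .toDF("l_orderkey", "l_linenumber")

    // Left outer join, marked for caching in memory only.
    val joinedDf: DataFrame = orders
      .join(lineitems, $"o_orderkey" === $"l_orderkey", "left_outer")
      .persist(StorageLevel.MEMORY_ONLY)

    // Each filter is a separate view over the same joined dataset.
    val line1 = joinedDf.filter("l_linenumber = 1")
    val line2 = joinedDf.filter("l_linenumber = 2")

    // Each count() is an action and therefore triggers its own job.
    val counts = (line1.count(), line2.count())
    joinedDf.unpersist()
    counts
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("persist-sketch")
      .master("local[*]") // assumption: local run instead of YARN
      .getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

In the real job the two Seq(...).toDF(...) lines would be replaced by reads of the Hive tables, and the remaining filters would follow the same pattern.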

I noticed that with persist, the Spark application takes longer to run than without persist (3.5 mins vs 3.3 mins). With persist, the DAG shows that a single stage was created for the persist, and the other downstream jobs wait for it to complete. Does that mean persist is a blocking call? Or do stages in other jobs start processing as soon as persisted blocks become available?
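One thing that can be checked directly is when the caching work happens. persist() only marks the dataset with a storage level and does not run a job by itself; the cache is filled by the first action that computes the dataset, and later actions can read the cached partitions. The sketch below, with a hypothetical generated dataset and a hypothetical timed helper, prints how long the first and second actions take. It does not settle how stages are scheduled across jobs on a cluster.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object LazyPersistSketch {
  // Runs a block and returns its result with the elapsed wall-clock milliseconds.
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def run(spark: SparkSession): (StorageLevel, Long, Long) = {
    // Hypothetical dataset standing in for the joined data.
    val df = spark.range(0L, 1000000L).toDF("id")

    // Marks the dataset for caching; no job is run at this point.
    val cached = df.persist(StorageLevel.MEMORY_ONLY)
    val level = cached.storageLevel

    // The first action computes the data and populates the cache.
    val (evens, firstMs) = timed(cached.filter("id % 2 = 0").count())
    // The second action can read the partitions cached by the first one.
    val (odds, secondMs) = timed(cached.filter("id % 2 = 1").count())

    println(s"first action: $firstMs ms, second action: $secondMs ms")
    cached.unpersist()
    (level, evens, odds)
  }
}
```

Comparing the two timings on the real data, together with the Storage tab of the Spark UI, shows how much of the total runtime is spent building the cache.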

In the non-persist case, different jobs create different stages to read the same data. The data is read multiple times in different stages, but this still turns out to be faster than the persist case.

With larger data sets, persist actually causes executors to run out of memory (Java heap space). Without persist, the Spark jobs complete just fine. I looked at some other suggestions here: Spark java.lang.OutOfMemoryError: Java heap space. I tried increasing/decreasing executor cores, persisting with disk only, increasing partitions, and modifying the storage ratio, but nothing seems to help with the executor memory issues.
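Two of the variations mentioned above, persisting to disk only and increasing the number of partitions, might look like the sketch below. The helper name persistWith and the partition count are hypothetical. The "storage ratio" is assumed here to mean a memory setting such as spark.memory.storageFraction, which is set at submit time and is not shown in the code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

object PersistVariantsSketch {
  // Repartitions the dataset and marks it for caching at the given storage level.
  // Hypothetical helper; numPartitions and level are values to experiment with.
  def persistWith(df: DataFrame, numPartitions: Int, level: StorageLevel): DataFrame =
    df.repartition(numPartitions).persist(level)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("persist-variants-sketch")
      .master("local[*]") // assumption: local run instead of YARN
      .getOrCreate()

    // Hypothetical dataset standing in for the joined data.
    val joinedDf = spark.range(0L, 1000L).toDF("id")

    // Disk-only caching with more, smaller partitions.
    val onDisk = persistWith(joinedDf, 8, StorageLevel.DISK_ONLY)
    println(onDisk.storageLevel)
    println(onDisk.count())

    onDisk.unpersist()
    spark.stop()
  }
}
```

Other levels such as StorageLevel.MEMORY_AND_DISK or StorageLevel.MEMORY_ONLY_SER can be passed in the same way to compare their memory behavior.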

I would appreciate it if someone could explain how persist works, in what cases it is faster than not persisting and, more importantly, how to go about troubleshooting out-of-memory issues.
