PySpark Dataframe Performance Tuning

Question

I am trying to consolidate some scripts so that we read the database once, rather than having every script read the same data from Hive. In other words, I'm moving to a read-once, process-many model.

I've persisted the dataframes and repartitioned the output after each aggregation, but I need it to be faster; if anything, those changes have slowed it down. We have 20 TB+ of data per day, so I had assumed that persisting the data, if it is going to be read many times, would make things faster, but it hasn't.

Also, I have lots of jobs that run off the same data, like below. Can we run them in parallel? Can the definition and output of DF2 happen at the same time as the definition of DF3 to help speed things up?

df = definedf....persist()
df2 = df.groupby....
df3 = df.groupby....
....
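
As an illustration of the kind of concurrency being asked about (not part of the original question's code): a minimal sketch, assuming df2 and df3 are independent aggregations of the persisted df and that the output paths below are made-up placeholders. Actions submitted from separate Python threads are scheduled as separate, concurrent Spark jobs, so the two writes can overlap on the cluster.

from concurrent.futures import ThreadPoolExecutor

def write_df2():
    # any action (write, count, ...) triggers the job; the path is a placeholder
    df2.write.mode("overwrite").parquet("/tmp/df2_out")

def write_df3():
    df3.write.mode("overwrite").parquet("/tmp/df3_out")

# Submitting each action from its own thread lets the Spark scheduler
# run both jobs at the same time against the shared, persisted df.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(write_df2), pool.submit(write_df3)]
    for f in futures:
        f.result()  # surface any exception raised inside a job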

Is it possible to define a globally cached dataframe that other scripts can call on?

Many thanks!

Answer

Persisting your DF doesn't guarantee that it actually gets persisted; that depends on the storage memory fraction you have on your worker nodes. If you just call .persist(), Spark by default uses the MEMORY_ONLY storage level, which caches only as much of the DataFrame as fits in your storage memory fraction; the rest is recomputed every time you use it (i.e. perform any action on it).
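
For illustration, a minimal sketch of an alternative to the default behaviour described above (not from the original answer; the table name is a placeholder): an explicit MEMORY_AND_DISK storage level spills partitions that don't fit in the storage fraction to local disk instead of recomputing them on every action.

from pyspark import StorageLevel

df = spark.table("some_hive_table")        # placeholder Hive source
# MEMORY_AND_DISK keeps what fits in memory and spills the rest to disk,
# avoiding the recomputation MEMORY_ONLY does for uncached partitions.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()                                 # run an action once to materialise the cache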

I would suggest increasing the memory on your worker nodes; if you don't perform any intensive computation, you can reduce the execution memory instead. The JVM also spends a lot of time serialising and deserialising, so with this much data you can use off-heap memory (disabled by default) by setting the spark.memory.offHeap.enabled property; off-heap storage uses the Spark Tungsten format to store data efficiently.
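
A sketch of the off-heap settings mentioned above, assuming they are applied when the session is built; the 16g size is an arbitrary example, and spark.memory.offHeap.size must be set to a positive value whenever off-heap memory is enabled.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-once-process-many")
    # enable Tungsten off-heap storage; the size is required and cluster-specific
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "16g")
    .getOrCreate()
)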
